Information Extraction Based on Web Page Structure Mining (基于网页结构挖掘的信息提取) (cited by: 2)

Extracting Information by Mining Structures of Web Pages


Abstract: To simplify the task of obtaining information from the vast number of sources available on the WWW, this paper proposes two fine-grained information extraction methods that work by mining the structure of Web pages. It first describes the principles of the two methods, then compares their advantages and disadvantages, and finally reports performance tests of their concrete implementations and analyzes the results.

Authors: Li Yuan (李媛), Geng Hua (耿桦), Zhang Meng (张甍), Pan Jingui (潘金贵)

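Affiliation: State Key Laboratory for Novel Software Technology, Nanjing University

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2006, No. 3, pp. 191-193, 218 (4 pages)

Keywords: Information extraction, Mining structures of Web pages, Repeated pattern, Time characteristic, RSS

Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]; TP311.5 [Automation and Computer Technology: Computer Software and Theory]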



