
Extracting Information by Mining Structures of Web Pages (cited by: 2)
Abstract: To simplify the task of obtaining information from the vast number of sources available on the WWW, we have developed two fine-grained information extraction methods. This paper first describes the principles of the two methods, both of which work by mining the structure of Web pages, and then compares their advantages and disadvantages. Finally, we test the performance of concrete implementations of the two methods and analyze the experimental results.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2006, No. 3, pp. 191-193, 218 (4 pages)
Keywords: information extraction; mining structures of Web pages; repeated pattern; time characteristic; RSS

References (8)

  • 1. Ashish A, Knoblock C. Wrapper generation for semi-structured Internet sources[J]. SIGMOD Record, 1997, 26(4): 8-15.
  • 2. Cai Deng, Yu Shipeng, Wen Ji-Rong, et al. Extracting Content Structure for Web Pages Based on Visual Representation. In: Fifth Asia Pacific Web Conf. (APWeb2003), 2003.
  • 3. Cai Deng, Yu Shipeng, Wen Ji-Rong, et al. VIPS: a Vision-based Page Segmentation Algorithm. Microsoft Technical Report (MSR-TR-2003-79), 2003.
  • 4. http://www.w3.org/TR/REC-html40/
  • 5. http://www.w3.org/DOM/
  • 6. Han Jiawei, Pei Jian, Yin Yiwen. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery, 2004, 8: 53-87.
  • 7. Agarwal R, Aggarwal C, Prasad V V V. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 2001, 61(3): 350-371.
  • 8. Yu Shipeng. Improving pseudo-relevance feedback in Web information retrieval using web page segmentation. Trip Report WWW2003, Budapest, Hungary, 2003.

