
Content Extraction of Web Page Based on Extended Label Tree (基于扩展标记树的网页正文抽取)

Cited by: 2
Abstract: This paper presents a web page content extraction method based on an extended label tree. Building the extended label tree cleans the page, completes the auxiliary information needed for extraction, and assigns each node a coordinate that records its position. Text nodes that make up the body content serve as markers of the content region: the set of neighboring text nodes with the largest text coverage is selected and then revised to form the final content region. A neighbor-first traversal algorithm locates the title node and extracts additional attributes. Experimental results show that the method extracts ordinary article-style web pages with high precision and adapts well to different pages.
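Author: Xia Tian (夏天)
Source: Journal of Guangxi Normal University: Natural Science Edition (《广西师范大学学报(自然科学版)》), 2011, No. 1, pp. 133-137 (5 pages). Indexed in: CAS; PKU Core Journals (北大核心)
Funding: National Natural Science Foundation of China (09CTQ027); Key Project of Science and Technology Research, Ministry of Education (109005); Renmin University of China Scientific Research Fund (22382078)
Keywords: web page content extraction; extended label tree; neighbor first traversal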
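
The abstract's central idea, treating text nodes as markers and keeping the group of neighboring text nodes with the largest text coverage, can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not specify the extended label tree, the coordinate scheme, the revision step, or the neighbor-first traversal used for the title. The sketch below assumes that a node coordinate is the path of child indices from the root, that "neighbors" are text nodes sharing the same nearest container element, and that "coverage" is the total character count. The names `TextNodeCollector`, `extract_content`, `SKIP`, `VOID`, and `CONTAINERS` are hypothetical.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "noscript"}                # dropped as page "cleaning" (assumption)
VOID = {"br", "img", "hr", "input", "meta", "link"}   # tags that never get a closing tag
CONTAINERS = {"div", "td", "article", "section", "body"}  # assumed region boundaries


class TextNodeCollector(HTMLParser):
    """Collect (container_coordinate, text) pairs.

    A coordinate is the tuple of child indices leading from the root to an
    element. This is an illustrative scheme, not the one used in the paper.
    """

    def __init__(self):
        super().__init__()
        self.path = []        # child-index path of the currently open element
        self.tags = []        # names of the currently open elements
        self.counts = [0]     # children seen so far at each depth
        self.skip_depth = 0   # > 0 while inside a SKIP element
        self.text_nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.counts[-1] += 1
            return
        self.path.append(self.counts[-1])
        self.counts[-1] += 1
        self.counts.append(0)
        self.tags.append(tag)
        if tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.tags:
            return  # ignore stray closing tags
        # Pop until the matching element, tolerating unclosed children.
        while self.tags:
            closed = self.tags.pop()
            self.path.pop()
            self.counts.pop()
            if closed in SKIP:
                self.skip_depth -= 1
            if closed == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        container = ()  # fallback when no container element is open
        for i in range(len(self.tags) - 1, -1, -1):
            if self.tags[i] in CONTAINERS:
                container = tuple(self.path[: i + 1])
                break
        self.text_nodes.append((container, text))


def extract_content(html):
    """Return the text of the container whose text nodes cover the most characters."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    groups = {}
    for container, text in parser.text_nodes:
        groups.setdefault(container, []).append(text)
    if not groups:
        return ""
    best = max(groups.values(), key=lambda texts: sum(len(t) for t in texts))
    return "\n".join(best)
```

Calling `extract_content` on a page with a navigation bar, an article block, and a footer returns only the article block's text, because that container holds the most characters. Pages that nest each paragraph in its own container would defeat this simple grouping, which is presumably the kind of case the revision step described in the abstract is meant to handle.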

