
Precise Content Extraction from News Web Page Based on Decisions of Two Layers (cited by: 16)
Abstract: This paper proposes a precise content extraction algorithm for news web pages based on decisions of two layers. The first layer is a global decision on the region of the page in which the content lies; the second layer is a local decision on whether each paragraph within that region is actually content. We first give a strict definition of news web page content oriented to practical applications, then analyze the characteristics of news web pages and their content. Based on this analysis, we propose a two-layer content extraction strategy and model both decisions using feature vector extraction and decision tree learning. Model training and test experiments were carried out on a corpus of 1,687 news pages collected from 10 major news web sites in China. The results show that the method extracts the content of news web pages precisely: only 18.14% of pages have extracted content that is not fully identical to the manual annotation, a relative decrease of 29.85% compared with using the local content decision alone, and only 7.11% of pages have an extraction error rate above 10%, which meets the needs of practical applications.
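The two-layer strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Block` type, the three features (`length`, `punct`, `link_ratio`), and every threshold are hypothetical, and the hand-written rules merely stand in for the decision trees that the paper learns from annotated pages.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Block:
    text: str        # visible text of one paragraph-like block
    link_chars: int  # characters inside hyperlinks (hypothetical feature)


def features(b: Block) -> Dict[str, float]:
    """Build a small, hypothetical feature vector for one block."""
    n = len(b.text)
    punct = sum(b.text.count(c) for c in ",.;!?，。；！？")
    return {"length": n, "punct": punct,
            "link_ratio": b.link_chars / n if n else 1.0}


def looks_like_content(f: Dict[str, float]) -> bool:
    """Layer 2 (local): hand-written stand-in for a learned decision tree."""
    if f["link_ratio"] > 0.5:   # mostly link text: navigation or related links
        return False
    if f["length"] >= 40:       # long running text
        return True
    return f["punct"] >= 2      # short, but punctuated like a sentence


def global_scope(blocks: List[Block]) -> Optional[Tuple[int, int]]:
    """Layer 1 (global): span from the first to the last strong content block."""
    strong = [i for i, b in enumerate(blocks)
              if len(b.text) >= 40 and features(b)["link_ratio"] <= 0.5]
    return (strong[0], strong[-1]) if strong else None


def extract(blocks: List[Block]) -> List[str]:
    """Apply the global decision first, then the local one inside its scope."""
    scope = global_scope(blocks)
    if scope is None:
        return []
    start, end = scope
    return [b.text for b in blocks[start:end + 1]
            if looks_like_content(features(b))]


# Hypothetical page: navigation, three body paragraphs, a link block, a footer.
SAMPLE = [
    Block("Home | News | Sports", 16),
    Block("The city council approved the new transit budget on Monday "
          "after a long debate.", 0),
    Block("Related: council votes on parks", 22),
    Block("Critics, however, disagreed.", 0),
    Block("Construction is expected to begin next spring and to finish "
          "within two years.", 0),
    Block("Copyright 2006, Example News.", 0),
]

if __name__ == "__main__":
    for paragraph in extract(SAMPLE):
        print(paragraph)
```

In this sample the footer passes the local rule on its own, because it is short but punctuated like a sentence. It is excluded only because it falls outside the scope chosen by the global layer. This mirrors the abstract's comparison between the two-layer method and the local decision used alone.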
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2006, No. 6, pp. 1-9, 103 (10 pages in total)
Funding: National Natural Science Foundation of China (69975018)
Keywords: computer application; Chinese information processing; information extraction; feature vector; decision tree; content extraction

