
A Statistical Way to Extract Full Text from Chinese Web Pages (一种基于统计的中文网页正文抽取方法)

Cited by: 3
Abstract: To address the shortcomings of traditional methods for extracting the main text of Chinese web pages, this paper proposes a statistical extraction method. The method first uses the DOM tree to compute the text density of each text node, defined as the ratio of the text length to the length of the HTML source. It then applies the Bayesian discriminant criterion to compute a density threshold. Finally, it extracts the main text by comparing each node's text density with that threshold: a node whose density is greater than the threshold is classified as a main-text node, and a node whose density is less than or equal to the threshold is classified as a non-main-text node. Concatenating the text of all nodes classified as main text yields the extracted page text. The method was validated on Chinese news web pages. The results show that, although simple, it achieves very high extraction accuracy and is easy to implement.
Author: Qian Aibing (钱爱兵)
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core list), 2009, No. 2, pp. 187-194 (8 pages)
Keywords: text density; text node; full-text extraction; Bayesian criteria; DOM tree
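
The procedure described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the author's implementation. The abstract does not say exactly which span of HTML source is measured for each text node, nor which form the Bayesian criterion takes, so the sketch makes two assumptions: a node's source is the serialized markup of its enclosing element, and the threshold is the decision boundary between two Gaussian classes with a shared variance. The names `text_densities`, `bayes_threshold`, `extract_body`, and `SAMPLE_PAGE` are hypothetical, and the input is assumed to be well-formed XHTML.

```python
import math
import statistics
import xml.etree.ElementTree as ET

# Hypothetical sample page: one sentence of body text and one navigation link.
SAMPLE_PAGE = (
    "<div><p>今天天气很好,适合出门散步和读书。</p>"
    '<a href="/very/long/navigation/link.html">首页</a></div>'
)


def text_densities(xhtml):
    """Return (text, density) for every element that starts with text.

    Assumption: density = len(text) / len(serialized enclosing element).
    Text that follows a child element (the "tail") is ignored here.
    """
    root = ET.fromstring(xhtml)
    results = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if not text:
            continue
        # Temporarily drop the tail so that it is not counted as source.
        saved_tail, elem.tail = elem.tail, None
        source = ET.tostring(elem, encoding="unicode")
        elem.tail = saved_tail
        results.append((text, len(text) / len(source)))
    return results


def bayes_threshold(body_densities, other_densities):
    """Estimate a density threshold from labelled sample densities.

    Assumption: both classes are Gaussian with one pooled variance, and
    the priors are the class frequencies. The body class is expected to
    have the higher mean, and at least three samples are needed in total.
    """
    mu1 = statistics.mean(body_densities)
    mu0 = statistics.mean(other_densities)
    n1, n0 = len(body_densities), len(other_densities)
    pooled_var = (
        sum((x - mu1) ** 2 for x in body_densities)
        + sum((x - mu0) ** 2 for x in other_densities)
    ) / (n1 + n0 - 2)
    prior_ratio = n0 / n1  # P(other) / P(body)
    return (mu0 + mu1) / 2 + pooled_var * math.log(prior_ratio) / (mu1 - mu0)


def extract_body(xhtml, threshold):
    """Concatenate the text of nodes whose density exceeds the threshold."""
    return "".join(
        text for text, density in text_densities(xhtml) if density > threshold
    )
```

With labelled densities from a training set, `bayes_threshold` yields a cut-off that can be passed to `extract_body`. Nodes at or below the cut-off are discarded, matching the "less than or equal" rule in the abstract.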


