Unified and Automatic Web News Object Extraction Approach (cited by: 4)
Abstract: This paper proposes a unified method for automatically extracting Web news objects. The method treats the category, title, publication time, source, author, content, related-comment links, and related-news links of a news page as the classification attributes, and extracts the news object in three steps: page parsing, candidate value extraction, and true value identification. Experimental results show that the method achieves high accuracy when extracting multiple attributes of a news object at the same time, and that the extraction results do not depend on any specific page template.
Authors: Liu Wei (刘伟), Yan Hualiang (严华梁)
Source: Computer Engineering (《计算机工程》), indexed in CAS and CSCD, 2012, No. 11, pp. 167-169 (3 pages)
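Funding: National High Technology Research and Development Program of China ("863" Program) (2008AA01Z421); Pre-research Fund of the Institute of Scientific and Technical Information of China (YY-201103)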
Keywords: Web data extraction; visual feature; sequence tagging; Web page template; news attribute; news object
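
The abstract names three steps (page parsing, candidate value extraction, true value identification) but gives no implementation detail. The following is only a minimal sketch of how such a pipeline could be wired together for two of the attributes (title and publication time). Everything in it is an assumption made for illustration: the names `TextNodeParser`, `extract_candidates`, `identify_true_values`, and `extract_news_object` are hypothetical, the sample page is invented, and the hand-written rules in step 3 stand in for whatever model the paper actually uses (its keywords suggest visual features and sequence tagging, which this sketch does not implement).

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}
SKIP_TAGS = {"script", "style"}


class TextNodeParser(HTMLParser):
    """Step 1 (page parsing): collect (enclosing tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.nodes = []   # (tag, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in SKIP_TAGS:
            self.nodes.append((self.stack[-1], text))


# Assumed date format for illustration only: YYYY-MM-DD with optional HH:MM.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2})?")


def extract_candidates(nodes):
    """Step 2 (candidate value extraction): gather plausible values."""
    candidates = {"title": [], "date": []}
    for tag, text in nodes:
        if tag in ("title", "h1", "h2"):
            candidates["title"].append((tag, text))
        match = DATE_RE.search(text)
        if match:
            candidates["date"].append((tag, match.group()))
    return candidates


# Placeholder rule, not the paper's classifier: prefer h1, then h2, then <title>.
TITLE_RANK = {"h1": 0, "h2": 1, "title": 2}


def identify_true_values(candidates):
    """Step 3 (true value identification): pick one value per attribute."""
    result = {}
    if candidates["title"]:
        best = min(candidates["title"], key=lambda c: TITLE_RANK[c[0]])
        result["title"] = best[1]
    if candidates["date"]:
        # Placeholder rule: take the first date in document order.
        result["date"] = candidates["date"][0][1]
    return result


def extract_news_object(html):
    parser = TextNodeParser()
    parser.feed(html)
    return identify_true_values(extract_candidates(parser.nodes))


# Invented sample page, used only to exercise the sketch.
SAMPLE = """<html><head><title>Example News Site - Sample headline</title></head>
<body><h1>Sample headline</h1>
<p class="meta">Published 2012-06-05 09:30 by a reporter</p>
<p>Body text mentioning 2011-01-01 as an earlier date.</p></body></html>"""

if __name__ == "__main__":
    print(extract_news_object(SAMPLE))
```

Separating candidate extraction from true value identification mirrors the structure described in the abstract: step 2 can be deliberately permissive, because step 3 is responsible for choosing among the candidates. A template-independent system would presumably replace the fixed tag ranking here with learned features.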