
Key information extraction algorithm of news Web pages (cited by: 6)
Abstract: Existing algorithms for extracting the body text of Web pages lack generality and, when applied to news pages, do not recover the title, publication time, or source. To address these problems, a news key-information extraction algorithm, newsExtractor, is proposed. The algorithm first preprocesses the Web page into a set of line numbers and their text. Then, exploiting the observation that the longest sentence on the page belongs to the news body with very high probability, it starts from the line containing that sentence and searches toward both ends for the start and end of the body. The title is extracted with the longest common substring algorithm; the publication time is extracted with a regular expression, using line numbers as an auxiliary check; and the source is extracted from its formatting characteristics, again aided by line numbers. Finally, a data set was built to compare extraction accuracy against the foreign open-source software newsPaper. The experimental results show that newsExtractor achieves higher average extraction accuracy than newsPaper for body, title, publication time, and source, and that it is general and robust.
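Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2016, No. 8, pp. 2082-2086, 2120 (6 pages)
Funding: National Natural Science Foundation of China, General Program (61375039); Computer Network Information Center, Chinese Academy of Sciences, "One-Three-Five" Key Project (CNIC_PY_1402)
Keywords: Web information extraction; news information extraction; Web denoising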
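
The steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' newsExtractor code: the abstract does not give the stopping rule for the body search, which strings the longest common substring is computed over, or the exact regular expression. The sketch therefore assumes that the body boundary is reached after a run of short lines, that the title is the longest substring shared by the page's `<title>` text and a line above the body, and that the publication time is the date-like match closest above the body. The function names, the `min_len` and `max_gap` parameters, and `DATE_RE` are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical date pattern; the paper's actual expression is not given in the abstract.
DATE_RE = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)


def extract_body(lines, min_len=20, max_gap=2):
    """lines: list of (line_no, text) pairs from preprocessing.

    Returns the contiguous slice of pairs judged to be the news body.
    """
    if not lines:
        return []
    # Seed: the line with the most characters (stand-in for "longest sentence").
    seed = max(range(len(lines)), key=lambda i: len(lines[i][1]))

    def scan(step):
        # Walk away from the seed; stop once more than max_gap
        # consecutive short lines have been seen (assumed stopping rule).
        i, last_good, gap = seed + step, seed, 0
        while 0 <= i < len(lines) and gap <= max_gap:
            if len(lines[i][1]) >= min_len:
                last_good, gap = i, 0
            else:
                gap += 1
            i += step
        return last_good

    start, end = scan(-1), scan(+1)
    return lines[start:end + 1]


def longest_common_substring(a, b):
    """Longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def extract_title(page_title, lines, body_start_no):
    """Assumed pairing: <title> text versus each line above the body."""
    candidates = [text for no, text in lines if no < body_start_no]
    best = max(
        (longest_common_substring(page_title, t) for t in candidates),
        key=len,
        default="",
    )
    return best.strip()


def extract_time(lines, body_start_no):
    """Date-like match on the line closest above the body, or None."""
    hits = []
    for no, text in lines:
        m = DATE_RE.search(text)
        if m and no < body_start_no:
            hits.append((no, m.group(0)))
    return max(hits, key=lambda h: h[0])[1] if hits else None
```

In this sketch the line number of the first body line returned by `extract_body` is passed as `body_start_no` to the other two functions, which is one way line numbers can serve as the auxiliary signal the abstract mentions. The thresholds would need tuning on real pages.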
