
Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction (cited by: 15)
Abstract: From an information-theoretic perspective, this paper proposes a model for extracting blog posts and comments based on Web format information quantity. First, the visual position information of the Web page is combined with the effective information of its text to locate the main text, which represents the theme of the blog page. Second, the format information in the blog page is treated as the information unit, the format information quantity of each information block is computed, and the boundary between the post and the comments in the main text is found by computing the minimal information quantity of the separating position. The model is language-independent and therefore applies to blogs written in different natural languages. Experimental results show that the model achieves high precision both in locating the main text and in separating the post from the comments.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2009, No. 5, pp. 1282-1291 (10 pages)
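Funding: National Basic Research Program of China (973 Program), Nos. 2004CB318109 and 2007CB311100; National High Technology Research and Development Program of China (863 Program), No. 2007AA01Z441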
Keywords: blog information extraction; minimal main text subtree; effective information ratio; Web format information; vision information; information quantity of separate position
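
The abstract describes splitting the main text into post and comments by scoring candidate split positions with an information measure over format information, but it gives no formulas. The following is therefore only a minimal sketch of that general idea, not the paper's method. It assumes each main-text block has already been reduced to a format signature string (a hypothetical representation such as `"li>span+p"`), and it swaps in a generic criterion: choose the split position that minimises the size-weighted Shannon entropy of the signatures on the two sides. The names `entropy`, `best_split`, and `blocks` are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(signatures):
    """Shannon entropy, in bits, of the format signatures in one segment."""
    n = len(signatures)
    if n == 0:
        return 0.0
    counts = Counter(signatures)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_split(signatures):
    """Return (k, cost): signatures[:k] is taken as the post, signatures[k:] as comments.

    cost is the size-weighted mean entropy of the two sides. This criterion is
    an assumption made for illustration; it is not taken from the paper.
    """
    n = len(signatures)
    if n < 2:
        raise ValueError("need at least two blocks to split")
    best_k, best_cost = 1, math.inf
    for k in range(1, n):
        left, right = signatures[:k], signatures[k:]
        cost = (k * entropy(left) + (n - k) * entropy(right)) / n
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Hypothetical main-text blocks: a post with mixed formatting, then uniform comments.
blocks = ["h1", "p", "p", "img", "p"] + ["li>span+p"] * 4
k, cost = best_split(blocks)
print("post blocks:", blocks[:k])
print("comment blocks:", blocks[k:])
```

Because the sketch only compares format signatures and never inspects the text itself, it shares the language-independence property the abstract claims for the model. Comment blocks that repeat one template contribute almost no entropy, so the split tends to fall where the repeated pattern begins.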

