The Strategy of the Information Extraction for Text from News Web Pages (面向Web的新闻网页正文信息抽取策略研究)

Abstract: This paper presents a method for extracting body text from news web pages that combines an STU-DOM tree with character-count statistics. The method assigns two semantic context attributes to each node, filters out the nodes unrelated to the page topic, and then uses the number of Chinese characters contained under the relevant tags to select the nodes of the STU-DOM tree that hold the body text. The strategy extracts the body content accurately and also preserves, without loss, the topic-related links that appear within it.
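The pipeline described in the abstract (annotate nodes, filter topic-unrelated nodes, select by Chinese character count) can be illustrated with a minimal sketch. The abstract does not say how the STU-DOM tree is built or what the two semantic attributes are, so everything below is an assumption: a plain tree of container tags stands in for the STU-DOM tree, the two attributes are taken to be the counts of Chinese characters outside and inside links, and a node is treated as topic-unrelated when its link text outweighs its plain text. The names `Node`, `TreeBuilder`, `extract_body`, `BLOCK`, and `SAMPLE_HTML` are hypothetical and do not come from the paper.

```python
import re
from html.parser import HTMLParser

CJK = re.compile(r"[\u4e00-\u9fff]")  # common Chinese characters
BLOCK = {"html", "body", "div", "table", "tr", "td", "ul", "li"}  # assumed container tags
SKIP = {"script", "style"}  # content that is never body text


class Node:
    """A container node carrying two assumed attributes: text_chars and link_chars."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text_chars = 0  # Chinese characters outside links
        self.link_chars = 0  # Chinese characters inside <a> elements
        self.pieces = []     # raw text of the node, link text included
        self.links = []      # href values of the links inside the node


class TreeBuilder(HTMLParser):
    """Attributes text to the nearest open container node."""

    def __init__(self):
        super().__init__()
        self.stack = [Node("#document")]  # synthetic root, never popped
        self.nodes = []
        self.in_link = 0
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
            self.stack[-1].links.append(dict(attrs).get("href"))
        elif tag in BLOCK:
            node = Node(tag, attrs)
            self.nodes.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK:
            # Close back to the nearest open node with this tag, if there is one.
            for i in range(len(self.stack) - 1, 0, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.skip:
            return
        node = self.stack[-1]
        count = len(CJK.findall(data))
        if self.in_link:
            node.link_chars += count
        else:
            node.text_chars += count
        node.pieces.append(data)


def extract_body(html):
    """Return the container with the most non-link Chinese characters, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Assumed filter: link-dominated nodes (menus, footers) are topic-unrelated.
    candidates = [n for n in builder.nodes if n.link_chars <= n.text_chars]
    return max(candidates, key=lambda n: n.text_chars, default=None)


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">首页</a><a href="/news">新闻中心</a><a href="/sports">体育频道</a></div>
<div id="content"><p>今天上午，市政府召开新闻发布会，介绍了本年度教育信息化建设的进展情况。</p>
<p>详情参见<a href="/report">相关报道</a>。</p></div>
<div id="footer">版权所有 <a href="/about">关于我们</a><a href="/contact">联系方式</a></div>
<script>var x = "脚本中的文字不应计入";</script>
</body></html>
"""

best = extract_body(SAMPLE_HTML)
print(best.attrs.get("id"), best.links)
print(" ".join("".join(best.pieces).split()))
```

In this sample the navigation bar and footer are dropped by the link filter, while the link inside the article body stays attached to the selected node, which mirrors the abstract's claim that topic-related links are retained. A real implementation would need the paper's actual node attributes and filtering rules.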

Authors: Chen Leilei (陈蕾蕾), Zhang Rujing (张如静) (Department of Educational Technology, Nanjing Normal University, Nanjing 210097, China)

Affiliation: Department of Educational Technology, Nanjing Normal University (南京师范大学教育技术系)

Source: Computer Knowledge and Technology (《电脑知识与技术》), 2008, Issue S2, pp. 1-2 (2 pages)

Keywords: statistics; STU-DOM tree; information extraction

Classification number: TP393.092 [Automation and Computer Technology: Computer Application Technology]
