期刊文献+

基于分块的网页主题文本抽取 被引量:5

Extraction of Topical Text from Web Pages Based on Page Segmentation
下载PDF
导出
摘要 根据网页文本信息的结构和内容特征,提出一种网页主题文本信息的抽取策略,将网页文档表示为DOM标签树的形式,然后根据Web页面的结构特征进行内容块的分割,以网页的文本内容特征为依据识别链接型和主题型内容块,并提取主题型网页的文本信息块。实验结果表明:基于分块的方法有效地实现了链接型和主题型网页的分类,并准确地完成主题型网页的文本信息块的抽取任务,是一种简单、准确的网页信息抽取方法。 According to the structure and content features of Web pages,a method is proposed,which is applied to extract topical text information in Web pages. Firstly,the Webpage is transformed to the form of DOM Tree, then divided into blocks based on the structural features. Secondly, content blocks are classified into link class or topic class using the content features. Finally,the topical text in topical Web pages is extracted. The experimental result indicates that this method has achieved the classification and extraction of Web pages in simple way ,but effectively and accurately.
出处 《广西师范大学学报(自然科学版)》 CAS 北大核心 2009年第1期141-144,共4页 Journal of Guangxi Normal University:Natural Science Edition
基金 国家自然科学基金资助项目(60473139 60775041)
关键词 HTML标签 网页分块 内容特征 信息抽取 HTML tags Web page segmentation content features information extraction
  • 相关文献

参考文献5

二级参考文献30

  • 1林亚平,刘云中,周顺先,陈治平,蔡立军.基于最大熵的隐马尔可夫模型文本信息抽取[J].电子学报,2005,33(2):236-240. 被引量:48
  • 2EMBLEY DW,JIANG YS,NG YK.Record-Boundary Discovery in Web Documents[A].SIGMOD'99 Proceedings[C].1999.
  • 3EMBLEY DW,LI X.Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents[A].WebDB'00 Proceedings[C].2000.
  • 4LIM SJ,NG YK.Extracting Structures of HTML Documents Using a High-Level Stack Machine[M].Information Networking in Asia,Gordon and Breach Science Publishers,Newark,New Jersey,2001.
  • 5LIM SJ,NG YK,YANG XC.Integrating HTML Tables Using Semantic Hierarchies And Meta-Data Sets[A].International Database Engineering and Applications Symposium(IDEAS'02)[C].Edmonton,Canada,2002.
  • 6LIM SJ,NG YK.A Heuristic Approach for Converting HTML Documents to XML Documents[A].Proceedings of the Sixth International Conference on Rules and Objects in Databases(DOOD 2000)[C].London,England,2000.1182-1196.
  • 7LIN SH,HO JM.Discovering Informative Content Blocks from Web Documents[A].KDD 2002[C].2002.588-593.
  • 8YU SP,CAI D,WEN JR,et al.Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation[EB/OL].http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=632,2002-12.
  • 9WEN JR,SONG RH,CAI D,et al.Microsoft Research Asia at The Web Track of TREC 2003[A].The Twelfth Text Retrieval Conference(TREC'12)[C].2003.
  • 10朱明.[D].中国科学技术大学,2000.

共引文献88

同被引文献79

引证文献5

二级引证文献17

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部