Main-Text Extraction from Web Pages with Flexible Structure (cited by: 3)

Content Extraction Based on Unknown Structure Web
Abstract: In Web data mining, most Web pages contain noise such as hyperlinks to other pages and copyright notices, which degrades the accuracy of mining results. Pages therefore need to be cleaned and their main text extracted before mining. In addition, the HTML of many real-world pages is not well-formed. To address both issues, the paper proposes a main-text extraction algorithm suited to Web pages with flexible (irregular) structure. The algorithm splits a page into nodes at its HTML tags, locates one node that contains main-text content, and then, starting from that node, searches forward and backward for the remaining content nodes; the text of all nodes found is concatenated to form the extracted article. Experimental results show that the algorithm is widely applicable, including to irregularly structured pages, and achieves fairly high accuracy.
Authors: 殷彬 (Yin Bin), 杨会志 (Yang Huizhi)
Source: Computer Technology and Development (《计算机技术与发展》), 2011, No. 9, pp. 111-113, 117 (4 pages)
Funding: Zhongshan Science and Technology Plan Project (20092A210)
Keywords: Web data mining; Web information extraction; content node; hyperlink node; node weight; link density
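
The abstract describes the algorithm only in outline: split the page into nodes, find one content node, then expand forward and backward from it. The following Python sketch illustrates that outline under stated assumptions. The node weight used here (the count of non-link characters), the thresholds `max_density` and `min_chars`, the tag sets, and all function names are illustrative choices, not the formulas or values from the paper.

```python
from html.parser import HTMLParser

# Assumption: tags treated as node boundaries; the paper's tag set is not given.
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "body"}
SKIP_TAGS = {"script", "style", "title"}  # text inside these is ignored


class NodeSplitter(HTMLParser):
    """Split a page into (text, link_chars) nodes at block-level tags."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        text = "".join(self._text).strip()
        if text:
            self.nodes.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def link_density(node):
    """Fraction of a node's characters that sit inside <a> elements."""
    text, link_chars = node
    return link_chars / len(text)


def node_weight(node):
    """Hypothetical weight: number of characters outside hyperlinks."""
    text, link_chars = node
    return len(text) - link_chars


def extract_main_text(html, max_density=0.3, min_chars=10):
    """Pick the heaviest node as seed, then grow the span both ways."""
    splitter = NodeSplitter()
    splitter.feed(html)
    splitter.close()
    nodes = splitter.nodes
    if not nodes:
        return ""

    def is_content(node):
        return link_density(node) <= max_density and node_weight(node) >= min_chars

    seed = max(range(len(nodes)), key=lambda i: node_weight(nodes[i]))
    start = end = seed
    while start > 0 and is_content(nodes[start - 1]):
        start -= 1  # extend backward over adjacent content nodes
    while end < len(nodes) - 1 and is_content(nodes[end + 1]):
        end += 1    # extend forward over adjacent content nodes
    return "\n".join(text for text, _ in nodes[start:end + 1])
```

Calling `extract_main_text(page_html)` on a page whose article paragraphs sit between a navigation bar and a footer made mostly of links should return only the paragraphs, because the link-heavy nodes stop the expansion in both directions. Using the standard-library `html.parser` keeps the sketch tolerant of irregular markup, since it does not require a well-formed tree.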