
Web Page Information Extraction Based on HTML Structural Features (cited by: 5)

Page Information Extraction Based on the Structure of the HTML
Abstract: Much of the information on the Web is stored in HTML pages. The traditional approach to web data extraction uses a wrapper to pull the data of interest out of a page, but acquiring the pattern-recognition knowledge a wrapper needs is time-consuming, labor-intensive, and demands considerable expertise. This paper avoids wrappers. Based on the structural characteristics of news web pages, it partitions the visual space of a page into noise and information entities, and it discusses a method for extracting the main body of a news item from the page's hierarchical structure and the statistics of the nodes at each level. The traditional DOM model is improved by adding attributes such as layer and style as the basis for noise judgment, and by attaching statistical information to its nodes. Using salient features of news such as the headline and the time string, the paper proposes and implements a method that combines forward direct extraction with reverse noise reduction to obtain structured data from news web pages. Experimental results show that the method is effective for extracting the main content of news web pages.
Authors: Hu Yu (胡瑜), Wang Lizhi (王立志)
Source: Journal of Liaoning Petrochemical University (《辽宁石油化工大学学报》), CAS, 2009, No. 3, pp. 65-69 (5 pages)
Keywords: information extraction; DOM; LA-DOM; HTML parsing; noise marking
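
The abstract names the ingredients of the method (a DOM extended with layer and style attributes, per-node statistics, noise marking, and direct extraction of the headline and time) but gives no details of the LA-DOM model itself. The following is therefore only a minimal sketch of how such a pipeline might look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the `Node` class, the use of the `class` attribute as the "style", the link-text ratio threshold of 0.5 for marking noise, the "descend while one child holds more than half of the non-link text" rule, the date pattern, and the sample page.

```python
import re
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2})?")  # assumed format


class Node:
    """DOM node extended with layer (depth), style, and text statistics."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.style = dict(attrs).get("class") or ""  # assumption: class as style
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = []  # child Nodes and text strings, in document order
        self.text_len = 0   # characters of text in this subtree
        self.link_len = 0   # of those, characters inside <a> elements

    def elements(self):
        return [c for c in self.children if isinstance(c, Node)]

    def text(self):
        parts = [c.text() if isinstance(c, Node) else c for c in self.children]
        return " ".join(p for p in parts if p)


class Builder(HTMLParser):
    """Builds the annotated tree and updates statistics as text arrives."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:  # unmatched end tags are ignored
            self.cur = node.parent

    def handle_data(self, data):
        data = data.strip()
        if not data or self.cur.tag in ("script", "style"):
            return
        self.cur.children.append(data)
        chain, node = [], self.cur
        while node is not None:
            chain.append(node)
            node = node.parent
        in_link = any(n.tag == "a" for n in chain)
        for n in chain:  # propagate the counts to every ancestor
            n.text_len += len(data)
            if in_link:
                n.link_len += len(data)


def content_len(node):
    return node.text_len - node.link_len


def is_noise(node, max_link_ratio=0.5):
    """Reverse step: empty or link-dominated blocks are marked as noise."""
    return node.text_len == 0 or node.link_len / node.text_len > max_link_ratio


def main_block(root):
    """Descend while a single non-noise child holds most of the non-link text."""
    node = root
    while True:
        kept = [c for c in node.elements() if not is_noise(c)]
        best = max(kept, key=content_len, default=None)
        if best is None or content_len(best) <= 0.5 * content_len(node):
            return node
        node = best


def find(node, tag):
    for child in node.elements():
        hit = child if child.tag == tag else find(child, tag)
        if hit is not None:
            return hit
    return None


def extract(html):
    builder = Builder()
    builder.feed(html)
    builder.close()
    block = main_block(builder.root)
    head = find(block, "h1")           # forward step: headline
    when = DATE_RE.search(block.text())  # forward step: time string
    return {"title": head.text() if head is not None else "",
            "time": when.group(0) if when else "",
            "body": [p.text() for p in block.elements() if p.tag == "p"],
            "style": block.style, "depth": block.depth}


SAMPLE = """<html><head><title>Site - Example</title></head><body>
<div class="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div class="article"><h1>Example headline</h1>
<span class="time">2009-03-15 10:30</span>
<p>First paragraph of the story with enough words to matter.</p>
<p>Second paragraph of the story, also reasonably long text.</p></div>
<div class="footer"><a href="/about">About us</a> <a href="/help">Help</a></div>
</body></html>"""
```

Calling `extract(SAMPLE)` returns a dictionary with the headline, the time string, the two paragraphs, and the style and depth of the block chosen as the news body. The navigation and footer blocks consist entirely of link text, so the noise test excludes them before the main block is chosen. A page whose article is one very long paragraph would make this descent rule stop at that paragraph rather than at its container, which is one reason the thresholds here should be read as placeholders.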


