Abstract
Much of the information on the Web is stored in HTML pages. The traditional approach to web data extraction uses a wrapper to pull out the data of interest, but acquiring the pattern-recognition knowledge a wrapper needs is time-consuming, labor-intensive, and demands considerable expertise. This paper avoids wrappers. Based on the structural characteristics of news web pages, it partitions the page's layout, from a visual perspective, into noise and information entities, and discusses a method for extracting the main body of a news page from the page's hierarchical structure and the statistics of the nodes at each level. The traditional DOM model is extended with hierarchy and style attributes, which serve as the basis for identifying noise, and statistical information is attached to each node. Using outwardly visible features of news, such as the headline and the timestamp, a method is proposed and implemented that combines forward direct extraction with reverse noise-reduction extraction to obtain structured data from news web pages. Experimental results show that the method is effective for extracting the main content of news web pages.
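The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea, under stated assumptions. It builds a small DOM-like tree in which each node carries its depth and two statistics (text length and link-text length), pulls the headline and timestamp out directly, and discards link-heavy subtrees as noise. The names `Node`, `TreeBuilder`, `extract_news`, the `<h1>` headline rule, the date pattern, and the 0.5 link-ratio threshold are all hypothetical choices for illustration, not the authors' implementation, and the parser assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col", "wbr"}
SKIP = {"script", "style", "head"}  # text inside these never counts as content
# Assumed timestamp format (e.g. "2009-03-15 10:30"); real pages vary.
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2})?")

class Node:
    def __init__(self, tag, depth, parent=None):
        self.tag = tag
        self.depth = depth      # hierarchy level; recorded, not used by the heuristic below
        self.parent = parent
        self.children = []      # Node objects and text strings, in document order
        self.text_len = 0       # characters of text in this subtree
        self.link_len = 0       # characters of that text that sit inside <a>

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root", 0)
        self.cur = self.root
        self.in_link = 0
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, self.cur.depth + 1, self.cur)
        self.cur.children.append(node)
        self.cur = node
        self.in_link += tag == "a"
        self.skip += tag in SKIP

    def handle_endtag(self, tag):
        target = self.cur
        while target is not self.root and target.tag != tag:
            target = target.parent
        if target is self.root:
            return  # stray closing tag: ignore it
        while self.cur is not target.parent:  # close everything up to the match
            self.in_link -= self.cur.tag == "a"
            self.skip -= self.cur.tag in SKIP
            self.cur = self.cur.parent

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip:
            return
        self.cur.children.append(text)
        node = self.cur
        while node is not None:  # push the statistics up to every ancestor
            node.text_len += len(text)
            if self.in_link:
                node.link_len += len(text)
            node = node.parent

def subtree_text(node):
    parts = [c if isinstance(c, str) else subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)

def extract_news(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    result = {"title": None, "time": None}
    body = []

    def walk(node):
        # Reverse step: a non-link node whose text is mostly link text is noise.
        if (node.tag != "a" and node.text_len
                and node.link_len / node.text_len > max_link_ratio):
            return
        # Forward step: take the first <h1> as the headline.
        if node.tag == "h1" and result["title"] is None:
            result["title"] = subtree_text(node)
            return
        for child in node.children:
            if isinstance(child, Node):
                walk(child)
                continue
            match = DATE_RE.search(child)
            # Forward step: the first short text containing a date is the timestamp.
            if match and result["time"] is None and len(child) <= 40:
                result["time"] = match.group(0)
            else:
                body.append(child)

    walk(builder.root)
    result["body"] = " ".join(body)
    return result
```

Calling `extract_news(page_html)` returns a dictionary with `title`, `time`, and `body` keys. Link density is only one possible noise signal; the style attributes mentioned in the abstract are not modeled here.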
Source
《辽宁石油化工大学学报》
CAS
2009, No. 3, pp. 65-69 (5 pages)
Journal of Liaoning Petrochemical University