Abstract
Accurately extracting the main content of Web pages has important applications in many areas of Web mining research. Current approaches rely mainly on webpage segmentation and density statistics, but they may fail when the main content of a page contains few characters. Case analysis shows that most pages within a website are generated from the same content template. This paper therefore proposes a main-content extraction method based on webpage clustering. The method consists of two components: first, clustering webpages by their structural features; second, generating content-location features from each set of similar webpages. The method can extract main content from many types of webpages. Experiments on webpages from 5 websites show that the method is feasible and effective.
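The two components named in the abstract can be illustrated with a minimal sketch. This record does not give the paper's actual structural features, similarity measure, or location features, so everything below is an assumption made for illustration: pages are represented by their sets of tag paths, compared with Jaccard similarity, grouped greedily against a hypothetical threshold of 0.8, and the content location is taken to be the shared tag path whose text differs across the pages of a cluster. All function and class names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the text found under each tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.text_by_path[path] = self.text_by_path.get(path, "") + text


def parse(html):
    """Return a dict mapping tag path -> concatenated text for one page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.text_by_path


def jaccard(a, b):
    """Structural similarity of two pages: overlap of their tag-path sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster(pages, threshold=0.8):
    """Greedy clustering: join the first cluster whose first page is similar enough."""
    clusters = []
    for page in pages:
        for group in clusters:
            if jaccard(page, group[0]) >= threshold:
                group.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def content_path(group):
    """Pick the shared path whose text varies across pages (assumed to be content)."""
    shared = set.intersection(*(set(p) for p in group))
    varying = [path for path in shared if len({p[path] for p in group}) > 1]
    return max(varying, key=lambda path: sum(len(p[path]) for p in group), default=None)
```

Because the location is learned from the whole cluster rather than from text density on a single page, a page with very little text can still be handled by looking up the learned path in its own `parse` result, which is the situation the abstract says density-based methods struggle with.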
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journals (北大核心)
2018, Issue 1, pp. 111-115 (5 pages)
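Funding
Supported by the National Natural Science Foundation of China (61402111)
Supported by the Fujian Provincial Science and Technology Platform Construction Project (2014m005).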