Abstract
Extracting the important content of a Web page, such as text and links, is valuable in many areas of Web mining research. The prevailing approach combines Web page segmentation with the recognition of informative blocks. Existing methods, however, treat the identification of important text and of important links as two independent problems. This ignores the inherent semantic relationship between the text and the links in a page, and it also lowers processing efficiency. This paper proposes a unified framework for mining the important content of a Web page. The framework has three components. First, a Web page is converted into a DOM tree, which is then segmented into page blocks using node density entropy as the metric. Second, a semi-supervised method based on K-nearest-neighbor label propagation automatically extends the training set of page blocks. Third, an SVM classifier is trained on the extended training set and then used to classify page blocks. The framework can assign Web page blocks to multiple types, and it is independent of the type and layout of the Web page. We conducted extensive experiments in a real Web environment, and the results show that the proposed method is effective.
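The second component, label propagation over a K-nearest-neighbor graph, can be illustrated with a minimal sketch. It assumes each page block has already been reduced to a numeric feature vector. The function name `knn_label_propagation`, the two features (text density and link density), and the sample values are hypothetical and are not taken from the paper, whose exact features and propagation rule are not given in the abstract.

```python
import math


def knn_label_propagation(features, labels, k=2, max_iter=100, tol=1e-9):
    """Spread labels from seed blocks to unlabeled blocks over a k-NN graph.

    features: list of equal-length numeric vectors, one per page block.
    labels:   list holding a class label for seed blocks and None otherwise.
    Returns a list with one label per block; seed labels are kept fixed.
    """
    n = len(features)
    classes = sorted({y for y in labels if y is not None})
    if n < 2 or not classes:
        raise ValueError("need at least two blocks and one seed label")

    # k nearest neighbours of every block, by Euclidean distance.
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(features[i], features[j]))
        neighbours.append(others[:k])

    # Class scores: one-hot for seeds, uniform for unlabeled blocks.
    uniform = [1.0 / len(classes)] * len(classes)
    scores = [[1.0 if c == y else 0.0 for c in classes] if y is not None
              else list(uniform) for y in labels]

    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            if labels[i] is not None:
                new_scores.append(scores[i])  # clamp seed labels
                continue
            # An unlabeled block takes the mean score of its neighbours.
            nbrs = neighbours[i]
            new_scores.append([sum(scores[j][c] for j in nbrs) / len(nbrs)
                               for c in range(len(classes))])
        change = max(abs(a - b) for old, new in zip(scores, new_scores)
                     for a, b in zip(old, new))
        scores = new_scores
        if change < tol:
            break

    # Each block receives the class with the highest score.
    return [classes[max(range(len(classes)), key=lambda c: s[c])]
            for s in scores]


# Hypothetical blocks described by (text density, link density).
blocks = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
          (0.10, 0.90), (0.20, 0.80), (0.15, 0.85)]
seeds = ["text", None, None, "link", None, None]
extended = knn_label_propagation(blocks, seeds, k=2)
print(extended)
```

In the framework described above, a training set extended in this way would then be used to train the SVM classifier of the third component. That step is omitted here because it depends on the choice of SVM library.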
Source
Chinese Journal of Computers (《计算机学报》)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
2015, Issue 2, pp. 349-364 (16 pages)
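Funding
National Natural Science Foundation of China (61272109, 61202285)
National Spark Program of China (2012GA750007)
Basic and Frontier Technology Research Program of the Henan Provincial Department of Science and Technology (122300410378)
Key Science and Technology Research Project of the Henan Provincial Department of Education (13A520032)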
Keywords
Web page segmentation
node density
label propagation
DOM tree
block classification
social computing
social networks