Abstract
This paper studies a Web page segmentation method based on CURE clustering, together with rules for extracting the main text block. Attributes are added to the nodes of a standardized page DOM tree, converting it into an extended DOM tree that records the offset of each information node. The CURE algorithm is then used to cluster the information nodes, and each resulting cluster represents a different block of the page. Finally, three main features of the text block are extracted and used to construct an information block weight formula, which is used to identify the text block.
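The first two steps of the pipeline (attaching offsets to information nodes, then clustering them) can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation. The abstract does not define the offset, so the code assumes it is the character position of a text node in the page source. `InfoNodeCollector` and `cure_1d` are hypothetical names. `cure_1d` is a simplified CURE-style procedure (a few scattered representatives per cluster, shrunk toward the cluster mean, with the closest pair of clusters merged repeatedly), and its parameters `n_rep` and `shrink` are assumed values. The weight formula is left out because the abstract does not name its three features.

```python
from html.parser import HTMLParser


class InfoNodeCollector(HTMLParser):
    """Collect visible text nodes ("information nodes") with a source offset.

    Assumption: the offset is the character index of the text node in the
    HTML source. The abstract does not give the paper's exact definition.
    """

    def __init__(self, html):
        super().__init__()
        # Start index of every source line, so (line, column) maps to an offset.
        self._line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self._line_starts.append(i + 1)
        self._in_link = 0   # nesting depth of <a> elements
        self._skipped = 0   # nesting depth of <script>/<style> elements
        self.nodes = []     # tuples of (offset, text, inside_link)
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skipped += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in ("script", "style") and self._skipped:
            self._skipped -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skipped:
            line, col = self.getpos()  # position where this text node starts
            offset = self._line_starts[line - 1] + col
            self.nodes.append((offset, text, self._in_link > 0))


def cure_1d(points, k, n_rep=2, shrink=0.3):
    """Simplified CURE-style agglomerative clustering of 1-D values into k clusters."""
    clusters = [[p] for p in sorted(points)]

    def representatives(cluster):
        mean = sum(cluster) / len(cluster)
        # Pick well-scattered points: farthest from the mean first,
        # then farthest from the points already chosen.
        chosen = [max(cluster, key=lambda p: abs(p - mean))]
        while len(chosen) < min(n_rep, len(cluster)):
            chosen.append(max(cluster, key=lambda p: min(abs(p - c) for c in chosen)))
        # Shrink each representative toward the cluster mean.
        return [p + shrink * (mean - p) for p in chosen]

    def distance(a, b):
        return min(abs(x - y) for x in representatives(a) for y in representatives(b))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

A possible use is `cure_1d([offset for offset, _, _ in InfoNodeCollector(page).nodes], k)`, where `page` is the HTML source and `k` is the expected number of blocks. Both are inputs the caller has to supply. The abstract does not say how the number of clusters is chosen.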
Source
Microcomputer & Its Applications (《微型机与应用》)
2012, No. 12, pp. 11-14 (4 pages)
Keywords
Web information extraction
clustering algorithm
page segmentation
text block extraction