Abstract
This paper segments a Web page into blocks using its visual features and the structural characteristics of its DOM tree. Noise blocks unrelated to the main text are removed by segmenting and pruning layer by layer, repeating until no further block can be deleted, which leaves the main-text blocks. The VIPS algorithm is then applied to the remaining blocks to obtain complete semantic blocks, and the text content is finally extracted from these semantic blocks. Experiments show that the method is feasible.
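The layer-by-layer pruning described in the abstract can be illustrated with a minimal sketch. The abstract does not state how noise blocks are recognized, so the link-density test below, the thresholds, the `BLOCK_TAGS` set, and every function name are assumptions made for illustration only, not the authors' method. The visual-feature analysis and the VIPS step are not implemented here.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed set of tags treated as candidate blocks (not taken from the paper).
BLOCK_TAGS = {"body", "div", "table", "tr", "td", "ul", "section", "article"}
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""  # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Build a very small DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        chunk = " ".join(data.split())
        if chunk:
            node = self.stack[-1]
            node.text = (node.text + " " + chunk).strip()


def count_chars(node, in_link=False):
    """Return (total characters, characters inside <a>) for a subtree."""
    in_link = in_link or node.tag == "a"
    total = len(node.text)
    linked = total if in_link else 0
    for child in node.children:
        t, l = count_chars(child, in_link)
        total += t
        linked += l
    return total, linked


def is_noise(node, max_link_density=0.5):
    """Assumed heuristic: empty or link-dominated blocks count as noise."""
    total, linked = count_chars(node)
    return total == 0 or linked / total > max_link_density


def prune(node):
    """Drop noisy child blocks at this layer, then descend into the rest."""
    node.children = [
        c for c in node.children
        if c.tag not in SKIP_TAGS and not (c.tag in BLOCK_TAGS and is_noise(c))
    ]
    for child in node.children:
        prune(child)


def collect_text(node):
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    prune(builder.root)
    return " ".join(collect_text(builder.root))
```

Calling `extract_main_text` on a page whose navigation bar is a block of links and whose article sits in a separate block keeps only the article text under these assumptions. A real implementation would also need the visual cues (position, size, font, background) that the abstract mentions.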
Source
Microcomputer & Its Applications (《微型机与应用》)
2010, Issue 3, pp. 38-41 (4 pages)
Keywords
page segmentation
information extraction
visual features