Abstract
Extracting content from Deep Web pages is challenging because of the intricate structure of such pages. This paper proposes DVS, an approach that combines a page's visual information with its tag information to extract the page's content structure. DVS works in two steps. First, it analyzes the page's Cascading Style Sheets (CSS) and DOM tree to obtain visual and tag information, and builds an initial visual tree of the Deep Web result page. Second, it applies the Path Shingle (PS) tree-path similarity algorithm, which takes both tag and visual information into account, to compute the similarity of the blocks in the visual tree and cluster them. The result is the final visual tree, i.e., the content structure of the page. DVS has two distinguishing features: it draws on both visual and tag information to extract the content structure, and it represents visual information as a tree, so that analyzing visual information becomes analyzing a "visual attribute" tree. Experiments on UIUC's TEL dataset, a large collection of Web databases, show that DVS achieves higher accuracy than the WTS and VIPS algorithms.
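The abstract names the second step (path-based similarity followed by clustering) but gives no details, so the following is only a minimal sketch of the general idea, not the paper's implementation. Everything specific here is an assumption made for illustration: the `Node` class, the `style` string standing in for a block's visual attributes, the shingle length `k`, Jaccard overlap as the similarity measure, and the greedy threshold clustering.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical visual-tree node: a tag plus a visual signature."""
    tag: str
    style: str = ""  # assumed stand-in for CSS-derived visual attributes
    children: List["Node"] = field(default_factory=list)


def root_to_leaf_paths(node, prefix=()):
    """Return every root-to-leaf path as a tuple of 'tag[style]' labels."""
    path = prefix + (f"{node.tag}[{node.style}]",)
    if not node.children:
        return [path]
    paths = []
    for child in node.children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


def shingles(node, k=2):
    """Collect all length-k windows over each path (whole path if shorter)."""
    out = set()
    for path in root_to_leaf_paths(node):
        if len(path) < k:
            out.add(path)
        else:
            for i in range(len(path) - k + 1):
                out.add(path[i:i + k])
    return out


def similarity(a, b, k=2):
    """Jaccard overlap of the two blocks' shingle sets (an assumed measure)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster(blocks, threshold=0.5, k=2):
    """Greedy clustering: join the first group whose first member is similar."""
    groups = []
    for block in blocks:
        for group in groups:
            if similarity(block, group[0], k) >= threshold:
                group.append(block)
                break
        else:
            groups.append([block])
    return groups


# Two result records with the same tags and styles, plus one unrelated block.
record1 = Node("div", "w:600", [Node("a", "bold"), Node("span", "gray")])
record2 = Node("div", "w:600", [Node("a", "bold"), Node("span", "gray")])
banner = Node("div", "w:200", [Node("img")])
groups = cluster([record1, banner, record2])
```

Because each label carries both the tag and the style string, two blocks with identical tags but different styling share no shingles, which is one simple way to let visual and tag information both influence the similarity score.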
Source
Journal of Chongqing University (Natural Science Edition)
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2012, Issue 6, pp. 117-124 (8 pages)
Journal of Chongqing University
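Funding
National Natural Science Foundation of China (61103114)
Chongqing Higher Education Teaching Reform Research Key Project (112023)
Fundamental Research Funds for the Central Universities (CDJXS11181164)
"Project 211" Phase III Construction Fund (S-10218)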
Keywords
Deep Web
content extraction
DOM tree
Cascading Style Sheets (CSS)
visual tree