Abstract
This paper proposes a layout-based content extraction method for topic-oriented Web pages, aimed at removing page noise and consolidating page content. The algorithm first parses the DOM structure of the original page to build a tag tree. It then partitions the tree bottom-up into blocks according to tag categories and the information carried by the corresponding nodes, and classifies the blocks by their text density, link density, and image density. Next, it derives a text feature vector for the page's topic and, using term-space text similarity under the vector space model (VSM), computes each block's relevance to that topic. With this relevance score as the quantitative criterion, it discards noise blocks, identifies the main content, and reconstructs the page description from the blocks that remain. The method has been applied in a talent information collection project. Experiments indicate that it is suitable for cleaning topic-oriented Web pages and extracting their content, and that it performs well in this application.
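The relevance-filtering step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the raw term-frequency weighting, the `\w+` tokenizer, and the 0.2 threshold are all assumptions made for the example, and the record does not state how the paper weights terms or sets its cutoff. Chinese text would also need a word segmenter in place of the regex tokenizer.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased word tokens)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty block or topic has no measurable relevance
    return dot / (norm_a * norm_b)


def link_density(block_text, anchor_text):
    """Share of a block's characters that sit inside links (hypothetical feature)."""
    return len(anchor_text) / len(block_text) if block_text else 0.0


def keep_relevant_blocks(topic_text, blocks, threshold=0.2):
    """Keep blocks whose similarity to the topic text reaches the threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    topic = term_vector(topic_text)
    return [b for b in blocks
            if cosine_similarity(topic, term_vector(b)) >= threshold]
```

For instance, given the topic text "software engineer job", a block reading "Senior software engineer job opening in Guangzhou" shares three terms with the topic and is kept, while a navigation block such as "Home | About | Contact | Login" shares none and is dropped.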
Source
《情报学报》
CSSCI
PKU Core Journal
2012, No. 1, pp. 31-39 (9 pages)
Journal of the China Society for Scientific and Technical Information
Funding
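This paper is an outcome of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (grant no. 08JC870013) and the 2009 Sun Yat-sen University Young Teachers Cultivation Project "Research on Implementation Techniques for an Intelligent Deep Search Engine" (project no. 2000-3161101).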
Keywords
Web page content extraction
Page segmentation
Web page cleaning