Abstract
Search engines currently rank results using the entire web page as the smallest processing unit, which makes ranking vulnerable to noise on the page. To address this problem, this paper proposes purifying web pages through page segmentation and using the purified result to improve the traditional ranking algorithm. First, the vision-based page segmentation algorithm VIPS divides a web page into several semantic blocks. Then, a set of rules retains the semantic blocks that are highly relevant to the page's topic. Finally, these blocks represent the whole page in retrieval, which reduces the effect of page noise on the correctness of the ranking algorithm and improves retrieval quality. Experiments demonstrate the advantage of the improved algorithm.
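The three steps in the abstract (segment, filter by rule, rank on the surviving blocks) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the filtering rules or the ranking formula, so the `Block` shape, the link-ratio and title-overlap rules, their thresholds, and the toy `score` function are all assumptions made for illustration. VIPS itself is not implemented here; the blocks are assumed to come from some VIPS-style segmenter.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """One semantic block, shaped as a VIPS-style segmenter might emit it (hypothetical)."""
    text: str
    link_chars: int = 0  # number of text characters that sit inside hyperlinks


def tokens(s):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", s.lower())


def is_content_block(block, title, max_link_ratio=0.5, min_overlap=1):
    """Assumed rule: keep a block that is not link-dominated and shares terms with the title."""
    words = tokens(block.text)
    if not words:
        return False
    link_ratio = block.link_chars / max(len(block.text), 1)
    overlap = len(set(words) & set(tokens(title)))
    return link_ratio <= max_link_ratio and overlap >= min_overlap


def purify(blocks, title):
    """Return only the blocks judged relevant to the page topic."""
    return [b for b in blocks if is_content_block(b, title)]


def score(query, blocks):
    """Toy relevance: fraction of block tokens that match a query term."""
    words = [w for b in blocks for w in tokens(b.text)]
    if not words:
        return 0.0
    query_terms = set(tokens(query))
    return sum(w in query_terms for w in words) / len(words)


# Hypothetical page: a navigation bar, the main text, and an advertisement.
title = "Web page segmentation"
page = [
    Block("Home News Sports Search Login", link_chars=29),
    Block("Web page segmentation splits a page into blocks"),
    Block("Buy cheap search ads today"),
]
kept = purify(page, title)
```

In this sketch the query "search" matches the unpurified page only through its navigation bar and advertisement, and no longer matches once those blocks are filtered out, which is the kind of noise-induced false match the abstract describes.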
Source
Journal of Zhejiang University of Technology (《浙江工业大学学报》)
Indexed in: CAS, PKU Core (北大核心)
2009, Issue 5, pp. 495-498 (4 pages)
Keywords
webpage noise
webpage segmentation
webpage purification
ranking algorithm
VIPS