Abstract
The duplicate-page elimination module, which filters the web pages downloaded by the crawler and removes pages with duplicated content, is an important component of a search engine. It improves both the performance of the crawler and the quality of retrieval results. This paper proposes a parallel algorithm for eliminating duplicated web pages together with an implementation mechanism based on Map/Reduce. Experiments on a real web site demonstrate the algorithm's stability and its parallel performance when processing large numbers of web pages.
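The abstract does not give the paper's concrete algorithm, but the Map/Reduce pattern it describes can be sketched as follows: a map phase emits a content fingerprint for each crawled page, and a reduce phase groups pages by fingerprint and keeps one representative per group. This is a minimal illustrative sketch; the MD5-of-normalized-text fingerprint and the function names are assumptions, not the authors' method.

```python
import hashlib
from collections import defaultdict


def map_phase(pages):
    """Map step: emit (fingerprint, url) pairs.

    An MD5 digest of whitespace-normalized, lower-cased page text
    stands in for whatever fingerprint the paper actually uses.
    """
    for url, text in pages:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        yield digest, url


def reduce_phase(pairs):
    """Reduce step: group URLs by fingerprint, keep one per group."""
    groups = defaultdict(list)
    for digest, url in pairs:
        groups[digest].append(url)
    # Keep the first URL seen for each distinct fingerprint.
    return [urls[0] for urls in groups.values()]


def deduplicate(pages):
    """Run the two phases over an iterable of (url, text) pairs."""
    return reduce_phase(map_phase(pages))
```

In a real Map/Reduce deployment the grouping by key would be done by the framework's shuffle stage rather than an in-memory dictionary; the sketch only shows the logical division of work.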
Source
《广西师范大学学报(自然科学版)》
CAS
Peking University Core Journal (北大核心)
2007, No. 2, pp. 153-156 (4 pages)
Journal of Guangxi Normal University:Natural Science Edition
Funding
Supported by the National Natural Science Foundation of China (90412015)