Abstract
This paper presents the overall architecture of a parallel Web crawler system and introduces an incremental crawling strategy to improve the efficiency of updating massive Web data. Because the crawlers in the cluster differ in capability, a vector measurement technique based on a cosine vector parallel crawling model is proposed to match crawling tasks to crawler capabilities, so that each crawler's capacity is fully used. The crawling task vector and the crawler vector are defined, and the corresponding parallel crawling algorithms are given on that basis. Practical results indicate that the system adapts well in task distribution and can incrementally improve the "freshness" of the Web page repository.
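The matching of crawling task vectors to crawler vectors can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the vector components or the assignment rule, so the three dimensions used here (bandwidth, CPU, storage), the names `cosine_similarity` and `assign_tasks`, and the greedy "highest cosine similarity wins" rule are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def assign_tasks(tasks, crawlers):
    """Assign each task to the crawler whose vector is most similar to it.

    Hypothetical greedy rule; it ignores load balancing, which a real
    scheduler would also have to handle.
    """
    assignment = {}
    for task_name, task_vec in tasks.items():
        best = max(
            crawlers,
            key=lambda name: cosine_similarity(task_vec, crawlers[name]),
        )
        assignment[task_name] = best
    return assignment


# Assumed dimensions: (bandwidth, CPU, storage). Values are made up.
crawlers = {
    "crawler_a": [8, 2, 1],  # bandwidth-heavy node
    "crawler_b": [1, 2, 8],  # storage-heavy node
}
tasks = {
    "task_1": [4, 1, 1],  # mostly needs bandwidth
    "task_2": [1, 1, 5],  # mostly needs storage
}

result = assign_tasks(tasks, crawlers)
print(result)
```

Cosine similarity compares only the direction of the two vectors, so a task is matched to the crawler whose resource profile has the same shape as its demand, regardless of absolute magnitude. Whether the paper normalizes or weights the components differently is not stated in the abstract.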
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2009, Issue 4, pp. 1117-1119, 1127 (4 pages)
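Funding
National Natural Science Foundation of China (60573108)
Shanghai Municipal Education Commission Development Fund (06QZ00207ZZ92)
Shanghai Municipal Education Commission Key Scientific Research Innovation Project (08ZZ76)
Shanghai Key Discipline Construction Project (s30501)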