摘要
Shark-Search算法是一个经典的主题爬取算法.针对该算法在爬取噪音链接较多的Web页面时性能并不理想的问题,提出了基于网页分块的Shark-Search算法,该算法从页面、块、链接的多种粒度来更加有效的进行链接的选择与过滤.实验证明,改进的Shark-Search算法比传统的Shark-Search算法在查准率和信息量总和上有了质的提高.
A Shark-Seareh algorithm is one of the classical algorithms for focused crawling. However, its performance is not ideal for crawling Web pages which contain too many noisy links. An improved Shark-Search algorithm based on page segmentation was proposed, which can accurately evaluate the relevance from three granularities: page, block and single link. Several experiments were carried out to verify that the improved Shark-Search algorithm can obtain significantly higher efficiency than traditional ones.
出处
《山东大学学报(理学版)》
CAS
CSCD
北大核心
2007年第9期62-66,共5页
Journal of Shandong University(Natural Science)
基金
国家科技支撑计划子课题资助项目(2006BAH02A29)
山东省博士基金资助项目(2006BS01016)