Abstract
In focused (topic-driven) web crawling, the crawler must compute a weight for each URL found on a page and keep refilling the crawl queue so that crawling can continue. The key problem is how to find the links most relevant to the topic without causing "topic drift". Because a link's anchor text is short, it often cannot clearly indicate how relevant the linked page is to the topic. To address this, the paper extends the Shark-search algorithm with a weight for related link blocks, computing each block's weight from the anchor texts of the sub-links it contains. Comparative experiments show that the improved algorithm better differentiates the relevance scores of links on the same page, improves the crawler's precision, and mitigates topic drift.
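The abstract does not give the scoring formula, so the following is only a rough sketch of the general idea: a link with an uninformative anchor text can borrow evidence from the other anchors in its link block. Everything here is an assumption made for illustration, not the paper's method: the function names `cosine`, `block_weight`, and `link_score`, the term-frequency cosine similarity, the averaging of sub-link similarities, and the mixing parameter `beta`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens (assumed preprocessing).
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Term-frequency cosine similarity between two short texts.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def block_weight(topic, block_anchors):
    # Hypothetical block weight: mean similarity of the block's sub-link
    # anchor texts to the topic description.
    if not block_anchors:
        return 0.0
    return sum(cosine(topic, a) for a in block_anchors) / len(block_anchors)


def link_score(topic, anchor, block_anchors, beta=0.5):
    # Hypothetical combination: mix the link's own anchor similarity with
    # the weight of the block it sits in; beta is an assumed parameter.
    own = cosine(topic, anchor)
    return (1 - beta) * own + beta * block_weight(topic, block_anchors)


topic = "web crawler"
relevant_block = ["focused web crawler", "more", "crawler design"]
unrelated_block = ["more", "sports news", "weather"]

# The anchor "more" says nothing by itself; only its block separates the two.
score_in_relevant = link_score(topic, "more", relevant_block)
score_in_unrelated = link_score(topic, "more", unrelated_block)
```

In this toy setup the two "more" links have identical anchor text, yet the one inside the topic-related block receives a higher score, which is the kind of separation between links on the same page that the abstract describes.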
Authors
ZHOU Xue (周雪)
LIU Naiwen (刘乃文)
School of Information Science and Engineering, Shandong Normal University, Jinan 250014; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014
Source
Computer & Digital Engineering (《计算机与数字工程》)
2018, No. 5, pp. 874-878 (5 pages)