Abstract
This paper analyzes Best-First, a content-based link selection algorithm, and introduces the HITS (hyperlink-induced topic search) algorithm, which reflects the value of links, to propose a new link selection strategy. By combining the two algorithms, the new crawler considers not only page content but also link structure, so that downloaded pages remain both topically relevant and authoritative, and the "short-sighted" behavior of the crawler during the crawling stage is alleviated. Experimental results show that the new crawling strategy performs better than the Best-First algorithm alone.
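The abstract does not give the scoring formula, so the following is only a minimal sketch of how a Best-First content score might be blended with a HITS authority score when ordering the crawl frontier. The linear weight `alpha`, the function names (`cosine_similarity`, `hits`, `rank_frontier`), and the choice to let a link inherit the relevance of its parent page are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between term-frequency vectors of two strings."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def _normalize(scores):
    """Scale scores to unit L2 norm; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm else scores


def hits(graph, iterations=20):
    """Iterative HITS on graph {page: [outlinks]}; returns (authority, hub)."""
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = _normalize({n: sum(hub[u] for u in graph if n in graph[u]) for n in nodes})
        # Hub: sum of authority scores of the pages linked to.
        hub = _normalize({n: sum(auth[v] for v in graph.get(n, ())) for n in nodes})
    return auth, hub


def rank_frontier(graph, page_text, topic, alpha=0.5):
    """Order uncrawled links by alpha * content + (1 - alpha) * authority.

    alpha is an assumed blending weight; alpha=1.0 reduces to a plain
    Best-First ordering based on the parent page's relevance.
    """
    auth, _ = hits(graph)
    top = max(auth.values(), default=0.0) or 1.0
    relevance = {u: cosine_similarity(t, topic) for u, t in page_text.items()}
    scores = {}
    for parent, outs in graph.items():
        for link in outs:
            if link in page_text:  # already crawled
                continue
            combined = alpha * relevance.get(parent, 0.0) + (1 - alpha) * auth[link] / top
            scores[link] = max(scores.get(link, 0.0), combined)
    # Highest score first; ties broken by URL for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy example: "a" is on topic, "b" is not; "x" is linked from both.
graph = {"a": ["x", "y"], "b": ["x", "z"]}
page_text = {"a": "web crawler topic", "b": "cooking recipes"}
ranked = rank_frontier(graph, page_text, "web crawler")
```

In this toy graph, `x` and `y` inherit the same content score from `a`, so a content-only ordering cannot separate them; the authority term places `x` first because two crawled pages link to it.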
Source
《华侨大学学报(自然科学版)》
CAS
PKU Core
2017, No. 2, pp. 195-200 (6 pages)
Journal of Huaqiao University (Natural Science)
Funding
Supported by the Research Fund of the Fujian Provincial Department of Science and Technology (2011H6016)