
Research of Searching Strategy in Candidate Link Introducing Topic Link Blocking Factor (cited by: 1)
Abstract: During topical web crawling, the crawler must compute a weight for each URL that appears on a page and keep filling the queue of pages to crawl so that crawling can continue. The key problem is how to find the links most relevant to the topic without causing "topic drift". A link's anchor text is short, so it cannot reliably indicate how relevant the linked page is to the topic. To address this, the paper builds on the Shark-search algorithm and introduces a weight for topic-relevant link blocks, computed from the anchor texts of the child links in each block. Comparative experiments show that the improved algorithm better differentiates the relevance scores of links on the same page, improves the crawler's precision, and mitigates the "topic drift" problem.
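The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: a link's own anchor-text relevance is combined with the relevance of the block it sits in, where the block is scored from the pooled anchor texts of its child links. The cosine similarity over raw term counts, the linear mix, the `alpha` parameter, and all function names are assumptions made for illustration. They are not the paper's definitions, and the inherited parent score and surrounding-context terms of Shark-search are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def block_weight(topic, anchors):
    """Score a link block from the pooled anchor texts of its child links."""
    pooled = [tok for anchor in anchors for tok in tokenize(anchor)]
    return cosine(tokenize(topic), pooled)


def score_links(topic, blocks, alpha=0.5):
    """Return {url: score}. `blocks` is a list of blocks, each a list of
    (url, anchor_text) pairs. `alpha` is a hypothetical mixing weight."""
    scores = {}
    for block in blocks:
        w = block_weight(topic, [anchor for _, anchor in block])
        for url, anchor in block:
            own = cosine(tokenize(topic), tokenize(anchor))
            scores[url] = alpha * own + (1 - alpha) * w
    return scores


# Two blocks on the same page; "/b" and "/d" share the uninformative anchor "more".
blocks = [
    [("/a", "web crawler tutorial"), ("/b", "more")],
    [("/c", "cooking recipes"), ("/d", "more")],
]
scores = score_links("focused web crawler", blocks)
```

In this toy page, `/b` and `/d` have identical anchor text and would tie on anchor relevance alone. The block term separates them, because `/b` sits in a block whose sibling anchors mention the topic and `/d` does not.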
Authors: ZHOU Xue (周雪); LIU Naiwen (刘乃文) (School of Information Science and Engineering, Shandong Normal University, Jinan 250014; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan 250014)
Source: Computer & Digital Engineering (《计算机与数字工程》), 2018, No. 5, pp. 874-878 (5 pages)
Keywords: page blocking; Shark-search algorithm; link structure; topic-relevant link block