Abstract
By refining and reclassifying the characteristic factors that focused Web crawlers use to predict link priority, this paper introduces two new features, harvest rate and media type, as a basis for judging relevance, and proposes an improved Best-First Search algorithm. The algorithm applies a "fine-grained" strategy to filter out irrelevant Web pages and builds a link priority formula from representative characteristic factors drawn from multiple perspectives, so as to reveal and predict the topics of links comprehensively. A small-scale experiment comparing it with three other focused search algorithms shows that the improved algorithm performs better on harvest rate and on the average number of links submitted.
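The abstract does not give the priority formula itself, so the following is only a minimal sketch of the general idea: a best-first frontier that orders links by a weighted combination of feature scores. The feature names, the weights in `WEIGHTS`, and the helpers `link_priority`, `harvest_rate`, and `Frontier` are all hypothetical and are not taken from the paper. Harvest rate is computed here in its usual sense, as the fraction of crawled pages judged relevant.

```python
import heapq
from itertools import count

# Hypothetical feature weights; the paper's actual formula is not in the abstract.
WEIGHTS = {
    "anchor_text": 0.4,
    "parent_page": 0.3,
    "harvest_rate": 0.2,
    "media_type": 0.1,
}


def link_priority(features, weights=WEIGHTS):
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def harvest_rate(relevant_pages, crawled_pages):
    """Fraction of crawled pages judged relevant (0.0 if nothing was crawled)."""
    return relevant_pages / crawled_pages if crawled_pages else 0.0


class Frontier:
    """Best-first frontier: pop() always returns the highest-priority link."""

    def __init__(self):
        self._heap = []
        self._tie = count()   # tie-breaker keeps insertion order for equal scores
        self._seen = set()    # avoids queueing the same URL twice

    def push(self, url, features):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-link_priority(features), next(self._tie), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawl loop under these assumptions, each fetched page would contribute its outgoing links to the frontier via `push`, and the next page to fetch would be chosen with `pop`; a relevance filter applied before `push` would play the role of the "fine-grained" filtering step.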
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal (北大核心)
2013, Issue 7, pp. 28-35 (8 pages)
New Technology of Library and Information Service
Keywords
Focused crawling
Search algorithm
Best-First Search algorithm
Focused crawler
Characteristic factor