Abstract
Determining the direction and depth of search is the core problem in focused crawling. To address it, this paper proposes the concept of a link's expected residual energy, together with a method for computing it. The method uses information from the current page to compute the immediate return energy of each link, and iteratively updates the link's expected residual energy using the historical return knowledge contributed by the different paths through which the same link has been reached. Using expected residual energy as both the link priority and the search depth limit, the paper designs a focused crawling algorithm based on the expected residual energy model and describes the implementation of its key modules. Experimental results show that the method has a stronger ability to discover topic-relevant websites.
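The abstract does not give the energy formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical update rule: the energy a link receives along one path is the parent's energy multiplied by `decay` plus the link's immediate return, and repeated discoveries of the same link are blended with weight `alpha`. The names `EnergyFrontier`, `decay`, `alpha`, and `min_energy` are all illustrative assumptions.

```python
import heapq


class EnergyFrontier:
    """Crawl frontier ordered by expected residual energy (highest first).

    All parameters and the update rule are illustrative assumptions,
    not the formulas from the paper.
    """

    def __init__(self, alpha=0.5, decay=0.8, min_energy=0.1):
        self.alpha = alpha            # assumed weight of a newly observed path
        self.decay = decay            # assumed energy loss per hop
        self.min_energy = min_energy  # assumed cutoff that limits search depth
        self.energy = {}              # url -> current expected residual energy
        self.visited = set()          # urls already handed to the crawler
        self._heap = []               # (-energy, url); may contain stale entries

    def offer(self, url, immediate, parent_energy):
        """Record that `url` was reached from a page with `parent_energy`."""
        if url in self.visited:
            return
        # Energy contributed by this particular path to the link.
        path_energy = self.decay * parent_energy + immediate
        old = self.energy.get(url)
        if old is None:
            new = path_energy
        else:
            # Blend the historical estimate with the newly observed path.
            new = (1 - self.alpha) * old + self.alpha * path_energy
        self.energy[url] = new
        # Links whose energy falls below the cutoff are not queued,
        # which is how the energy doubles as a depth limit in this sketch.
        if new >= self.min_energy:
            heapq.heappush(self._heap, (-new, url))

    def pop(self):
        """Return (url, energy) with the highest energy, or None if empty."""
        while self._heap:
            neg_energy, url = heapq.heappop(self._heap)
            # Skip entries superseded by a later update or already visited.
            if url in self.visited or self.energy.get(url) != -neg_energy:
                continue
            self.visited.add(url)
            return url, -neg_energy
        return None
```

In this sketch a link reached again through a more promising path moves up the queue, while a link reached only through low-energy paths is never queued, so the crawler stops descending along that branch.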
Source
Computer Engineering and Applications (《计算机工程与应用》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2015, Issue 24, pp. 120-125, 158 (7 pages in total)
Keywords
focused crawling
search direction
search depth
topic relevance
expected residual energy