Abstract
This paper proposes a new adaptive focused crawling method that incorporates information gain. The method builds a topic vector T from Wikipedia's category tree and a topic description article. During crawling it learns automatically and feeds the results back to update the weight of each concept in the topic vector space, refining the topic description. Experimental results show that the method gives the focused crawler an incremental crawling ability, that it significantly outperforms the adaptive method based on the interest ratio in terms of the sum of information, and that the Web pages it crawls are more relevant to the topic.
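The abstract does not give the update formula, so the following is only a minimal sketch of how information gain could drive a feedback update of concept weights. It assumes each crawled page is reduced to a set of concepts plus a binary relevance label, and it uses a simple blending rule. The names `information_gain`, `update_topic_vector`, and the `rate` parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Set, Tuple

Page = Tuple[Set[str], bool]  # (concepts found in the page, judged relevant?)


def entropy(pos: int, neg: int) -> float:
    """Shannon entropy in bits of a two-class split."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pages: List[Page], concept: str) -> float:
    """Information gain of 'concept occurs in page' w.r.t. the relevance label."""
    n = len(pages)
    if n == 0:
        return 0.0
    labels = [rel for _, rel in pages]
    base = entropy(sum(labels), n - sum(labels))
    with_c = [rel for concepts, rel in pages if concept in concepts]
    without_c = [rel for concepts, rel in pages if concept not in concepts]
    cond = 0.0
    for part in (with_c, without_c):
        if part:
            cond += len(part) / n * entropy(sum(part), len(part) - sum(part))
    return base - cond


def update_topic_vector(topic: Dict[str, float], pages: List[Page],
                        rate: float = 0.5) -> Dict[str, float]:
    """Assumed rule: blend each old weight with the concept's information gain."""
    return {c: (1 - rate) * w + rate * information_gain(pages, c)
            for c, w in topic.items()}


# Toy feedback batch: "crawler" separates relevant from irrelevant pages, "web" does not.
pages = [({"crawler", "web"}, True), ({"crawler"}, True),
         ({"football"}, False), ({"web", "football"}, False)]
topic = update_topic_vector({"crawler": 0.4, "web": 0.4}, pages)
```

In this toy batch the weight of "crawler" rises and the weight of "web" falls, which is the kind of refinement of the topic description the abstract describes. The actual weighting scheme in the paper may differ.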
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal
2012, Issue 2, pp. 501-503 (3 pages)
Funding
Central Universities Graduate Science and Technology Innovation Fund, Individual Project (CDJXS11180014)
Keywords
focused crawling
Wikipedia
topic description
adaptive method
information gain