Abstract
The focused crawler is a core component of a topic-specific search engine. To address shortcomings in current focused-crawler search strategies, this paper proposes a combined relevance score that integrates topic relevance with page importance to judge whether a page is on topic, and uses an adaptive immune evolutionary algorithm as the search strategy that guides the crawl. Experimental results show that the proportion of topic-relevant pages among those downloaded by this algorithm is markedly higher than with best-first search or breadth-first search, indicating higher search efficiency.
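The abstract does not say how topic relevance and page importance are combined, so the following is only a minimal sketch under stated assumptions: topic relevance is taken to be the cosine similarity between term-frequency vectors, page importance is assumed to be a value already normalized to [0, 1], and the two are merged by a weighted sum with a hypothetical weight `alpha`. None of these choices, nor the function names, come from the paper, and the adaptive immune evolutionary search itself is not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_relevance(page_terms, topic_terms, importance, alpha=0.7):
    """Hypothetical combined score (an assumption, not the paper's formula):
    alpha * topic relevance + (1 - alpha) * page importance.

    `importance` is assumed to be pre-normalized to the range [0, 1].
    """
    topic_score = cosine_similarity(Counter(page_terms), Counter(topic_terms))
    return alpha * topic_score + (1 - alpha) * importance


# Example: an on-topic page with low importance versus an
# off-topic page with high importance.
topic = ["crawler", "search", "topic"]
on_topic = combined_relevance(["crawler", "search", "engine"], topic, importance=0.2)
off_topic = combined_relevance(["recipe", "soup"], topic, importance=0.9)
print(on_topic > off_topic)
```

With a weight that favors topic relevance, a page sharing terms with the topic outranks an unrelated but highly linked page; a crawler could use such a score to order its frontier of unvisited URLs.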
Source
Journal of Heilongjiang Bayi Agricultural University (《黑龙江八一农垦大学学报》)
2012, No. 4, pp. 61-64 (4 pages)
Funding
Science and Technology Research Project of the Heilongjiang Provincial Department of Education (No. 11551015)
Keywords
focused crawler
search strategy
topic relevance
adaptive immune evolutionary algorithm