Abstract
In an era of information explosion, the results of general-purpose search engines no longer meet users' needs, and vertical search engines, which can return more accurate and comprehensive information, are attracting growing attention. The topic crawler, as the core component of a vertical search engine, has long been an active research topic in the search field. Building on the open-source web crawler Heritrix, this paper analyzes its structure and working principles, introduces a multithreading improvement, and designs a topic crawler whose performance is tested in a single-machine environment. The experimental results show that the crawler achieves a relatively high recall, which lays a solid foundation for further research and development of efficient vertical search engines.
Source
Electronic Design Engineering (《电子设计工程》)
2015, No. 6, pp. 30-32 (3 pages in total)
Funding
Zhenjiang Social Development Project (SH2013015)