Abstract
This paper introduces the basic concepts of vertical search engines and web crawlers, describes the architecture of the Heritrix system, and analyzes the Heritrix workflow. To address several shortcomings in Heritrix, it introduces the ELFHash algorithm and extends Heritrix to perform targeted, multi-threaded crawling for a telecommunications information search platform, providing an information source for building a vertical search engine for electronic information.
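The abstract names ELFHash but does not say how it is wired into Heritrix. As a minimal sketch, assuming the hash is used to spread URLs over a fixed number of work queues so that several crawl threads can proceed in parallel, the classic ELF string hash can be written as below. The class name `ElfHashSketch`, the helper `bucketFor`, and the bucket count are hypothetical illustrations, not part of the Heritrix API or of the paper's code.

```java
// Illustrative sketch only: ElfHashSketch and bucketFor are hypothetical names,
// not Heritrix classes. elfHash follows the classic 32-bit ELF string hash.
final class ElfHashSketch {

    // Computes the ELF hash of a string; the result always fits in 28 bits.
    static long elfHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift in the next character, keeping only the low 32 bits
            // to mimic unsigned 32-bit arithmetic.
            hash = ((hash << 4) + s.charAt(i)) & 0xFFFFFFFFL;
            long high = hash & 0xF0000000L;
            if (high != 0) {
                // Fold the top four bits back into the lower part.
                hash ^= high >> 24;
            }
            // Clear the top four bits.
            hash &= ~high;
        }
        return hash;
    }

    // Assumption: a URL is mapped to one of bucketCount queues by modulo.
    static int bucketFor(String url, int bucketCount) {
        return (int) (elfHash(url) % bucketCount);
    }

    public static void main(String[] args) {
        String url = "http://example.com/page1.html";
        System.out.println(url + " -> queue " + bucketFor(url, 50));
    }
}
```

Under this assumption, URLs from a single host would land in different queues instead of one host-keyed queue, which is one plausible way to keep multiple crawl threads busy on a single target site.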
Source
Journal of Chengdu University (Natural Science Edition) (《成都大学学报(自然科学版)》)
2013, Issue 2, pp. 156-158 (3 pages)
Funding
Supported by a Sichuan Province Science and Technology Basic Conditions Platform project