Abstract
A web crawler is an important component of a search engine, responsible for gathering information from the web, yet its design is not well documented in the literature. This paper introduces Chao, a high-performance parallel crawler, in detail, covering its overall architecture, major components, working process, and two core algorithms: scheduling and URL indexing. The scheduling algorithm uses double hashing, which not only achieves load balance but also avoids collisions to a large extent. The URL indexing algorithm is based on tree searching, achieving fast lookup while significantly reducing storage requirements.
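The abstract names two core techniques but gives no implementation details, so the following is a minimal illustrative sketch, not the paper's actual method. The crawler count, hash functions, overload threshold, and the character-level trie are all assumptions made for illustration: double hashing assigns a host to a crawler node and, on overload, probes with an independent second hash step; a trie indexes seen URLs so that shared prefixes (scheme, host, common paths) are stored only once.

```python
import hashlib

NUM_CRAWLERS = 8  # assumed number of parallel crawler nodes


def _h(s: str, salt: str) -> int:
    """Stable 64-bit hash of a string, derived from MD5 with a salt."""
    return int.from_bytes(hashlib.md5((salt + s).encode()).digest()[:8], "big")


def assign_crawler(url: str, load: list[int]) -> int:
    """Double-hashing scheduler sketch: the first hash of the host picks a
    node; if that node is overloaded, probe with a second, independent hash
    step (never 0), which spreads collisions more evenly than linear probing.
    The overload threshold here is an arbitrary illustrative choice."""
    host = url.split("/")[2] if "://" in url else url
    h1 = _h(host, "a") % NUM_CRAWLERS
    h2 = _h(host, "b") % (NUM_CRAWLERS - 1) + 1   # probe step in 1..NUM_CRAWLERS-1
    i = h1
    threshold = 2 * (sum(load) // NUM_CRAWLERS + 1)
    for _ in range(NUM_CRAWLERS):
        if load[i] <= threshold:                   # node not overloaded: take it
            break
        i = (i + h2) % NUM_CRAWLERS                # second-hash probe on conflict
    load[i] += 1
    return i


class UrlTrie:
    """Character-level trie for 'have we seen this URL?' lookups; common
    prefixes are stored once, trading pointer overhead for prefix sharing."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, url: str) -> bool:
        """Insert a URL; return True if it was new, False if already present."""
        node = self.root
        for ch in url:
            node = node.setdefault(ch, {})
        new = "$" not in node                      # "$" marks end-of-URL
        node["$"] = True
        return new

    def contains(self, url: str) -> bool:
        node = self.root
        for ch in url:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node
```

With a fresh, unloaded cluster the scheduler is deterministic per host, so all pages of one site go to the same node, which keeps per-host politeness policies local to one crawler.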
Source
《计算机工程与设计》
CSCD
PKU Core (北大核心)
2006, No. 24, pp. 4762-4766 (5 pages)
Computer Engineering and Design
Keywords
search engine
information gathering
crawler
parallel
retrieval