Abstract
Most Web Spiders (web crawlers) that collect information on the Internet run on a single machine, which makes it difficult to complete large-scale information gathering quickly. This paper proposes a clustered Spider system in which many Spiders run on different hosts and cooperate on the same task. Each Spider is responsible for one part of the task, and the division can be adjusted dynamically, so information gathering is greatly accelerated. The paper describes the architecture and model of such a system and introduces one implementation, ChinaWebWizard, which not only works in cluster mode but can also discover new Web sites dynamically. The system provides low-level support for search engines and is a useful reference for Web site builders and developers.
Source
Journal of Shanghai Jiaotong University (《上海交通大学学报》)
Indexed in: EI, CAS, CSCD, PKU Core (北大核心)
1998, No. 8, pp. 36-41 (6 pages)