Abstract
With advances in hardware and network technology, cluster systems have become an important way to build network services. Providing network information retrieval (IR) services, such as search engines, on a cluster system has considerable practical value. Network IR depends on retrieval data gathered from the web, a task usually performed by an information gathering (crawling) system. This paper presents a web information crawler that runs on a cluster system. The crawler uses the link relationships among WWW pages to traverse the gathering space in breadth-first (BFS) order. On each single node, concurrent multithreading improves bandwidth utilization, and an effective cooperation mechanism among the nodes of the cluster is implemented in the crawler.
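The breadth-first, multithreaded traversal described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the in-memory `LINK_GRAPH`, the `fetch_links` stand-in for downloading and parsing a page, and the `bfs_crawl` function are all hypothetical names introduced here, and the cooperation among cluster nodes is not modeled at all.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory link graph standing in for real web pages:
# each key is a page, each value is the list of pages it links to.
LINK_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its outgoing links."""
    return LINK_GRAPH.get(url, [])


def bfs_crawl(seeds, max_workers=4):
    """Visit pages level by level, fetching each level with a thread pool."""
    visited = list(dict.fromkeys(seeds))  # de-duplicated seeds, in visit order
    seen = set(visited)
    frontier = list(visited)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Fetch every page of the current level concurrently;
            # pool.map returns results in the same order as the inputs.
            results = list(pool.map(fetch_links, frontier))
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        visited.append(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited


if __name__ == "__main__":
    print(bfs_crawl(["a"]))
```

Processing one level at a time keeps the visit order strictly breadth-first while still letting several downloads overlap, which is the property that lets a single node make better use of its bandwidth.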
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
PKU Core Journal (北大核心)
2003, No. 8, pp. 1413-1417 (5 pages)
Funding
Supported by the National 863 Program (863-306-ZT01-03-1)
Supported by the National Natural Science Foundation of China (60131160743)
Keywords
network information gathering
cluster system
BFS
multithreading