Abstract
In view of the ever-growing number of Web pages on the Internet, a Web spider system named Distributed Web Spider (DWS), designed and implemented with distributed technology, is proposed. Acting as the front end of a search engine, it downloads Web pages quickly and efficiently to obtain a more complete image of the entire Internet. DWS sets up a central control node to coordinate the behavior of the individual Web spiders, uses a breadth-first crawling policy to obtain high-quality pages, caches Domain Name System (DNS) lookups to speed up access to Web servers, increases the number of parallel download threads to raise download speed, and allows Web spider nodes or sub-central control nodes to be added dynamically, giving the system strong flexibility and scalability.
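The crawling mechanisms summarized in the abstract (a breadth-first frontier, a local DNS cache, and parallel download threads) can be illustrated with a minimal sketch. The code below is only an illustration under assumptions, not the authors' DWS implementation: the seed URL, thread count, and page limit are hypothetical, and the central control node and the dynamic joining of spider nodes are omitted.

import re
import socket
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

dns_cache = {}  # hostname -> IP, so each host name is resolved only once

def resolve(hostname):
    # Consult the local DNS cache before asking the resolver (illustrates DNS caching).
    if hostname not in dns_cache:
        dns_cache[hostname] = socket.gethostbyname(hostname)
    return dns_cache[hostname]

def fetch(url):
    # Download one page and return the absolute URLs of its out-links.
    try:
        resolve(urlparse(url).hostname)  # warm the cache; a real spider would reuse this IP
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    except Exception:
        return []
    return [urljoin(url, href) for href in re.findall(r'href="([^"#]+)"', html)]

def bfs_crawl(seeds, max_pages=100, workers=8):
    # FIFO frontier gives breadth-first order; a thread pool downloads batches in parallel.
    frontier, seen = deque(seeds), set(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(seen) < max_pages:
            batch = [frontier.popleft() for _ in range(min(workers, len(frontier)))]
            for links in pool.map(fetch, batch):
                for link in links:
                    if link not in seen and len(seen) < max_pages:
                        seen.add(link)
                        frontier.append(link)
    return seen

if __name__ == "__main__":
    # Hypothetical seed; in DWS the central control node would distribute URLs to spiders.
    print(len(bfs_crawl(["http://example.com/"], max_pages=20)), "URLs discovered")

In the full DWS architecture, the frontier and the assignment of URLs to spiders would be coordinated by the central control node rather than held in a single process.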
Source
《计算机应用》
CSCD
Peking University Core Journal
2010, No. 12, pp. 316-318 (3 pages)
Journal of Computer Applications
Funding
Science and Technology Program of Sichuan Province (2008GZ0003)
Key Technology Research Project of Sichuan Province (07GG006-019)
Keywords
Distributed Web Spider (DWS)
page quality
search engine
distributed computing