GNP-based scheduling strategy for distributed crawling (cited by: 5)
Abstract: To address task scheduling and load balancing in distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling together with a load-balancing method. Estimating network distance, rather than measuring it on a large scale, both improves the system's response time and reduces the load the system places on the WAN. Crawler nodes were deployed across the WAN to build a distributed search engine, and online experiments applying the scheduling strategy showed a substantial improvement in system performance.
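The idea in the abstract, estimating network distance from coordinates instead of measuring it, can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each crawler node and target host already has GNP-style coordinates, uses Euclidean distance as the estimate, and applies a simple per-node capacity cap as a stand-in for load balancing. The names `estimated_distance`, `assign_hosts`, and `capacity` are hypothetical.

```python
import math


def estimated_distance(a, b):
    """Estimate network distance as the Euclidean distance between two
    coordinate vectors (assumed to come from a GNP-style embedding)."""
    return math.dist(a, b)


def assign_hosts(crawlers, hosts, capacity):
    """Greedily assign each host to the nearest crawler that still has
    spare capacity. `capacity` is a hypothetical per-crawler host limit."""
    load = {name: 0 for name in crawlers}
    assignment = {}
    for host, host_coord in hosts.items():
        # Only crawlers below the cap are eligible.
        candidates = [c for c in crawlers if load[c] < capacity]
        if not candidates:
            raise ValueError("total crawler capacity exceeded")
        best = min(
            candidates,
            key=lambda c: estimated_distance(crawlers[c], host_coord),
        )
        assignment[host] = best
        load[best] += 1
    return assignment


# Illustrative coordinates only; real values would come from the embedding.
crawlers = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
hosts = {"h1": (1.0, 0.0), "h2": (2.0, 0.0), "h3": (3.0, 0.0)}
result = assign_hosts(crawlers, hosts, capacity=2)
```

In this sketch no probe is sent between crawlers and hosts at scheduling time; the only inputs are the precomputed coordinates, which is what lets distance estimation stand in for wide-area measurement.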
Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2010, No. 2, pp. 446-449 (4 pages)
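Funding: National Key Basic Research Program of China ("973" Program) (G2005CB321806); National Natural Science Foundation of China (60703014); Specialized Research Fund for the Doctoral Program of Higher Education (20070213044); Harbin Institute of Technology Outstanding Young Teachers Training Program (HITQNJS.2007.034)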
Keywords: distributed crawling; scheduling strategies; load balancing; network measurement; global network positioning (GNP)