
Research on Agent Collaboration and Web Partition in WAN-Based Distributed Web Crawlers
Abstract This paper studies agent collaboration and Web partition, the two core problems of distributed Web crawling in a wide area network (WAN) environment. First, it proposes a distributed Web crawler system model based on a consultant service, gives a detailed system design and an agent collaboration algorithm framework, and shows by derivation that involving the consultant service in agent collaboration lets the distributed crawler system bear a relatively small network load. The new method has a lower communication overhead than crawling systems based on a central coordinator and exploits location proximity better than those based on a Distributed Hash Table (DHT). Second, it introduces the concept of Web partition for distributed Web crawlers and discusses the classification and implementation of Web partition in terms of the selection of the partition unit and the partition strategy. Experiments in a real Internet environment compare and evaluate several Web partition methods. They show that WAN-based distributed Web crawling systems perform better than LAN-based ones, and that the interconnectivity of Internet service providers affects system performance more than geographical location does.
Source Chinese High Technology Letters (《高技术通讯》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list, 2010, No. 3, pp. 239-245 (7 pages)
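Funding Supported by the National High Technology Research and Development Program of China (863 Program, 2009AA01Z437), the National Basic Research Program of China (973 Program, G2005CB321806), the National Natural Science Foundation of China (60703014), the Specialized Research Fund for the Doctoral Program of Higher Education (20070213044), and the Harbin Institute of Technology Outstanding Young Teachers Training Program (HITQNJS.2007.034)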
Keywords distributed Web crawler, agent collaboration, Web partition, consultant service

