Study on Parallel Crawler Based on Pipeline Load Balancing Model

Cited by: 2
Abstract: To address the load balancing problem that arises among modules when a parallel crawler system executes multiple tasks concurrently, this paper proposes the Pipeline Load Balancing (PLB) model. PLB abstracts the different tasks as independent modules and aims to equalize the processing speed of the modules. A PLB-based parallel crawler is implemented with multithreading, and the number of threads is adjusted dynamically according to how long threads sleep and how buffer occupancy changes. Experimental results show that the method achieves good running efficiency and stability.
Source: Computer Engineering (《计算机工程》), 2009, No. 2, pp. 34-36 (3 pages). Indexed in: CAS, CSCD, Peking University Core Journals.
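Funding: National Natural Science Foundation of China project "Research on Key Technologies of Focused Crawlers Based on Incremental Learning" (60603066)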
Keywords: crawler; parallel; pipeline; load balancing
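
The abstract describes adjusting thread counts from thread sleep time and buffer changes but gives no concrete rule. The following is a minimal sketch of one way such a rule could look, not the authors' algorithm. The stage names (fetch, parse, store), the StageStats fields, and all thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageStats:
    """Snapshot of one pipeline stage (all fields are hypothetical)."""
    threads: int        # worker threads currently assigned to the stage
    input_fill: float   # occupancy of the stage's input buffer, 0.0 to 1.0
    idle_ratio: float   # fraction of the last interval its threads spent sleeping


def adjust_threads(stats, high_water=0.8, low_water=0.2,
                   idle_limit=0.5, min_threads=1, max_threads=32):
    """Return the new thread count for one stage.

    Assumed rule, not taken from the paper: a filling input buffer with
    busy threads means the stage is slower than its producer, so add a
    thread; a draining buffer with mostly sleeping threads means the stage
    is faster than needed, so remove one. Otherwise leave it unchanged.
    """
    if stats.input_fill >= high_water and stats.idle_ratio < idle_limit:
        return min(stats.threads + 1, max_threads)
    if stats.input_fill <= low_water and stats.idle_ratio >= idle_limit:
        return max(stats.threads - 1, min_threads)
    return stats.threads


def rebalance(pipeline):
    """Apply adjust_threads to every stage; returns {stage_name: new_count}."""
    return {name: adjust_threads(stats) for name, stats in pipeline.items()}


if __name__ == "__main__":
    # Hypothetical three-stage crawler pipeline.
    pipeline = {
        "fetch": StageStats(threads=8, input_fill=0.9, idle_ratio=0.1),
        "parse": StageStats(threads=4, input_fill=0.1, idle_ratio=0.7),
        "store": StageStats(threads=2, input_fill=0.5, idle_ratio=0.3),
    }
    print(rebalance(pipeline))
```

Calling rebalance periodically moves threads toward the slowest stage one step at a time, which is one plausible reading of equalizing the processing speed of the modules. A real crawler would measure input_fill and idle_ratio from its own buffers and worker threads.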