Survey of High-performance Web Crawler (cited by: 89)
Abstract: Web crawlers are programs that automatically download resources from the Web, and they are one of the basic components of a search engine. This paper systematically introduces the working principles and current state of development of web crawlers, and describes in detail the system architecture of a high-performance, scalable, distributed web crawler together with the key problems such a system faces.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2009, No. 8, pp. 26-29, 53 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (grants 60573057 and 90718017)
Keywords: web crawler; high performance; scalability; distributed