
Research on Improvement of Distributed Crawler Based on Nutch

Cited by: 7
Abstract: Web crawlers are an important part of Internet services, providing search and indexing for the Internet as a whole, for enterprise intranets, and for large portal sites. To address the efficiency problems of existing crawling methods, this paper describes the workflow and mechanisms of the Nutch distributed crawler and, based on an analysis of Nutch running on Hadoop, optimizes three aspects: the parameters of the Nutch distributed crawler, Hadoop's I/O model, and the small-files problem of the Nutch distributed crawler. Experimental results show that the optimized crawler fetches web resources more effectively and improves crawling efficiency considerably.
Source: Wireless Communication Technology (《无线通信技术》), 2014, No. 3, pp. 44-47, 52 (5 pages)
Keywords: Nutch; Hadoop; distributed file system; distributed crawler

