Abstract
Web crawlers are an essential component of Internet services, providing search and indexing for the Internet, intranets, and large portals. To address the efficiency problems of existing crawling approaches, this paper describes the workflow and mechanism of the Nutch distributed crawler. By analyzing the Nutch crawler running on Hadoop, optimizations are made in three areas: the configuration parameters of the Nutch distributed crawler, Hadoop's I/O model, and the small-file problem of the Nutch distributed crawler. Experimental results show that the optimized crawler fetches web resources more effectively and achieves a substantial improvement in crawling efficiency over existing methods.
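One of the three optimization areas named in the abstract is the small-file problem of the Nutch distributed crawler on HDFS. A common general mitigation is to pack many small files into a single Hadoop SequenceFile so the NameNode tracks one large file instead of thousands of tiny ones. The sketch below illustrates only that general technique; the class name SmallFilePacker, the input/output paths, and the Text/BytesWritable key-value layout are illustrative assumptions and not the paper's actual implementation.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Minimal sketch (not the paper's code): consolidate a directory of small
// crawl output files into one SequenceFile, keyed by the original file name.
public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path(args[0]); // directory holding many small files (assumed layout)
        Path packed = new Path(args[1]);   // single SequenceFile to produce

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                byte[] content = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    IOUtils.readFully(in, content, 0, content.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                // Key: original file name; value: raw file bytes.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

In this kind of consolidation the file name is kept as the key so individual records remain addressable after packing; whether the paper uses SequenceFiles, HAR files, or another merging strategy is not stated in the abstract.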
Source
《无线通信技术》 (Wireless Communication Technology)
2014, No. 3, pp. 44-47, 52 (5 pages)