Abstract
With the rapid development of the Internet industry and information technology, large companies such as Google, IBM, and Apache have invested in cloud computing; among these efforts, the Hadoop platform developed by Apache is a highly user-friendly open-source cloud computing framework. This paper designs and implements a distributed web crawler based on the Hadoop framework in order to carry out large-scale data collection. By employing the Map/Reduce distributed computing framework and the distributed file system, it addresses the low efficiency and poor scalability of single-machine crawlers, thereby increasing the speed of web page crawling and expanding the scale of the crawl.
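To make the approach described in the abstract concrete, the following is a minimal sketch (not the authors' code) of how one fetch round of a distributed crawler can be expressed as a Hadoop Map/Reduce job: the mapper downloads each input URL and emits the links it discovers, and the reducer deduplicates them to form the next crawl frontier. The class names, the link-extraction regex, and the HDFS paths are illustrative assumptions, not details taken from the paper.

// Illustrative sketch of one Map/Reduce crawl round; class and path names are hypothetical.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlFetchJob {

    /** Mapper: each input line is a URL; fetch the page and emit the links found in it. */
    public static class FetchMapper extends Mapper<Object, Text, Text, NullWritable> {
        private static final Pattern HREF =
                Pattern.compile("href=[\"']?(http[^\"'>\\s]+)", Pattern.CASE_INSENSITIVE);

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            if (url.isEmpty()) return;
            try (InputStream in = new URL(url).openStream()) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) buf.write(chunk, 0, n);
                String html = buf.toString("UTF-8");
                // A full crawler would also store the page body to HDFS here.
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    context.write(new Text(m.group(1)), NullWritable.get());
                }
            } catch (Exception e) {
                // Skip unreachable pages; a production crawler would record the failure.
            }
        }
    }

    /** Reducer: identical URLs arrive grouped, so writing each key once deduplicates the frontier. */
    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text url, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(url, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "crawl-fetch");
        job.setJarByClass(CrawlFetchJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // seed/frontier URL list on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // newly discovered URLs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running such a job repeatedly, feeding each round's output back in as the next round's input, is one common way to spread crawling work across a Hadoop cluster while keeping both the URL frontier and the fetched pages in the distributed file system.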
Source
Computer Knowledge and Technology (《电脑知识与技术(过刊)》)
2015, No. 3X, pp. 36-38 (3 pages)