Abstract
With the rapid growth of the Internet, search engines play an increasingly important role in Internet search services. The web crawler is a key component of a search engine system: it collects web pages from the Internet, and these pages are used to build the index that supports the search engine. Given the rapid expansion of online information, a centralized, single-machine crawler can no longer cope with the current scale of the Internet, so high-performance distributed web crawler systems have become a research focus in the field of information collection. This paper studies the principles of web crawlers, the design of distributed architectures, and the key modules of a web crawler, along with its bottlenecks and their solutions.
Source
Agriculture Network Information (《农业网络信息》)
2017, No. 8, pp. 12-14 (3 pages)