Abstract
Traditional single-machine web crawler technology lags behind the application scenarios of massive web data and suffers from many shortcomings, whereas big-data distributed technologies such as Hadoop and Spark can store and process massive web information resources efficiently. This paper therefore designs and applies a Hadoop-based distributed web crawler system, covering both its system architecture and its workflow. Following a modular design approach and building on the key technologies of distributed web crawling, the system's functional modules are designed to be scalable and highly available, making the system well suited to application scenarios involving massive web information resources.
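The paper itself does not include source code. As an illustration of the kind of work partitioning such a distributed crawler relies on, the following is a minimal sketch (not the author's implementation) of hash-based URL assignment, where each URL's host is mapped to one of several crawler workers; the function name `assign_url` and the choice of hashing the host rather than the full URL are assumptions for this example.

```python
import hashlib

def assign_url(url: str, num_nodes: int) -> int:
    """Map a URL's host to one of num_nodes crawler workers.

    Hashing the host (not the full URL) keeps all pages of a site
    on the same worker, which simplifies per-site politeness limits.
    """
    # Extract the host part, e.g. "example.com" from "https://example.com/a".
    host = url.split("/")[2] if "://" in url else url
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# All URLs from the same host land on the same worker,
# while different hosts are spread across the cluster.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/",
]
workers = [assign_url(u, 4) for u in urls]
```

Because the mapping depends only on the host and the node count, any node can recompute the assignment independently, which is one simple way to achieve the scalability the abstract describes.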
Author
WU Yupeng (Fuzhou Melbourne Institute of Technology, Fuzhou Fujian 350000, China)
Source
Information & Computer (《信息与电脑》), 2021, No. 19, pp. 87-89 (3 pages)