Abstract
With the arrival of the mobile Internet, big data, and artificial intelligence, data plays an increasingly important role across the Internet ecosystem, and the volume of available data strongly affects both big data analysis and what AI models ultimately learn. Yet most enterprises worldwide struggle with incomplete data or data sets that are too small; for start-ups and small and micro enterprises in particular, this has become a constraint on their survival and growth. Crawlers that can collect large amounts of data around the clock are therefore especially needed, which makes in-depth study of web crawlers worthwhile. This paper studies web crawlers using Scrapy, an asynchronous crawling framework built on Twisted, and implements a distributed strategy for crawling web page data and file text data.
Authors
张捷
郝建维
李欢欢
ZHANG Jie; HAO Jianwei; LI Huanhuan (Aostar Information Technologies Co., Ltd., Chengdu, Sichuan Province, 610041 China)
Source
《科技创新导报》
2020, Issue 21, pp. 149-153 (5 pages)
Science and Technology Innovation Herald