
Research on Three Web Crawler Technologies Based on Python (cited by: 12)
Abstract: The many web crawler technologies available make selection difficult, and the choice affects crawling efficiency and accuracy. This paper analyzes three mainstream Python-based crawler technologies: Requests, Scrapy, and Selenium. First, the development environment is installed and configured, and single-threaded and multithreaded crawlers are developed. Second, the crawlers fetch 10, 100, 500, and 1,000 pages of resume data from the "站长之家" (Home of Webmasters) website, and the crawling time is measured. Finally, the ability to get past anti-crawler mechanisms is verified by crawling data from "中国裁判文书网" (China Judgements Online). The experimental results show that a Requests crawler can fetch data with a single line of code and is flexible to develop and customize; the Scrapy crawler averages 0.02 s per page and has outstanding concurrency performance; and the Selenium crawler is strong at getting past website anti-crawler mechanisms. Crawler development should therefore weigh business requirements against the technical characteristics of each tool in order to achieve the best data-collection results.
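The abstract's comparison of single-threaded and multithreaded crawling, and its per-page timing figure, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the worker count, the `example.com` URLs, and the `fake_fetch` stand-in are all assumptions made for illustration; only `requests.get` and the standard-library `ThreadPoolExecutor` are real APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_requests(url):
    """Fetch one page with Requests (the kind of one-line call the abstract mentions)."""
    import requests  # third-party; imported lazily so the helpers below work without it
    return requests.get(url, timeout=10).text


def crawl_sequential(urls, fetch):
    """Single-threaded crawl: fetch pages one after another."""
    return [fetch(u) for u in urls]


def crawl_threaded(urls, fetch, workers=8):
    """Multithreaded crawl: fetch pages concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def average_seconds_per_page(crawl, urls, fetch):
    """Time one crawl run; return (pages, average seconds per page)."""
    start = time.perf_counter()
    pages = crawl(urls, fetch)
    elapsed = time.perf_counter() - start
    return pages, elapsed / len(urls)


if __name__ == "__main__":
    # Hypothetical stand-in for a network fetch so the sketch runs offline.
    def fake_fetch(url):
        time.sleep(0.01)  # simulated network latency (assumed value)
        return "<html>" + url + "</html>"

    # Placeholder URLs, not the sites used in the paper.
    urls = ["https://example.com/page/%d" % i for i in range(1, 11)]
    for name, crawl in (("sequential", crawl_sequential),
                        ("threaded", crawl_threaded)):
        _, avg = average_seconds_per_page(crawl, urls, fake_fetch)
        print("%s: %.4f s/page" % (name, avg))
```

Passing `fetch_with_requests` in place of `fake_fetch` would time real HTTP requests. The resulting figures depend on the network and the target site, so they are not expected to reproduce the paper's measurements.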
Authors: YANG Jian (杨健); CHEN Wei (陈伟) (Zhuji Public Security Bureau, Shaoxing 311800, China)
Affiliation: Zhuji Public Security Bureau (诸暨市公安局)
Source: Software Engineering (《软件工程》), 2023, No. 2, pp. 24-27, 19 (5 pages in total)
Keywords: web crawler; Requests; Scrapy; Selenium