
Research on Anti-Crawler Strategies for Web Crawlers (网络爬虫反爬策略研究)  Cited by: 9

Abstract: A web crawler sends a large number of requests to its target site, so by its nature it consumes a considerable amount of the site's server resources. For a small or medium-sized site with limited server capacity this load can be substantial and may even bring the site down. In addition, some websites do not want their content to be harvested easily; for an e-commerce site, for example, transaction figures are core data of the product, so certain measures are taken to protect such sensitive data. As a result, many websites have added anti-crawling mechanisms, such as User-Agent + Referer detection, account login, and Cookie verification. This paper discusses several mainstream methods for preventing a crawler from being blocked by the target site's server, so that the crawler can keep running normally.
Institution: Communication University of China (中国传媒大学)
Source: Technology Innovation and Application (《科技创新与应用》), 2019, No. 15, pp. 137-138, 140 (3 pages)
Keywords: web crawler; anti-crawler; crawling strategy
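
The abstract names three anti-crawling mechanisms (User-Agent + Referer detection, account login, and Cookie verification) and states that the paper discusses mainstream ways for a crawler to avoid being blocked. The following is a minimal Python sketch, using the requests library, of what such counter-measures typically look like; it is not taken from the paper, and every URL, form field, and credential in it is a hypothetical placeholder.

    # Sketch of common crawler counter-measures: rotated User-Agent, plausible
    # Referer, session cookies obtained via login, and randomized request delays.
    import random
    import time

    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
    ]

    session = requests.Session()  # persists cookies across requests (Cookie verification)

    def fetch(url: str, referer: str) -> requests.Response:
        """Fetch a page with a rotated User-Agent and a plausible Referer header."""
        headers = {
            "User-Agent": random.choice(USER_AGENTS),
            "Referer": referer,
        }
        resp = session.get(url, headers=headers, timeout=10)
        # Randomized delay reduces server load and the chance of rate-limit bans.
        time.sleep(random.uniform(1.0, 3.0))
        return resp

    # Hypothetical login step so later requests carry an authenticated cookie.
    session.post("https://example.com/login",
                 data={"username": "user", "password": "pass"},
                 headers={"User-Agent": USER_AGENTS[0]})

    page = fetch("https://example.com/data", referer="https://example.com/")
    print(page.status_code)

Whether these techniques suffice depends on the target site; the paper's point is that headers, login state, and request pacing are the levers a crawler has against the detection mechanisms listed above.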
