Abstract
A web crawler sends a large number of requests to its target site while it works, and this mode of operation inevitably consumes a substantial amount of the target site's server resources. For a small or medium-sized site with limited server capacity, the resulting load can be enormous and may even cause the site to crash. In addition, some websites do not want their content to be harvested easily; for example, the transaction figures of an e-commerce site are core data of an Internet product, so such sites take measures to protect sensitive data. As a result, many websites have added anti-crawling mechanisms, such as User-Agent and Referer checking, account login, and Cookie verification. This paper discusses several mainstream methods for preventing a crawler from being blocked by the target site's server, so as to keep the crawler running normally.
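As a concrete illustration of the header checks mentioned in the abstract, the following is a minimal sketch (not taken from the paper; the URLs and header values are placeholders) of how a crawler might send a browser-like User-Agent and Referer and carry cookies across requests using Python's requests library:

```python
# Minimal sketch (not from the paper): a request that carries a
# browser-like User-Agent and a Referer header, and reuses cookies
# via a Session, so that simple header/cookie checks do not reject it.
# The URLs below are placeholders, not addresses used in the article.
import requests

session = requests.Session()  # keeps cookies between requests
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Referer": "https://example.com/",
}
response = session.get("https://example.com/page", headers=headers, timeout=10)
print(response.status_code, len(response.text))
```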
Source
Technology Innovation and Application (《科技创新与应用》), 2019, No. 15, pp. 137-138, 140 (3 pages in total)
Keywords
Web crawler
anti-crawler
crawling strategy