
Research on Detection Methods for Weibo Data Crawlers (微博数据爬虫的检测方法研究)
Abstract: This paper proposes a countermeasure against common distributed web crawlers. It reviews existing crawler detection methods and analyzes how distributed crawlers bypass them. The proposed method detects distributed crawlers by exploiting the property that web traffic follows a power-law distribution: when web pages are ranked by number of requests, most requests concentrate on the most frequently requested pages, while a long tail of pages is rarely requested by ordinary users. Crawlers do request these pages, because their algorithms iteratively parse each page and issue further requests in order to collect every item they encounter. The method therefore assumes that IP addresses that frequently request pages in the long-tail region of the power-law distribution can be classified as crawler nodes. Experiments on network traffic data show that the method effectively identifies distributed crawlers with 0.02% false positives.
Author: Huang Zhigao (黄志高), School of Physics and Information Engineering, Quanzhou Normal University, Quanzhou 362000, China
Source: Modern Computer (《现代计算机》), 2023, No. 16, pp. 64-68 (5 pages)
Funding: 2018 Fujian Provincial Education and Research Project for Young and Middle-aged Teachers (JT180381).
Keywords: distributed web crawler; long-tail domain value; crawler detection
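
The long-tail heuristic described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the function name `find_crawler_ips`, the input format (a list of `(ip, url)` pairs), and the thresholds `head_share`, `tail_ratio`, and `min_requests` are all assumptions chosen for illustration, since the abstract does not state how the long-tail boundary or the flagging threshold is set.

```python
from collections import Counter

def find_crawler_ips(requests, head_share=0.8, tail_ratio=0.5, min_requests=5):
    """Flag IPs that mostly request long-tail pages.

    requests: iterable of (ip, url) pairs (hypothetical log format).
    head_share, tail_ratio, min_requests: illustrative thresholds,
    not values taken from the paper.
    """
    requests = list(requests)
    page_counts = Counter(url for _, url in requests)
    total = sum(page_counts.values())

    # Rank pages by request count; the most popular pages that together
    # cover `head_share` of all traffic form the "head". The rest is the tail.
    head_pages = set()
    covered = 0
    for url, count in page_counts.most_common():
        if covered >= head_share * total:
            break
        head_pages.add(url)
        covered += count

    # Count, per IP, total requests and requests that hit tail pages.
    per_ip_total = Counter()
    per_ip_tail = Counter()
    for ip, url in requests:
        per_ip_total[ip] += 1
        if url not in head_pages:
            per_ip_tail[ip] += 1

    # An IP is flagged when enough of its requests fall in the long tail.
    return {
        ip
        for ip, n in per_ip_total.items()
        if n >= min_requests and per_ip_tail[ip] / n >= tail_ratio
    }

# Synthetic example: 20 ordinary users hit two popular pages,
# while 2 crawler nodes each walk 10 rarely requested item pages.
log = []
for i in range(20):
    log += [(f"10.0.0.{i}", "/home")] * 5 + [(f"10.0.0.{i}", "/hot")] * 5
for c in (1, 2):
    log += [(f"192.0.2.{c}", f"/item/{c}-{k}") for k in range(10)]

print(sorted(find_crawler_ips(log)))
```

In this synthetic log the two popular pages account for the head of the distribution, so only the two addresses that request the rare item pages are flagged. On real traffic the thresholds would need tuning against labeled data to control the false-positive rate.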