Abstract
Bloom-filter deduplication is currently a fairly effective technique for incremental crawling of website information, but its false-positive rate rises as the number of stored elements grows. To address this, we design and implement an incremental crawling method based on window comparison. The method crawls a limited length of data at a time, in the order in which the website presents it, and appends the data to a queue in that same order. A comparison window is set at the end of the queue, and the degree of overlap between the data in the window and the data already crawled determines whether crawling should stop. Experiments show that the method reduces crawling cost when incrementally crawling websites whose information is not strictly sorted by time.
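The stopping rule described in the abstract can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the function name `incremental_crawl`, the `fetch_batch` callback, the parameters `window_size`, `stop_ratio`, and `max_batches`, and the use of a plain Python set for the already-crawled data are all assumptions made for the example.

```python
from typing import Callable, Hashable, List, Set


def incremental_crawl(
    fetch_batch: Callable[[int], List[Hashable]],  # hypothetical: returns batch i in display order
    seen: Set[Hashable],                           # identifiers of already-crawled items (assumed a set)
    window_size: int = 5,                          # assumed length of the comparison window
    stop_ratio: float = 1.0,                       # assumed overlap ratio that ends the crawl
    max_batches: int = 100,                        # safety bound, not part of the described method
) -> List[Hashable]:
    """Return newly found items, stopping once the queue's tail is mostly old data."""
    queue: List[Hashable] = []       # all fetched items, in the site's display order
    new_items: List[Hashable] = []   # items not crawled before, in display order
    for batch_index in range(max_batches):
        batch = fetch_batch(batch_index)
        if not batch:
            break                    # the site has no more data
        for item in batch:
            queue.append(item)
            if item not in seen and item not in new_items:
                new_items.append(item)
        window = queue[-window_size:]                     # comparison window at the queue's end
        duplicates = sum(1 for item in window if item in seen)
        if duplicates / len(window) >= stop_ratio:
            break                    # the window overlaps enough with old data: stop
    seen.update(new_items)
    return new_items


# Usage: "d" was crawled earlier but is displayed between two new items.
pages = [["a", "d", "b"], ["c", "e", "f"], ["g", "h", "i"]]
already_crawled = {"d", "e", "f", "g"}
found = incremental_crawl(
    lambda i: pages[i] if i < len(pages) else [],
    already_crawled,
    window_size=2,
)
print(found)
```

In this toy run, a crawler that stopped at the first previously seen item would halt at "d" and miss "b" and "c". Under the assumed settings, the window rule continues until the last two queued items are both old.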
Source
Network New Media Technology (《网络新媒体技术》)
2017, Issue 4, pp. 24-27 (4 pages)
Funding
Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA06040602)
Keywords
incremental crawling
crawling efficiency
hash
Bloom filter