Abstract
As the volume of agricultural news data grows, agriculture-focused incremental crawlers have become an important means of collecting agricultural information. An incremental crawler fetches only the content that has changed since the previous crawl and discards content it has already crawled. Based on the characteristics of agricultural news data, this paper proposes an incremental deduplication method for agricultural news that uses a Redis-based Bloom filter, avoiding the memory exhaustion caused by very large persistence files. Experiments show that, as the amount of crawled agricultural information increases, the method prevents memory from being exhausted while effectively improving the efficiency of incremental crawling, and that it has good application value for incremental information crawling.
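The abstract does not give implementation details, so the following is only a minimal sketch of how a Redis-backed Bloom filter could deduplicate URLs in an incremental crawl. It is not the authors' code. The key name `news:bloom`, the bit-array size, the number of hash functions, the double-hashing scheme, and the helper `filter_new` are all assumptions made for illustration. The `InMemoryBits` class is a hypothetical stand-in that mimics the Redis `SETBIT` and `GETBIT` commands so the example runs without a server.

```python
import hashlib


class InMemoryBits:
    """Hypothetical stand-in for a Redis client: mimics SETBIT/GETBIT."""

    def __init__(self):
        self._bits = set()  # (key name, bit offset) pairs that are set to 1

    def setbit(self, name, offset, value):
        previous = int((name, offset) in self._bits)
        if value:
            self._bits.add((name, offset))
        else:
            self._bits.discard((name, offset))
        return previous  # like Redis, return the bit's previous value

    def getbit(self, name, offset):
        return int((name, offset) in self._bits)


class BloomFilter:
    """Bloom filter stored in one bitmap key of a SETBIT/GETBIT-style client."""

    def __init__(self, client, key="news:bloom", size_bits=1 << 20, num_hashes=5):
        # key, size_bits and num_hashes are illustrative assumptions.
        self.client = client
        self.key = key
        self.size_bits = size_bits
        self.num_hashes = num_hashes

    def _offsets(self, item):
        # Double hashing: derive num_hashes bit offsets from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add(self, item):
        for offset in self._offsets(item):
            self.client.setbit(self.key, offset, 1)

    def __contains__(self, item):
        # True means "probably seen" (false positives are possible);
        # False means "definitely not seen".
        return all(self.client.getbit(self.key, o) for o in self._offsets(item))


def filter_new(urls, seen):
    """Return the URLs not yet recorded in `seen`, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = BloomFilter(InMemoryBits())
    print(filter_new(["http://example.com/a", "http://example.com/b"], seen))
    print(filter_new(["http://example.com/b", "http://example.com/c"], seen))
```

Because the filter stores only a fixed-size bitmap instead of every crawled URL, its memory footprint stays constant as the crawl grows, at the cost of a small false-positive rate (a new page may occasionally be skipped). With the redis-py package, a `redis.Redis` client could presumably be passed in place of `InMemoryBits`, since it exposes `setbit` and `getbit` with the same argument order.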
Authors
杨广召
曹叶
朱航飞
王家硕
朱家玮
YANG Guangzhao; CAO Ye; ZHU Hangfei; WANG Jiashuo; ZHU Jiawei (School of Information Engineering, Tarim University, Alaer, Xinjiang 843300)
Source
《现代农业科技》
2021, Issue 2, pp. 259-260, 264 (3 pages in total)
Modern Agricultural Science and Technology
Funding
Demonstration and Promotion of an Information System for Jujube Production Management in Southern Xinjiang (19/1117831).
Keywords
agricultural news
incremental crawler
deduplication