Abstract
With the development of the Internet, the volume of online information has grown explosively. This abundance gives people more sources of information, but it also makes searching for useful information far more burdensome. According to the latest figures from November 2015, the number of active websites on the Internet has reached 902,997,800. How to eliminate duplicate information on the Internet more effectively, so that people can easily find what they need, has therefore become an important problem for the modern Internet. The Bloom filter, proposed in 1970, is a deduplication algorithm that consists of a very long binary vector and a series of random mapping (hash) functions. It offers fast queries and low memory usage, but it has a certain false positive rate. To address this problem, this paper designs a multidimensional Bloom filter algorithm that effectively reduces the false positive rate of the traditional Bloom filter, and experiments are conducted to test and compare the false positive rate and query speed.
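For orientation, the classic structure the abstract refers to (one long bit vector plus several hash functions) can be sketched as follows. This is only a minimal illustration of the traditional Bloom filter, not the multidimensional variant proposed in the paper, whose details the abstract does not give. The names `BloomFilter` and `expected_false_positive_rate` are hypothetical, and deriving the hash positions from a single SHA-256 digest by double hashing is an assumption made for brevity.

```python
import hashlib
import math


class BloomFilter:
    """Classic Bloom filter: an m-bit vector plus k hash functions (illustrative)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by double hashing
        # two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set the bit at each of the k positions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k for a classic Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


if __name__ == "__main__":
    seen = BloomFilter(num_bits=1 << 16, num_hashes=4)
    for url in ("http://example.com/a", "http://example.com/b"):
        seen.add(url)
    print("http://example.com/a" in seen)  # an added item is always reported present
    print(expected_false_positive_rate(1 << 16, 4, 2))
```

A lookup touches only k bits, which is where the fast query speed and small memory footprint come from. The cost is that bits set by different items can overlap, so a membership test may report an item that was never added. That false positive rate is what the paper aims to reduce.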
Source
Software (《软件》)
2015, No. 12, pp. 166-170 (5 pages)
Keywords
Algorithm theory
Multidimensional Bloom filter
Bloom filter
Webpage deduplication