期刊文献+

基于布隆过滤器的海量数据查询技术的优化与应用 被引量:2

The Query Optimization and Application of BloomFilter for Large Dataset
下载PDF
导出
摘要 通过一个用户行为数据分析的案例,说明了布隆过滤器的原理和应用场景。在案例中,需要使用MapReduce框架在海量数据中筛选出付费用户相关的数据,布隆过滤器算法提供了一种快速、有效的实现方法。简述了使用MongoDB内存数据库存储付费用户的解决方案,其搜索效率高,但随着数据量的增加,一对多并发查询给服务端带来的压力会越来越大;如果使用分布式缓存的方法,这时为一对一存取,带来的问题是占用内存增大,如果数据结构选择HashSet,存入量大时,则容易使堆内存溢出,故考虑使用自定义数据结构:布隆过滤器,对其原理和误判率进行了分析,并针对其可能产生的错误数据("假阳性")提出消除方案,经实验验证,布隆过滤器占用内存低、查找效率高,解决本类问题极为合适。 The theory and application scenarios of Bloom filter is illustrated by an analysis sample of customer behavior data.During the project Bloom filter can be used to search for large dataset effectively at a rapid rate.At the beginning of this paper,in-memory database,like MongoDB,is used to solve that question,with a lookup time complexity of O(1)after default index(_id)is the only one permitted to save the premium accouts.The disadvantage is that the functionality needed is limited and the pressure brought by concurrent(one to multiple)query becomes bigger as the valume of data increses.Then the accounts can be read into momery througth appropriate data structure using distributed cache.The mode of data access is changed into oneto-one,resulting in the bigger usage of memory.With a small amount of data to be processed,the performace of HashSet is acceptable because of its convience and speed.As the volume of data increases,Heap memory may overflow.Then,a custom data structure is adopted for the Bloom filter.The basic theory and false positive rate are analyzed,the error data(False Positive Error),reduced by Bloom Filter,can be eliminated.Theory analysis and experiment show that the features of low space usage and high search efficiency for Bloom filter are appropriate to solve this problem.
作者 饶文 陈旭
出处 《微型电脑应用》 2018年第2期68-71,80,共5页 Microcomputer Applications
关键词 MAPREDUCE 布隆过滤器 数据集 MONGODB MapReduce Bloom filter Dataset Mongo DB Hash table
  • 相关文献

参考文献7

二级参考文献40

  • 1谢鲲,闵应骅,张大方,谢高岗,文吉刚.分档布鲁姆过滤器的查询算法[J].计算机学报,2007,30(4):597-607. 被引量:14
  • 2Bloom B. Space/time tradeoffs in hash coding with allowable errors. Communications of the ACM, 1970,13(7): 422-426
  • 3Mitzenmacher M. Compressed Bloom Filters. In: Proc. of the 20th ACM Symposium on Principles of Distributed Computing (PODC2001) ,Aug. 2001
  • 4Fan L,Cao P,Almeida J,Broder A. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM transactions on networking,2000,8(3)
  • 5Kubiatowicz J,et al. OceanStore: An architecture for globe-scale persistent storage. In:Proc. of the 9th Intl. conf. on architectural support for programming languages and operating systems (ASPLOS 2000) ,2000
  • 6Ramakrishna M V. Practical performance of Bloom Filters and parallel free-text searching. Communications of the ACM, 1989,32(10):1237-1239
  • 7Mulllin J K. A second look at Bloom Filters. Communiations of the ACM,1983,26(8) :570-571
  • 8Witten I H, Moffat A, Bell T. Managing Gigabytes (2nd Edition). Morgan Kaufmann,San Francisco, 1999
  • 9Zhao B Y, Kubiatowicz J, Joseph A D. Tapstry: An infrastructure for fault-tolerant wide-area location and routing.Computer Science Division University of California, (UCB/CSD-01-1141) ,April 2001
  • 10Balter M H, Leighton T, Lewin D. Resource discovery in distributed networks. In: Proc. of the 18th annual ACM symposium on priciples of distributed computing (PODC'99),1999

共引文献60

同被引文献7

引证文献2

二级引证文献3

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部