期刊文献+

Minwise Hash动态双重阈值过滤器

Dynamic Double-Threshold Filter for Minwise Hash
下载PDF
导出
摘要 结合二项分布和小概率原理进行理论推导,提出了Minwise Hash的动态双重阈值过滤器,将比对过程划分为多个比对点,并设置各比对点的动态阈值,过滤相似度低于下界阈值TL(k)的文档,输出相似度高于上界阈值TU(k)的文档.该提前过滤的方法减少了后续的比对次数,降低了工作量,并设计了多组实验,结果显示过滤器在选取了适当的参数时,计算时间仅为原Minwise Hash的31%或原b位Minwise Hash的36%,较大地提升了原算法的时间效率.动态双重阈值过滤器不仅能应用于Minwise Hash,也能用于它的变种算法(如b位Minwise Hash),乃至所有符合二项分布的估计子. By combining the binomial distribution with the small probability principle, a dynamic double- threshold filter for Minwise Hash was proposed. The filter divides compared progress into several com- pared point and sets dynamic threshold on them to filter documents with similarity under lower threshold TL(k) and output documents with similarity exceed upper threshold Tu (k). It decreases the comparing time and workload that follow. The experimental results demonstrate that the dynamic double-threshold filter requires only 31% of CPU running time of Minwise Hash or 36% of b-bit Minwise Hash. Besides, the performance of the filter is enhanced with a set of suitable parameters. The filter applies not only to Minwise Hash, but also to its mutation algorithm (e. g. b-bit Minwise Hash), and even all estimators con- forming to binomial distribution.
出处 《上海交通大学学报》 EI CAS CSCD 北大核心 2016年第7期1075-1081,共7页 Journal of Shanghai Jiaotong University
基金 国家自然科学基金面上项目(No.61402165) 湖南省自然科学基金面上项目(No.2015JJ3058) 湖南工业大学自然科学基金(No.2014HZX17) 湖南省教育厅科技研究项目(No.14C0325) 湖南省教育厅科学研究基金(No.15C1288)
关键词 动态双重阈值 相似性检测 哈希 小概率事件 dynamic double-threshold similarity detection Hash small probability event
  • 相关文献

参考文献16

  • 1BENDERSKY M, CROFT W B. Finding text reuse on the web[C]//ACM International Conference on Web Search & Data Mining. New York, USA: ACM, 2009:262-271.
  • 2KALPAKIS K, TANG S. Collaborative data gather- ing in wireless sensor networks using measurement co-occurrence[J]. Computer Communications, 2008, 31 (10) : 1979-1992.
  • 3BRODER A Z, CHARIKAR M, FRIEZE A M, et al. Min-wise independent permutations[J]. Journal of Computer & System Sciences, 2000, 60 (3).- 630- 659.
  • 4DOURISBOURE Y, GERACI F, PELLEGRINI M. Extraction and classification of dense implicit commu- nities in the Web graph[J]. Journal of Animal Sci- ence, 2009, 3(2):334-339.
  • 5JI S, FAKHRY A, DENG H. Integrative analysis of the connectivity and gene expression atlases in the mouse brain[J]. Neuroimage, 2013, 84(1):245-253.
  • 6INDYK P. A small approximately rain-wise inde- pendent family of hash functions[J]. Journal of Algo- rithms, 2000, 38(1) ..84-90.
  • 7CHARIKAR M S. Similarity estimation techniques from rounding algorithms [C]//Thiry-Fourth ACM Symposium on Theory of Computing. New York, USA: ACM, 2002: 380-388.
  • 8SHAH R D, MEINSHAUSEN N. Min-wise hashing for largE-scale regression and classification with sparse data. Technical Report[R]. New York:Cornell University Library, 2013.
  • 9BUEHRER G, CHELLAPILLA K. A scalable pat- tern mining approach to web graph compression with communities[C]//WSDM ' 08 Proceedings of the Inter- national Conference on Web Search and Web Data Mining. New York, USA: ACM, 2008: 95-106.
  • 10TABEI Y, YAMANISHI Y. Scalable prediction of compound-protein interactions using minwise hashing [J]. Bmc Systems Biology, 2013, 7(6) :1-13.

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部