摘要
结合二项分布和小概率原理进行理论推导,提出了Minwise Hash的动态双重阈值过滤器,将比对过程划分为多个比对点,并设置各比对点的动态阈值,过滤相似度低于下界阈值TL(k)的文档,输出相似度高于上界阈值TU(k)的文档.该提前过滤的方法减少了后续的比对次数,降低了工作量,并设计了多组实验,结果显示过滤器在选取了适当的参数时,计算时间仅为原Minwise Hash的31%或原b位Minwise Hash的36%,较大地提升了原算法的时间效率.动态双重阈值过滤器不仅能应用于Minwise Hash,也能用于它的变种算法(如b位Minwise Hash),乃至所有符合二项分布的估计子.
By combining the binomial distribution with the small probability principle, a dynamic double- threshold filter for Minwise Hash was proposed. The filter divides compared progress into several com- pared point and sets dynamic threshold on them to filter documents with similarity under lower threshold TL(k) and output documents with similarity exceed upper threshold Tu (k). It decreases the comparing time and workload that follow. The experimental results demonstrate that the dynamic double-threshold filter requires only 31% of CPU running time of Minwise Hash or 36% of b-bit Minwise Hash. Besides, the performance of the filter is enhanced with a set of suitable parameters. The filter applies not only to Minwise Hash, but also to its mutation algorithm (e. g. b-bit Minwise Hash), and even all estimators con- forming to binomial distribution.
出处
《上海交通大学学报》
EI
CAS
CSCD
北大核心
2016年第7期1075-1081,共7页
Journal of Shanghai Jiaotong University
基金
国家自然科学基金面上项目(No.61402165)
湖南省自然科学基金面上项目(No.2015JJ3058)
湖南工业大学自然科学基金(No.2014HZX17)
湖南省教育厅科技研究项目(No.14C0325)
湖南省教育厅科学研究基金(No.15C1288)
关键词
动态双重阈值
相似性检测
哈希
小概率事件
dynamic double-threshold
similarity detection
Hash
small probability event