Abstract
As deduplication proceeds, metadata such as the manifest files that store the fingerprint index accumulates continuously, incurring non-negligible storage overhead. Compressing the metadata produced during deduplication, and thereby shrinking the lookup index without degrading the deduplication ratio, is therefore an important factor in further improving deduplication efficiency and storage utilization. Observing that deduplication metadata contains a large amount of redundant data, a condensed-nearest-neighbor-based redundancy elimination algorithm for deduplication metadata, called Dedup2, is proposed. The algorithm first partitions the deduplication metadata into several categories with a clustering algorithm, then applies the condensed nearest neighbor rule to eliminate highly similar entries, yielding a lookup subset; new data objects are then deduplicated against this subset using file similarity. Experimental results show that Dedup2 compresses the lookup index by more than 50% while maintaining a nearly identical deduplication ratio.
Building an effective deduplication index in memory can reduce disk accesses and speed up chunk fingerprint lookup, which is a major challenge for deduplication algorithms in massive data environments. As deduplication data sets contain many highly similar samples, a deduplication algorithm based on the condensed nearest neighbor rule, called Dedup2, was proposed. Dedup2 uses a clustering algorithm to divide the original deduplication metadata into several categories. Within these categories, it employs the condensed nearest neighbor rule to remove the most similar data from the deduplication metadata, obtaining a subset of the deduplication metadata. Based on this subset, new data objects are deduplicated according to the principle of data similarity. Experimental results show that Dedup2 reduces the size of the deduplication data set by more than 50% while maintaining a similar deduplication ratio.
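The condensation step described above follows the classic condensed nearest neighbor (CNN, Hart's) rule: keep only the samples needed to correctly classify the rest by 1-NN, and discard the redundant ones. The sketch below is a minimal illustration of that rule on toy 2-D "fingerprint feature" vectors with cluster labels; the feature representation, distance metric, and data are illustrative assumptions, not the paper's actual Dedup2 implementation.

```python
# Minimal sketch of the condensed nearest neighbor (Hart's) rule.
# Assumes metadata entries are represented as feature vectors and that the
# clustering step has already assigned each entry a category label.
import math

def dist(a, b):
    # Euclidean distance between two feature vectors (Python 3.8+)
    return math.dist(a, b)

def condense(samples, labels):
    """Return indices of a condensed subset: samples kept are those
    required so every remaining sample is correctly 1-NN classified."""
    store = [0]          # indices kept; seed with the first sample
    changed = True
    while changed:       # repeat until a full pass adds nothing
        changed = False
        for i in range(len(samples)):
            if i in store:
                continue
            # classify sample i by its nearest neighbor in the store
            nearest = min(store, key=lambda j: dist(samples[i], samples[j]))
            if labels[nearest] != labels[i]:
                store.append(i)   # misclassified -> must be kept
                changed = True
    return store

# Toy example: two well-separated clusters of 2-D feature vectors
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
labels = [0, 0, 0, 1, 1, 1]
kept = condense(samples, labels)
print(f"kept {len(kept)} of {len(samples)} samples")  # prints "kept 2 of 6 samples"
```

Here six entries condense to two representatives, one per cluster, which mirrors how highly similar metadata entries can be dropped while the retained subset still discriminates between categories.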
Source
《通信学报》
EI
CSCD
PKU Core
2015, No. 8, pp. 1-7 (7 pages)
Journal on Communications
Funding
National Natural Science Foundation of China (61370069)
National High Technology Research and Development Program of China (863 Program) (2012AA012600)
Fundamental Research Funds for the Central Universities (BUPT2011RCZJ16)
Keywords
deduplication
deduplication metadata
condensed nearest neighbor rule