Similar index: a two-level index for deduplication

Cited by: 1

Abstract: Extreme binning (EB) uses the minimum chunk signature of a file as the file's feature. This makes it poorly suited to workloads that consist mainly of small files, where it yields a poor deduplication ratio. To improve on EB, the paper proposes the similar index (simi index), which uses a similarity hash (simi hash) as the feature of a file. It is a two-level index for deduplicating workloads dominated by small files. Experimental results show that the deduplication ratio of the similar index is 24.8% higher than that of EB, while its RAM usage is only 0.265% of EB's. Compared with EB, the similar index needs less storage and less RAM.
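The difference between the two file features can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the simi hash, so `simhash_feature` below assumes a Charikar-style SimHash over chunk IDs as a stand-in, and `CHUNK_SIZE`, `chunk_ids`, `eb_feature`, and `TwoLevelIndex` are hypothetical names chosen for illustration. The sketch keeps the two-level shape described above: a primary index maps a file feature to a bin, and each bin holds the chunk IDs already stored.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; tiny fixed-size chunks, for illustration only


def chunk_ids(data: bytes) -> list[bytes]:
    """Split data into fixed-size chunks and return the SHA-1 of each chunk."""
    # max(..., 1) makes an empty file yield one (empty) chunk instead of none.
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, max(len(data), 1), CHUNK_SIZE)]


def eb_feature(ids: list[bytes]) -> bytes:
    """EB-style feature: the minimum chunk ID of the file."""
    return min(ids)


def simhash_feature(ids: list[bytes], bits: int = 64) -> int:
    """ASSUMPTION: Charikar-style SimHash standing in for the paper's simi hash."""
    counts = [0] * bits
    for cid in ids:
        value = int.from_bytes(cid[:bits // 8], "big")
        for b in range(bits):
            # Each chunk votes +1 or -1 on every bit position.
            counts[b] += 1 if (value >> b) & 1 else -1
    # A feature bit is set when the majority of chunks had it set.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


class TwoLevelIndex:
    """Level 1: file feature -> bin. Level 2: the bin's set of chunk IDs."""

    def __init__(self, feature_fn):
        self.feature_fn = feature_fn
        self.primary = {}        # feature -> bin (kept in RAM in a real system)
        self.stored_chunks = 0   # chunks actually written so far

    def add_file(self, data: bytes) -> int:
        """Deduplicate a file against its bin; return the number of new chunks."""
        ids = chunk_ids(data)
        bin_ = self.primary.setdefault(self.feature_fn(ids), set())
        new = set(ids) - bin_
        bin_ |= new
        self.stored_chunks += len(new)
        return len(new)


index = TwoLevelIndex(simhash_feature)
first = index.add_file(b"hello deduplication world!")   # all chunks are new
second = index.add_file(b"hello deduplication world!")  # same bin, nothing new
```

Passing `eb_feature` instead of `simhash_feature` gives the EB-style variant with the same index structure. Identical files land in the same bin under either feature; how well each feature groups merely similar small files is the question the paper's experiments address, and this sketch makes no claim about it.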

Authors: ZHANG Zhi-ke (张志珂), JIANG Ze-jun (蒋泽军), CAI Xiao-bin (蔡小斌), PENG Cheng-zhang (彭成章)

Affiliation: School of Computer Science, Northwestern Polytechnical University (西北工业大学计算机学院)

Source: Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 12, pp. 3614-3617 (4 pages)

Funding: Natural Science Foundation of Shaanxi Province (2010JM8023); Aeronautical Science Foundation (2010ZD53042)

Keywords: deduplication; simi hash; similar index; chunk-lookup disk bottleneck problem; two-level index

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]

Citation network

References (14)

1. ESHGHI K, LILLIBRIDGE M, WILCOCK L, et al. Jumbo store: providing efficient incremental upload and versioning for a utility rendering service [C] // Proc of the 5th USENIX Conference on File and Storage Technologies. Berkeley: USENIX, 2007: 123-138.
2. ZHU B, LI Kai, PATTERSON H. Avoiding the disk bottleneck in the Data Domain deduplication file system [C] // Proc of the 6th USENIX Conference on File and Storage Technologies. Berkeley: USENIX, 2008: 269-282.
3. LILLIBRIDGE M, ESHGHI K, BHAGWAT D, et al. Sparse indexing: large scale, inline deduplication using sampling and locality [C] // Proc of the 7th Conference on File and Storage Technologies. Berkeley: USENIX, 2009: 111-123.
4. BHAGWAT D, ESHGHI K, LONG D, et al. Extreme binning: scalable, parallel deduplication for chunk-based file backup [C] // Proc of IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems. Washington DC: IEEE Computer Society, 2009: 1-9.
5. ARONOVICH L, ASHER R, BACHMAT E, et al. The design of a similarity based deduplication system [C] // Proc of SYSTOR: The Israeli Experimental Systems Conference. New York: ACM Press, 2009: 6.
6. ROMANSKI B, HELDT L, KILIAN W, et al. Anchor-driven subchunk deduplication [C] // Proc of SYSTOR 2011: The Israeli Experimental Systems Conference. New York: ACM Press, 2011: 16.
7. ZHANG Zhi-ke, BHAGWAT D, LITWIN W, et al. Improved deduplication through parallel binning [C] // Proc of the 31st IEEE International Performance Computing and Communications Conference. Washington DC: IEEE Computer Society, 2012: 130-141.
8. ZHANG Zhi-ke, JIANG Ze-jun, LIU Zhi-qiang, et al. LHs: a novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication [C] // Proc of International Conference on Machine Learning and Cybernetics. Washington DC: IEEE Computer Society, 2012: 1312-1318.
9. DUBNICKI C, GRYZ L, HELDT L, et al. Hydrastor: a scalable secondary storage [C] // Proc of the 7th Conference on File and Storage Technologies. Berkeley: USENIX, 2009: 197-210.
10. UNGUREANU C, ATKIN B, ARANYA A, et al. Hydrafs: a high-throughput file system for the hydrastor content-addressable storage system [C] // Proc of the 8th USENIX Conference on File and Storage Technologies. Berkeley: USENIX, 2010: 225-238.

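Co-cited literature (1)

1. FU Yinjin, XIAO Nong, LIU Fang. Research progress on key techniques of data deduplication [J]. Journal of Computer Research and Development (计算机研究与发展), 2012, 49(1): 12-20. Cited by: 65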

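Citing literature (1)

1. ZHANG Zonghua, QU Ying, YE Zhijia, NIU Xinzheng. A deduplication algorithm based on multi-feature matching and Bloom filter [J]. Journal of Shenzhen University, Science and Engineering (深圳大学学报（理工版）), 2016, 33(5): 531-535. Cited by: 3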

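Second-level citing literature (3)

1. GUO Yujian, ZENG Zhihao. Research on an asymmetric maximum chunking algorithm for data deduplication [J]. Microcomputer & Its Applications (微型机与应用), 2017, 36(22): 30-33. Cited by: 1
2. SHU Yuanzhong, LIANG Tao, WANG Juan. Research on a web page URL deduplication strategy for the Tmall shopping platform [J]. Network Security Technology & Application (网络安全技术与应用), 2018(6): 48-50.
3. CAO Hui, ZHANG Qinzheng. Deduplication performance analysis based on the FSL dataset [J]. Journal of University of Electronic Science and Technology of China (电子科技大学学报), 2018, 47(4): 621-625. Cited by: 4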

