Abstract
To address the conflict in existing deduplication methods between improving the compression ratio and reducing metadata overhead, a deduplication method based on pre-chunking and sliding windows is proposed, together with a general model for performance analysis. The method first performs content-based pre-chunking of each data object, and then applies different chunking strategies to the changed and unchanged regions of the data, so that a high compression ratio and low metadata overhead can both be achieved even when the expected chunk size is relatively large. Experimental results on real data sets show that the method's average compression ratio exceeds the best existing value, while its average time overhead is significantly reduced.
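The keywords identify sliding-window content-defined chunking (CDC) as the baseline technique the method builds on. The following is a minimal sketch of that baseline, assuming a simple polynomial rolling hash in place of the Rabin fingerprint usually used in practice; WINDOW, MASK, and the chunk-size bounds are illustrative values, not parameters taken from the paper.

```python
# Sketch of sliding-window content-defined chunking (CDC).
# A polynomial rolling hash is slid over the data one byte at a time;
# a chunk boundary is declared wherever the low bits of the hash match
# a fixed pattern, so boundaries depend on content, not on offsets.

WINDOW = 48            # bytes covered by the sliding window
PRIME = 31             # base of the polynomial rolling hash
MODULUS = 1 << 61      # keeps hash values bounded
MASK = (1 << 13) - 1   # boundary test: expected spacing ~ 8 KiB
MIN_CHUNK = 2 * 1024   # suppress boundaries that would make tiny chunks
MAX_CHUNK = 64 * 1024  # force a boundary if no natural one appears

def chunk_boundaries(data: bytes):
    """Yield end offsets of content-defined chunks in `data`."""
    # Precompute PRIME**(WINDOW-1) so the oldest byte can be removed in O(1).
    pow_out = pow(PRIME, WINDOW - 1, MODULUS)
    h = 0
    start = 0
    for i, b in enumerate(data):
        # Slide the window: once it is full, drop the byte that falls out.
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * pow_out) % MODULUS
        h = (h * PRIME + b) % MODULUS
        size = i - start + 1
        # Cut when the hash's low bits match the pattern, subject to
        # the minimum and maximum chunk-size limits.
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield len(data)

if __name__ == "__main__":
    import os
    data = os.urandom(1 << 20)  # 1 MiB of random test data
    ends = list(chunk_boundaries(data))
    sizes = [e - s for s, e in zip([0] + ends, ends)]
    print(f"{len(sizes)} chunks, mean size {sum(sizes) / len(sizes):.0f} bytes")
```

Per the abstract, the proposed method would first pre-chunk each object by content and then apply fine-grained chunking of this kind only inside the regions detected as changed, while unchanged regions keep larger chunks; that two-tier strategy is what lets a large expected chunk size coexist with a high compression ratio and low metadata overhead.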
Source
Control and Decision (《控制与决策》)
2012, No. 8, pp. 1157-1162 and 1168 (7 pages)
Indexed in: EI, CSCD, Peking University Core Journals (北大核心)
Funding
National Natural Science Foundation of China (60873075, 60973118)
Ministry of Education Cultivation Fund Project (708078)
Keywords
deduplication
data compression
sliding window
content-defined chunking