Improving Metadata Caching Efficiency for Data Deduplication via In-RAM Metadata Utilization


Abstract: We describe a data deduplication system for backup storage of PC disk images, named in-RAM metadata utilizing deduplication (IR-MUD). First, in-RAM hash granularity adaptation and miniLZO-based data compression are proposed to reduce the in-RAM metadata size and thereby the space overhead of the in-RAM metadata caches. Second, an in-RAM metadata write cache, as opposed to the traditional metadata read cache, is proposed to further reduce metadata-related disk I/O operations and improve deduplication throughput. During deduplication, the metadata write cache is managed following the LRU caching policy. Each manifest that is hit in the metadata write cache avoids an expensive manifest reload from disk. After deduplication, all manifests in the metadata write cache are written to disk and the cache is cleared. Our experimental results on a 1.5 TB real-world disk image dataset show that 1) IR-MUD achieved about a 95% size reduction for the deduplication metadata, with a small time overhead; 2) when the metadata write cache was not used, with the same RAM space for the metadata read cache, IR-MUD achieved a 400% higher RAM hit ratio and a 50% higher deduplication throughput than the classic Sparse Indexing deduplication system, which applies no metadata utilization approaches; and 3) when the metadata write cache was used and enough RAM space was available, IR-MUD achieved a 500% higher RAM hit ratio than Sparse Indexing and a 70% higher deduplication throughput than IR-MUD with only a single metadata read cache. The in-RAM metadata harnessing and metadata write caching approaches of IR-MUD can be applied in most parallel deduplication systems to improve metadata caching efficiency.
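The write-cache behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `ManifestWriteCache`, the dict standing in for the on-disk manifest store, the hit and reload counters, and the choice to write evicted manifests back to disk are all assumptions made for illustration. The abstract itself states only that the cache follows LRU, that a hit avoids a manifest reload from disk, and that all cached manifests are stored on disk after deduplication.

```python
from collections import OrderedDict


class ManifestWriteCache:
    """Hypothetical LRU write cache for manifests (illustrative only)."""

    def __init__(self, capacity, store):
        self.capacity = capacity      # maximum number of manifests kept in RAM
        self.store = store            # dict-like stand-in for the on-disk store
        self.cache = OrderedDict()    # ordered from least to most recently used
        self.hits = 0                 # lookups served from RAM
        self.disk_loads = 0           # lookups that had to reload from disk

    def get(self, manifest_id):
        if manifest_id in self.cache:
            # Hit: the manifest is already in RAM, so no disk reload is needed.
            self.cache.move_to_end(manifest_id)
            self.hits += 1
            return self.cache[manifest_id]
        # Miss: reload the manifest from the (simulated) disk.
        self.disk_loads += 1
        manifest = self.store[manifest_id]
        self._insert(manifest_id, manifest)
        return manifest

    def put(self, manifest_id, manifest):
        # New or updated manifests stay in RAM instead of going straight to disk.
        self._insert(manifest_id, manifest)

    def _insert(self, manifest_id, manifest):
        self.cache[manifest_id] = manifest
        self.cache.move_to_end(manifest_id)
        while len(self.cache) > self.capacity:
            # Assumption: the least recently used manifest is written back on eviction.
            old_id, old_manifest = self.cache.popitem(last=False)
            self.store[old_id] = old_manifest

    def flush(self):
        # After deduplication: store every cached manifest on disk and clear the cache.
        while self.cache:
            manifest_id, manifest = self.cache.popitem(last=False)
            self.store[manifest_id] = manifest


disk = {}
cache = ManifestWriteCache(capacity=2, store=disk)
cache.put("m1", ["h1", "h2"])
cache.put("m2", ["h3"])
cache.get("m1")           # hit: served from RAM
cache.put("m3", ["h4"])   # evicts m2, the least recently used manifest
cache.get("m2")           # miss: reloaded from disk, evicts m1
cache.flush()             # end of deduplication run
```

In this toy run the single hit on `m1` is the case the abstract highlights: the manifest is found in RAM and the disk reload is skipped, whereas the later lookup of `m2` pays for a reload because it had already been evicted.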

Authors: Bing Zhou, Jiang-Tao Wen

Affiliation: State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology

Source: Journal of Computer Science & Technology (计算机科学技术学报（英文版）), 2016, Issue 4, pp. 805-819 (15 pages). Indexed in SCIE, EI, CSCD.

Funding: This work is supported by the National Science Fund for Distinguished Young Scholars of China under Grant No. 61125102 and the Key Program of National Natural Science Foundation of China under Grant No. 61133008.

Keywords: data deduplication, cache, metadata utilization

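Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TP332 [Automation and Computer Technology: Computer System Architecture]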



