Mining of Data Redundancy Characteristic in Deduplication Systems (删冗系统数据冗余特征挖掘)

Cited by: 1
Abstract: Data deduplication is widely adopted in storage systems as an effective technique for reducing stored data volume. However, existing studies of data redundancy characteristics in deduplication systems are limited; most focus only on raising the deduplication ratio for specific datasets. This paper examines file-level data redundancy characteristics in more depth. First, it defines the correlation of files and file sets based on the duplicate data blocks they share, and reduces correlation mining to the well-studied frequent itemset mining problem. Second, it gives an offline process for transforming deduplication metadata into a transaction group database, so that frequent itemset mining algorithms can be applied to compute correlations. Finally, it proposes an incremental correlation mining algorithm that can be embedded in the deduplication storage system for near-real-time analysis of data redundancy characteristics. The work helps explain the sources and distribution of redundant data in deduplication systems, and can thereby guide the design of more effective deduplication and I/O optimization algorithms for specific application environments.
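Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 10, pp. 2237-2242 (6 pages)
Funding: Supported by the National High Technology Research and Development Program of China ("863" Program), grant 2012AA012600
Keywords: deduplication; storage system; data redundancy characteristic; frequent itemset mining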
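
The reduction outlined in the abstract (files correlated through the duplicate blocks they share, mined as frequent itemsets) can be illustrated with a small sketch. This is not the paper's algorithm: the record does not give the exact definitions, so the encoding here is an assumption, namely that each shared block fingerprint forms one transaction whose items are the files containing it, and that the correlation of a file set is the number of such transactions it appears in. All names (`build_transactions`, `frequent_file_sets`, `IncrementalMiner`, the sample file names and fingerprints) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_transactions(file_blocks):
    """Invert {file: block fingerprints} into {fingerprint: set of files}.

    Assumed encoding: every fingerprint held by two or more files becomes
    one transaction whose items are the files that contain that block.
    """
    owners = defaultdict(set)
    for name, fingerprints in file_blocks.items():
        for fp in set(fingerprints):
            owners[fp].add(name)
    return {fp: files for fp, files in owners.items() if len(files) > 1}


def frequent_file_sets(transactions, min_support, max_size=3):
    """Return file sets (size 2..max_size) found in >= min_support transactions."""
    counts = Counter()
    for files in transactions.values():
        ordered = sorted(files)
        for size in range(2, max_size + 1):
            for subset in combinations(ordered, size):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


class IncrementalMiner:
    """Toy incremental variant: pair supports are updated as each file arrives.

    Assumes every file is added exactly once; re-adding a file would
    count its shared blocks again.
    """

    def __init__(self):
        self.owners = defaultdict(set)  # fingerprint -> files seen so far
        self.pair_support = Counter()   # (file_a, file_b) -> shared block count

    def add_file(self, name, fingerprints):
        for fp in set(fingerprints):
            for other in self.owners[fp]:
                if other != name:
                    self.pair_support[tuple(sorted((other, name)))] += 1
            self.owners[fp].add(name)


# Hypothetical sample data: three files described by block fingerprints.
sample = {
    "a.img": ["h1", "h2", "h3", "h4"],
    "b.img": ["h1", "h2", "h3", "h9"],
    "c.img": ["h1", "h7"],
}

offline = frequent_file_sets(build_transactions(sample), min_support=2)

miner = IncrementalMiner()
for file_name, blocks in sample.items():
    miner.add_file(file_name, blocks)
```

The offline path enumerates every subset of each transaction, which is only workable for small transactions; a real deployment would more plausibly use an Apriori- or FP-growth-style miner. The incremental class tracks pairs only, trading completeness for constant work per shared block.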