Abstract
Data deduplication is widely adopted in storage systems as an effective technique for reducing the volume of stored data. However, existing studies of the data redundancy characteristics of deduplication (dedup) systems are limited: most focus only on improving the dedup ratio for specific datasets. This paper examines the data redundancy characteristics of dedup systems in more depth at the file level. First, we define the correlation of files and of filesets in terms of the duplicate data blocks they share, and reduce the correlation mining problem to the well-studied frequent itemset mining problem. Second, we give an offline procedure that transforms dedup metadata into a transaction group database, so that frequent itemset mining algorithms can be applied to compute correlations. Finally, we propose an incremental correlation mining algorithm that can be embedded in a dedup storage system to analyze data redundancy characteristics in near real time. This work gives a better understanding of the sources and distribution of redundant data in dedup systems, which in turn helps in designing more effective dedup algorithms and I/O optimization algorithms for specific application environments.
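The reduction to frequent itemset mining can be illustrated with a small sketch. The abstract does not give the paper's formal definitions, so the encoding here is an assumption: each data block shared by two or more files is treated as one transaction whose items are the files containing it, and the support of a fileset is the number of blocks all its files share. The names `build_transactions` and `frequent_filesets`, the sample fingerprints, and the brute-force counting are hypothetical and stand in for the paper's transaction group database and mining algorithm.

```python
from collections import defaultdict
from itertools import combinations


def build_transactions(file_blocks):
    """Invert a file -> block-fingerprint map into per-block file sets.

    Assumed encoding: every block held by two or more files becomes one
    transaction whose items are the files that contain that block.
    """
    block_to_files = defaultdict(set)
    for file_name, fingerprints in file_blocks.items():
        for fp in set(fingerprints):
            block_to_files[fp].add(file_name)
    return [frozenset(files) for files in block_to_files.values() if len(files) > 1]


def frequent_filesets(transactions, min_support, max_size=3):
    """Count filesets of size 2..max_size and keep those that share at
    least min_support blocks (brute force, for illustration only)."""
    counts = defaultdict(int)
    for files in transactions:
        for size in range(2, min(max_size, len(files)) + 1):
            for subset in combinations(sorted(files), size):
                counts[subset] += 1
    return {fs: n for fs, n in counts.items() if n >= min_support}


# Hypothetical dedup metadata: file name -> list of block fingerprints.
file_blocks = {
    "a.img": ["h1", "h2", "h3", "h4"],
    "b.img": ["h1", "h2", "h3", "h9"],
    "c.img": ["h3", "h7"],
}
transactions = build_transactions(file_blocks)
result = frequent_filesets(transactions, min_support=2)
print(result)  # {('a.img', 'b.img'): 3}
```

In this toy input only `a.img` and `b.img` share at least two blocks, so they form the single correlated fileset. A real system would replace the brute-force counting with a standard frequent itemset algorithm such as Apriori or FP-growth.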
Source
《小型微型计算机系统》
CSCD
PKU Core Journals (北大核心)
2014, No. 10, pp. 2237-2242 (6 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National High Technology Research and Development Program of China (863 Program), Grant No. 2012AA012600
Keywords
deduplication
storage system
data redundancy characteristics
frequent itemset mining