Mining of Data Redundancy Characteristic in Deduplication Systems
(删冗系统数据冗余特征挖掘)

Cited by: 1

Abstract: Data deduplication is widely adopted in storage systems as an effective technique for reducing the volume of stored data. However, existing studies of data redundancy characteristics in deduplication systems fall short: most focus only on raising the deduplication ratio for specific datasets. This paper examines file-level data redundancy characteristics in deduplication systems in greater depth. First, we define the correlation of files and file sets based on the duplicate data blocks they share, and reduce the correlation mining problem to the well-studied frequent itemset mining problem. Second, we describe an offline process that transforms deduplication metadata into a transaction group database, so that frequent itemset mining algorithms can be applied to compute correlations. Finally, we propose an incremental correlation mining algorithm that can be embedded in a deduplication storage system to analyze data redundancy characteristics in near real time. This work helps clarify the sources and distribution of redundant data in deduplication systems, and thereby supports the design of more effective deduplication algorithms and I/O optimization algorithms for specific application environments.
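The reduction described in the abstract can be illustrated with a small sketch. This record does not give the paper's exact definitions, so everything below is an assumption: each duplicate block fingerprint is treated as one transaction whose items are the files that reference it, and the support of a file set is the number of blocks shared by all of its files. The function names `build_transactions` and `frequent_filesets`, the sample file names, and the brute-force enumeration (used here in place of a real frequent itemset algorithm such as Apriori) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_transactions(file_blocks):
    """Invert a file -> block-fingerprint map into one transaction per shared block.

    Assumption: a transaction is the set of files that reference a given block.
    Blocks referenced by a single file carry no cross-file redundancy and are dropped.
    """
    block_to_files = defaultdict(set)
    for name, fingerprints in file_blocks.items():
        for fp in set(fingerprints):  # ignore repeats within one file
            block_to_files[fp].add(name)
    return [frozenset(files) for files in block_to_files.values() if len(files) > 1]


def frequent_filesets(transactions, min_support, max_size=3):
    """Count every file set of size 2..max_size and keep those meeting min_support.

    This is plain enumeration for illustration, not an optimized mining algorithm.
    """
    counts = defaultdict(int)
    for transaction in transactions:
        items = sorted(transaction)
        for k in range(2, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {fileset: n for fileset, n in counts.items() if n >= min_support}


# Hypothetical deduplication metadata: file name -> block fingerprints.
file_blocks = {
    "a.img": ["h1", "h2", "h3", "h4"],
    "b.img": ["h1", "h2", "h3", "h9"],
    "c.img": ["h3", "h7"],
}

transactions = build_transactions(file_blocks)
result = frequent_filesets(transactions, min_support=2)
print(result)
```

With a minimum support of 2, only the pair `a.img` and `b.img` qualifies in this toy input, since it shares three blocks while every file set involving `c.img` shares one. An incremental variant would presumably update such counts as new files are written rather than rebuilding the transactions, but the record gives no detail on how the paper does this.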

Authors: Jiang Zhixiong (江志雄), Lu Chunyang (陆春阳), Yu Hongliang (余宏亮)

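Affiliations: China National Petroleum (中国石油) Changping Data Center; Institute of High Performance Computing, Tsinghua University (清华大学高性能计算研究所)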

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2014, No. 10, pp. 2237-2242 (6 pages)

Funding: Supported by the National High Technology Research and Development Program of China ("863" Program), grant 2012AA012600

Keywords: deduplication; storage system; data redundancy characteristic; frequent itemset mining

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]


