Efficient Approach of Compression Ratio Estimation for Data Delta Compression
Abstract Delta compression eliminates not only identical data chunks but also the redundant portions shared by similar chunks, so it can achieve a higher compression ratio than deduplication. It has already been adopted in many commercial products. However, exploiting this additional compressibility introduces substantial overhead, including reading similar chunks back from storage devices to identify their redundant portions. As a result, delta compression typically runs at only about one-seventh the speed of deduplication. This overhead does not always pay off in a better compression ratio, because not all data contain enough compressibility to exploit. When delta compression is being considered for a storage system, it is therefore necessary to determine quickly whether the current data are worth delta compressing. This study proposes EDCR, a delta compression estimation framework that uses the similarity features of data chunks to judge the compressibility between them quickly, and thereby gives a fast and accurate assessment of the value of delta compressing the data. The framework also introduces sampling and correction schemes that further improve the efficiency and accuracy of compression ratio estimation. Evaluations on multiple real-world datasets show that EDCR keeps the estimation error rate below 1.5%. Compared with actually running a delta compression framework, EDCR is 18 to 24 times faster on solid-state drives (SSDs) and 16 to 146 times faster on hard disk drives (HDDs).
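The idea of judging compressibility from similarity features alone, without reading base chunks back and delta-encoding them, can be illustrated with a minimal sketch. This is not the EDCR implementation: the feature function (maximum of linearly transformed CRC32 window hashes), the constants in `TRANSFORMS`, the names `chunk_features` and `estimate_ratio`, and the assumption that the fraction of matching features approximates the fraction of redundant bytes are all hypothetical choices made for illustration. The sampling and correction schemes mentioned in the abstract are omitted.

```python
import zlib

MASK = 0xFFFFFFFF
# Hypothetical (multiplier, offset) pairs, one per similarity feature.
TRANSFORMS = [(0x9E3779B1, 0x7F4A7C15), (0x85EBCA6B, 0x165667B1),
              (0xC2B2AE35, 0x27D4EB2F), (0x2545F491, 0x9E3779B9)]


def chunk_features(chunk: bytes, window: int = 16) -> tuple:
    """Return one feature per transform: the largest transformed window hash."""
    if len(chunk) < window:
        hashes = [zlib.crc32(chunk)]
    else:
        hashes = [zlib.crc32(chunk[i:i + window])
                  for i in range(len(chunk) - window + 1)]
    return tuple(max((a * h + b) & MASK for h in hashes) for a, b in TRANSFORMS)


def estimate_ratio(chunks) -> float:
    """Estimate original size / stored size using only chunk features."""
    index = {}       # (feature position, value) -> features of an earlier chunk
    total = 0        # bytes seen so far
    estimated = 0.0  # bytes we assume would remain after delta compression
    for chunk in chunks:
        feats = chunk_features(chunk)
        best = 0
        for key in enumerate(feats):
            base = index.get(key)
            if base is not None:
                shared = sum(x == y for x, y in zip(feats, base))
                best = max(best, shared)
        # Assumption: the share of matching features stands in for the share
        # of bytes a real delta encoder would remove.
        similarity = best / len(feats)
        total += len(chunk)
        estimated += len(chunk) * (1.0 - similarity)
        for key in enumerate(feats):
            index.setdefault(key, feats)
    # With no bytes left to store (e.g. empty input), report a neutral ratio.
    return total / estimated if estimated > 0 else 1.0


if __name__ == "__main__":
    base = bytes(range(256)) * 4
    edited = base[:500] + b"\x00" + base[501:]
    print(estimate_ratio([base, edited]))
```

The estimator touches each chunk once and keeps only a few integers per chunk, which is the property that makes feature-based estimation cheaper than reading similar chunks from storage and encoding them.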
Authors ZOU Xiangyu; WEI Can; XIA Wen; LI Shiyi (School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518071, Guangdong, China; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen 518071, Guangdong, China)
Source Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2024, No. 12, pp. 70-82 (13 pages)
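Funding National Natural Science Foundation of China, General Program (61972441); Shenzhen Basic Research Program for Excellent Young Scholars (RCYX20210609104510007); Guangdong Provincial Young Innovative Talents Project for Regular Universities (2022KQNCXl59).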
Keywords delta compression; compression ratio estimation; similarity feature; sampling; estimation correction