Abstract
Delta compression eliminates not only identical data chunks but also the redundant portions shared by similar chunks, so it achieves higher compression ratios than deduplication alone. It has already been integrated into many commercial products. However, exploiting this additional compressibility introduces significant overhead, including reading similar chunks back from storage devices to identify their shared content. As a result, delta compression typically runs at only about one-seventh the speed of deduplication. This overhead does not always pay off, because not all data contain enough compressibility to exploit. Therefore, when delta compression is being considered for a storage system, it is important to determine quickly whether the data at hand are worth delta-compressing. This paper proposes EDCR, a framework for estimating delta compression ratios. EDCR uses the similarity features of data chunks to assess the compressibility between them quickly, and thus gives a fast and accurate judgment of whether delta compression is worthwhile for the data. In addition, the framework introduces sampling and correction schemes that further improve the efficiency and accuracy of the compression ratio estimate. Evaluations on multiple real-world datasets show that EDCR keeps the estimation error rate below 1.5%. Moreover, compared with a framework that actually performs delta compression, EDCR runs 18 to 24 times faster on solid-state drives (SSDs) and 16 to 146 times faster on hard disk drives (HDDs).
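The idea of estimating compressibility from similarity features, without reading base chunks back and running a real delta encoder, can be illustrated with a small Python sketch. This is not EDCR's algorithm, which the abstract does not specify. It is a toy estimator under stated assumptions: fixed-size chunks, four min-hash style features per chunk derived from a CRC32 rolling over 8-byte windows, an assumed rule that the delta-encoded size shrinks in proportion to the fraction of features shared with the best earlier chunk, and plain every-n-th-chunk sampling with no correction step. All names and constants (`features`, `estimate_delta_ratio`, `MULTIPLIERS`, `OFFSETS`) are hypothetical.

```python
import hashlib
import zlib
from collections import Counter, defaultdict

# Fixed odd multipliers and offsets: each pair turns one base hash into a
# different feature hash (an illustrative choice, not taken from the paper).
MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F)
OFFSETS = (0x165667B1, 0x7F4A7C15, 0x2545F491, 0x1B873593)
MASK = 0xFFFFFFFF


def features(chunk, window=8):
    """Return one min-hash style feature per (multiplier, offset) pair."""
    mins = [MASK + 1] * len(MULTIPLIERS)
    for i in range(max(1, len(chunk) - window + 1)):
        base = zlib.crc32(chunk[i:i + window])
        for j, (m, a) in enumerate(zip(MULTIPLIERS, OFFSETS)):
            h = (m * base + a) & MASK
            if h < mins[j]:
                mins[j] = h
    return tuple(mins)


def estimate_delta_ratio(data, chunk_size=4096, sample_every=1):
    """Estimate original_size / compressed_size over the sampled chunks."""
    seen = set()                # fingerprints of chunks already stored
    index = defaultdict(list)   # (position, feature) -> ids of earlier chunks
    original = 0
    estimated = 0.0
    chunk_id = 0
    # Sampling: visit only every `sample_every`-th chunk.
    for start in range(0, len(data), chunk_size * sample_every):
        chunk = data[start:start + chunk_size]
        original += len(chunk)
        fingerprint = hashlib.sha1(chunk).digest()
        if fingerprint in seen:
            continue            # exact duplicate: assumed to cost nothing
        seen.add(fingerprint)
        feats = features(chunk)
        # Count, for each earlier chunk, how many features it shares.
        votes = Counter()
        for key in enumerate(feats):
            for other in index[key]:
                votes[other] += 1
        best = max(votes.values(), default=0)
        # Assumption: delta size shrinks with the share of matching features.
        estimated += len(chunk) * (1 - best / len(feats))
        for key in enumerate(feats):
            index[key].append(chunk_id)
        chunk_id += 1
    return original / estimated if estimated else 1.0
```

Calling `estimate_delta_ratio(data, sample_every=4)` looks at only a quarter of the chunks, which makes the estimate cheaper but also hides similar pairs whose partner was not sampled. This sketch makes no attempt to compensate for that.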
Authors
ZOU Xiangyu (邹翔宇)
WEI Can (魏灿)
XIA Wen (夏文)
LI Shiyi (李诗逸)
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518071, Guangdong, China; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen 518071, Guangdong, China
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core (北大核心)
2024, Issue 12, pp. 70-82 (13 pages)
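Funding
National Natural Science Foundation of China, General Program (61972441)
Shenzhen Basic Research Program for Excellent Young Scholars (RCYX20210609104510007)
Guangdong Provincial Young Innovative Talents Project for Regular Universities (2022KQNCX159)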
Keywords
delta compression
compression ratio estimation
similarity feature
sampling
estimation correction