Abstract
Data corruption and loss can cause irreparable damage, and a data backup system can keep such losses to a minimum. As the volume of collected data grows rapidly, so does the volume of data that backup systems must back up and restore. However, the similarity between backup files exceeds 60%, so storing all of them on disk wastes considerable storage space. This paper therefore proposes a DELTA compression method based on K-medoids clustering to remove duplicate data from backup data. The method first splits files into blocks, then applies DELTA compression to each pair of blocks and takes the size of the resulting compressed file as the similarity between the two blocks. K-medoids clustering is then performed on these similarities as a preprocessing step before DELTA compression. Finally, according to the clustering result, small file blocks are merged and then DELTA-compressed. Test results show that the method improves the compression ratio, reduces the number of fingerprint lookups during DELTA compression, and shortens compression time.
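The pipeline summarized in the abstract (pairwise DELTA sizes as a similarity measure, K-medoids clustering, then merging blocks by cluster) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: `zlib` compression with a preset dictionary stands in for a real DELTA encoder, the clustering is a simple alternating K-medoids with farthest-first initialization, and the function names `delta_size`, `distance_matrix`, `k_medoids`, and `merge_clusters` are hypothetical. Note that a smaller compressed size means the two blocks are more alike, so the value is used as a distance.

```python
import zlib


def delta_size(reference: bytes, target: bytes) -> int:
    """Approximate the delta-encoded size of `target` against `reference`.

    Assumption: zlib with a preset dictionary is only a stand-in for a
    real DELTA encoder; it sees at most the last 32 KiB of `reference`.
    Blocks are assumed to be non-empty.
    """
    compressor = zlib.compressobj(level=9, zdict=reference)
    return len(compressor.compress(target) + compressor.flush())


def distance_matrix(blocks):
    """Symmetric matrix of pairwise delta sizes (smaller = more similar)."""
    n = len(blocks)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = min(delta_size(blocks[i], blocks[j]),
                    delta_size(blocks[j], blocks[i]))
            dist[i][j] = dist[j][i] = d
    return dist


def k_medoids(dist, k, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix.

    Returns (medoids, labels): medoids[c] is the block index of cluster
    c's medoid, and labels[i] is the cluster index of block i.
    Off-diagonal distances are assumed to be strictly positive.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of blocks")

    def assign(current):
        # Each block joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    # Farthest-first initialization: deterministic and spread out.
    medoids = [0]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(
            max(rest, key=lambda i: min(dist[i][m] for m in medoids)))

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            updated.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)


def merge_clusters(blocks, labels, k):
    """Concatenate the blocks of each cluster into one larger block."""
    return [b"".join(b for b, lab in zip(blocks, labels) if lab == c)
            for c in range(k)]
```

In this sketch, each merged block returned by `merge_clusters` would then be handed to the actual DELTA compressor. Computing all pairwise sizes costs a quadratic number of compressions, so it is only practical for a modest number of blocks.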
Source
《计算机技术与发展》
2018, No. 2, pp. 125-129 (5 pages)
Computer Technology and Development
Funding
Science and Technology Project of State Grid Corporation of China Headquarters (0711-150TL173)