Abstract
Extreme binning (EB) uses the minimum chunk signature of a file as the file's representative feature, so it is poorly suited to workloads that consist mainly of small files and achieves poor deduplication efficiency on them. To improve on EB, this paper proposes simi index, a novel two-level deduplication index that uses a simi hash as the feature of a file and targets workloads dominated by small files. Experimental results show that the deduplication efficiency of simi index is 24.8% higher than that of EB, and that the RAM usage of simi index is only 0.265% of that of EB. Compared with EB, simi index therefore needs less storage and less RAM.
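The two kinds of file feature contrasted in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: fixed-size chunking, SHA-1 truncated to 64 bits as the chunk signature, and a generic SimHash-style bit vote standing in for the "simi hash". The function names are hypothetical, and the paper's actual construction may differ.

```python
import hashlib


def chunk_ids(data: bytes, chunk_size: int = 8) -> list[int]:
    """Return a 64-bit signature per chunk.

    Assumption: fixed-size chunks and truncated SHA-1; real
    deduplication systems usually use content-defined chunking.
    """
    return [
        int.from_bytes(hashlib.sha1(data[i:i + chunk_size]).digest()[:8], "big")
        for i in range(0, len(data), chunk_size)
    ]


def min_chunk_feature(ids: list[int]) -> int:
    """EB-style feature: the minimum chunk signature of the file."""
    return min(ids)


def simhash_feature(ids: list[int], bits: int = 64) -> int:
    """SimHash-style feature (assumed stand-in for the paper's simi hash).

    Each chunk signature votes +1 or -1 on every bit position; the
    feature keeps the bits whose total vote is positive.
    """
    votes = [0] * bits
    for cid in ids:
        for b in range(bits):
            votes[b] += 1 if (cid >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two features."""
    return bin(a ^ b).count("1")
```

In this sketch, two files match under the minimum-chunk feature only if their smallest chunk signatures are identical, whereas SimHash-style features of files that share most chunks would be expected to lie within a small Hamming distance of each other, which allows approximate matching.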
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2013, No. 12, pp. 3614-3617 (4 pages)
Application Research of Computers
Funding
Natural Science Foundation of Shaanxi Province (2010JM8023)
Aeronautical Science Foundation of China (2010ZD53042)