Abstract
To support operations directly on compressed data without decompression in massive data management, a new column-store-oriented compression algorithm, CCA (Column Compression Algorithm), is proposed, based on the characteristics of column data storage and under the assumption that the data follow a normal distribution. First, column data are classified by length; second, sampling is used to obtain prefixes with a high degree of repetition; finally, dictionary coding is used for compression. Column Index (CI) and Column Reality (CR) are proposed as the compressed data structures to reduce the storage requirements of massive data, so that basic relational algebra operations such as selection, projection and join are supported directly and efficiently on the compressed data. A prototype database system based on CCA, called D-DBMS (Ding Database Management System), was implemented. Theoretical analysis and experimental results on 1 TB of data show that the proposed algorithm significantly improves storage efficiency and data manipulation performance for massive data. Compared with the BAP (Bit Address Physical) and TIDC (Tuple ID Center) methods, CCA improves the compression rate by 51% and 14%, and the execution speed by 47% and 42%, respectively.
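The three steps named in the abstract (classification by length, prefix sampling, dictionary coding) can be illustrated with a minimal sketch. This is not the authors' CCA implementation, and it does not model the CI and CR structures, whose layout the abstract does not specify. All function names, the fixed prefix length, and the `(code, suffix)` encoding are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def group_by_length(values):
    """Step 1 (illustrative): bucket column values by length, keeping row ids."""
    groups = defaultdict(list)
    for row_id, value in enumerate(values):
        groups[len(value)].append((row_id, value))
    return groups


def sample_frequent_prefixes(values, prefix_len, sample_size, top_k, seed=0):
    """Step 2 (illustrative): estimate the most repeated prefixes from a sample."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    counts = Counter(v[:prefix_len] for v in sample if len(v) >= prefix_len)
    return [prefix for prefix, _ in counts.most_common(top_k)]


def encode_value(dictionary, value):
    """Replace a known prefix by its code; code is None if no prefix matches."""
    # All prefixes share one length, so at most one of them can match.
    for prefix, code in dictionary.items():
        if value.startswith(prefix):
            return (code, value[len(prefix):])
    return (None, value)


def dictionary_encode(values, prefixes):
    """Step 3 (illustrative): dictionary-code the sampled prefixes."""
    dictionary = {prefix: code for code, prefix in enumerate(prefixes)}
    return dictionary, [encode_value(dictionary, v) for v in values]


def select_equal(dictionary, encoded, target):
    """Equality selection that compares encoded forms, without decoding rows."""
    key = encode_value(dictionary, target)  # encode the constant once
    return [row_id for row_id, entry in enumerate(encoded) if entry == key]
```

Under these assumptions, an equality predicate is evaluated by encoding the constant once and comparing it against the stored `(code, suffix)` pairs, which is one simple way a selection can run on compressed data without decompressing each row.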
Source
《计算机应用》
CSCD
PKU Core (Peking University Core Journals)
2016, No. 1, pp. 21-26, 51 (7 pages in total)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (81273649)
Natural Science Foundation of Heilongjiang Province (F201434)
Keywords
massive data compression
Column Index (CI)
Column Reality (CR)
relational algebraic operation