期刊文献+

压缩数据上的关系代数操作算法 被引量:2

Relational algebraic operation algorithm on compressed data
下载PDF
导出
摘要 针对在大数据管理中,在压缩的数据上无需解压即可进行相关操作的问题,在数据服从正态分布的前提下,根据列数据存储的特点,提出了一种新的面向列存储的压缩方法——CCA。首先,通过对列数据的长度进行归类;然后,采用抽样的方法获得重复度较高的前缀;最后,使用字典编码进行压缩,提出了列索引(CI)和列实体(CR)作为数据压缩结构来降低大数据存储的空间需求,从而直接有效地在压缩数据上支持选择、投影、连接等基本操作,并实现了基于CCA的数据库原型系统——D-DBMS。理论分析和在1 TB数据上的实验结果表明,该压缩算法能够显著提高大数据的存储效率和数据操作性能,与BAP和TIDC压缩方法相比,在压缩率分别提高了51%、14%;在执行速度上提高了47%、42%。 Since in the massive data management, the compressed data can be done some operations without decompressing first, under the condition of normal distribution, according to features of column data storage, a new compression algorithm which oriented column storage, called CCA( Column Compression Algorithm), was proposed. Firstly,the length of data was classified; secondly, the sampling method was used to get more repetitive prefix; finally the dictionary coding was utilized to compress, meanwhile the Column Index( CI) and Column Reality( CR) were acted as data compression structure to reduce storage requirement of massive data storage, thus the basic relational algebraic operations such as select,project and join were directly and effectively supported. A prototype database system based on CCA, called D-DBMS( DingDatabase Management System), was implemented. The theoretical analyses and the results of experiment on 1 TB data show that the proposed compression algorithm can significantly improve the performance of massive data storage efficiency and data manipulation. Compared to BAP( Bit Address Physical) and TIDC( Tuple ID Center) method, the compression rate of CCA was improved by 51% and 14%, and its running speed was improved by 47% and 42%.
出处 《计算机应用》 CSCD 北大核心 2016年第1期21-26,51,共7页 journal of Computer Applications
基金 国家自然科学基金资助项目(81273649) 黑龙江省自然科学基金资助项目(F201434)~~
关键词 大数据压缩 列索引 列实体 关系代数操作 massive data compression Column Index(CI) Column Reality(CR) relational algebraic operation
  • 相关文献

参考文献12

  • 1LIN Y, AGRAWAL D, CHEN C, et al. Llama: leveraging columnar storage for scalable join processing in the MapReduce framework [C]// Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. New York: ACM, 2011: 961-972.
  • 2WONG H K T, LI J, OLKENG F, et al. Bit transposition for very large scientific and statistical databases [J]. Algorithmica, 1986, 1(1): 289-309.
  • 3LI J, ROTEM D, WONG H K T. A new compression method with fast searching on large database [C]// Proceedings of the 13th International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann, 1987: 311-318.
  • 4LI J, SRIVASTAVA J. Efficient aggregation algorithms for com-pressed data warehouses [J]. IEEE transactions on knowledge and data engineering, 2002, 14(3): 515-529.
  • 5WU W, GAO H, LI J. New algorithm for computing cube on very large compressed data sets [J]. IEEE transactions on knowledge and engineering, 2006, 18(12): 1667-1680.
  • 6贾均刚,张炜,高宏.TIDC:一种基于属性划分的高频度关系数据压缩存储方法[C]//第二十五届中国数据库学术会议(NDBC2008)论文集.桂林:[出版者不详],2008:14-22.
  • 7王振玺,乐嘉锦,王梅,刘国华.列存储数据区级压缩模式与压缩策略选择方法[J].计算机学报,2010,33(8):1523-1530. 被引量:15
  • 8MVLLER I, RATSCH C, FRBER F. Adaptive string dictionary compression in in-memory column-store database systems [C]// Proceedings of the 17th International Conference on Extending Database Technology. Athens: [s.n.], 2014: 152-158.
  • 9FAUST M, SCHWALB D, PLATTNER H. Composite group-keys space-efficient indexing of multiple columns for compressed in-memory column stores [C]// IMDM 2013: Proceedings of the First and Second International Workshops on In-Memory Data Management and Analysis. Berlin: Springer, 2014: 42-54.
  • 10STONEBRAKER M, ABADI D J, BATKIN A, et al. C-store: a column-oriented DBMS [C]// Proceedings of the 31st International Conference on Very Large Data Bases. [S.l.]: VLDB Endowment, 2005: 553-564.

二级参考文献20

  • 1Stratos Idreos et al.Self-organizing tuple reconstruction in column-stores//Proceedings of the SIGMOD.Providence,Rhode Island,USA,2009:297-308.
  • 2Huffman D.A method for the construction of minimum-redundancy codes.IEEE Transactions on Information Theory,1952,9(40):1098-1101.
  • 3Witten I H,Neal R,Cleary J.Arithmetic coding for data compression.Communications of the ACM,1987,30(6):520-540.
  • 4Roth M A,Van Horn S J.Database compression.ACM SIGMOD Record,1993,22(3):31-39.
  • 5Tanaka H,Leon-Garcia A.Efficient run-length encodings.IEEE Transactions on Information Theory,1982,6(28):880-890.
  • 6Ziv J,Lempl A.A universal algorithm for sequential data compression.Proceedings of the IEEE Transactions on Information Theory,1977,22(1):337-343.
  • 7Abadi D J et al.Query execution in column-oriented database systems[Ph.D.dissertation].Cambridge,Massachusetts:Department of Electrical Engineering and Computer Science,Massachusetts Institute of Technology,2008.
  • 8Trondheim,Norway,Mike Stonebraker,Abadi D J et al.C-store-A column oriented DBMS//Proceedings of the 31st VLDB Conference.Trondheim,Norway,2005:553-564.
  • 9Weyla S,Friesb J,Wiederholdc G,Germano F.A modular self-describing clinical databank system.Computers and Biomedical Research,1975,8(3):279-293.
  • 10Wong H K T et al.Bit transposed files//Proceedings of the 11th International Conference on Very Large Data Bases Stockholm.Sweden,1985:448-457.

共引文献36

同被引文献20

引证文献2

二级引证文献16

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部