Relational algebraic operation algorithm on compressed data (压缩数据上的关系代数操作算法)

Cited by: 2


Abstract: In big data management, it is desirable to operate on compressed data without decompressing it first. Assuming that the data follow a normal distribution, and exploiting the characteristics of column-oriented storage, a new column-oriented compression method called CCA (Column Compression Algorithm) was proposed. First, column values were classified by length; second, sampling was used to obtain highly repeated prefixes; finally, dictionary coding was applied for compression. The Column Index (CI) and Column Reality (CR) were proposed as the compressed data structures to reduce the storage space required for massive data, so that basic relational algebraic operations such as select, project and join were supported directly and efficiently on the compressed data. A prototype database system based on CCA, called D-DBMS (DingDatabase Management System), was implemented. Theoretical analysis and experimental results on 1 TB of data show that the proposed compression algorithm significantly improves both the storage efficiency of massive data and the performance of data operations. Compared with the BAP (Bit Address Physical) and TIDC (Tuple ID Center) compression methods, CCA improved the compression rate by 51% and 14%, respectively, and the running speed by 47% and 42%, respectively.
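The abstract gives only the outline of CCA, so the following is a minimal sketch of the general idea rather than the authors' implementation: sample a column for frequent prefixes, dictionary-encode each value as a (prefix code, suffix code) pair, and answer an equality selection by comparing codes instead of decompressed strings. The function names, the fixed prefix length, and the pair layout are all assumptions made for illustration; the length-classification step and the CI and CR structures are not modelled here.

```python
import random
from collections import Counter


def sample_prefixes(values, prefix_len=4, sample_size=100, top_k=2, seed=0):
    """Hypothetical helper: return the most frequent fixed-length prefixes in a sample."""
    rng = random.Random(seed)
    sample = values if len(values) <= sample_size else rng.sample(values, sample_size)
    counts = Counter(v[:prefix_len] for v in sample if len(v) >= prefix_len)
    return [p for p, _ in counts.most_common(top_k)]


def split_value(value, prefixes):
    """Split a value into (prefix code, suffix); code -1 means no sampled prefix matched."""
    for code, prefix in enumerate(prefixes):
        if value.startswith(prefix):
            return code, value[len(prefix):]
    return -1, value


def compress_column(values, prefixes):
    """Dictionary-encode a column as a list of (prefix code, suffix code) pairs."""
    suffix_dict = {}
    encoded = []
    for v in values:
        code, suffix = split_value(v, prefixes)
        # Assign the next free integer to a suffix seen for the first time.
        suffix_code = suffix_dict.setdefault(suffix, len(suffix_dict))
        encoded.append((code, suffix_code))
    return suffix_dict, encoded


def select_eq(encoded, suffix_dict, prefixes, constant):
    """Equality selection on the compressed column: encode the constant once,
    then compare integer pairs; no stored value is decompressed."""
    code, suffix = split_value(constant, prefixes)
    if suffix not in suffix_dict:
        return []  # the constant cannot occur in the column
    target = (code, suffix_dict[suffix])
    return [row for row, pair in enumerate(encoded) if pair == target]


if __name__ == "__main__":
    column = ["http://a.com", "http://b.com", "ftp://x", "http://a.com"]
    prefixes = sample_prefixes(column, prefix_len=7, top_k=1)
    suffix_dict, encoded = compress_column(column, prefixes)
    print(encoded)
    print(select_eq(encoded, suffix_dict, prefixes, "http://a.com"))
```

Because each value maps to exactly one pair and each pair decodes to exactly one value, comparing pairs gives the same answer as comparing the original strings. An equi-join could reuse the same idea only if both columns share the same dictionaries, which is an additional assumption.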

Authors: DING Xinzhe (丁鑫哲), ZHANG Zhaogong (张兆功), LI Jianzhong (李建中), TAN Long (谭龙), LIU Yong (刘勇)

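Affiliations: School of Computer Science and Technology, Heilongjiang University; School of Computer Science and Technology, Harbin Institute of Technology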

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core list, 2016, No. 1, pp. 21-26, 51 (7 pages in total)

Funding: National Natural Science Foundation of China (81273649); Natural Science Foundation of Heilongjiang Province (F201434)

Keywords: massive data compression; Column Index (CI); Column Reality (CR); relational algebraic operation

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]







