Abstract
In a distributed system, the data storage structure directly affects the storage efficiency and processing performance of big data. With a row store, data is read from the local node and loads quickly, but compression efficiency is low, data is stored redundantly, and queries also load columns they do not need. With a column store, compression efficiency is high, but accessing data across nodes adds network transfer overhead. To address the shortcomings of both structures, this paper proposes a storage structure that combines row and column storage. Experimental results show that the improved structure loads data slightly more slowly than the row store and compresses data more efficiently than both the row store and the column store. The row-column store avoids the extra disk I/O of the row store and reduces the unnecessary network transfer of the column store, which greatly improves the storage efficiency and processing performance of big data in distributed systems.
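The abstract does not spell out the layout itself, so the following is only a minimal sketch of one common way to combine row and column storage: rows are first split horizontally into row groups (so every column of a record stays in the same group, and hence on the same node), and each group is then stored column by column and compressed per column. The function names, the group size, and the use of `zlib` and JSON are illustrative assumptions, not details taken from the paper.

```python
import json
import zlib
from typing import Dict, List

Row = Dict[str, str]
EncodedGroup = Dict[str, bytes]


def to_row_groups(rows: List[Row], group_size: int) -> List[List[Row]]:
    """Horizontal partitioning: cut the table into fixed-size row groups."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def encode_group(group: List[Row], columns: List[str]) -> EncodedGroup:
    """Vertical layout inside one group: each column is serialized and
    compressed on its own, so similar values sit next to each other."""
    encoded: EncodedGroup = {}
    for col in columns:
        values = [row[col] for row in group]
        encoded[col] = zlib.compress(json.dumps(values).encode("utf-8"))
    return encoded


def read_column(groups: List[EncodedGroup], column: str) -> List[str]:
    """Read a single column across all groups without touching the others."""
    values: List[str] = []
    for group in groups:
        values.extend(json.loads(zlib.decompress(group[column]).decode("utf-8")))
    return values


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    table = [{"id": str(i), "city": "Handan"} for i in range(5)]
    stored = [encode_group(g, ["id", "city"]) for g in to_row_groups(table, 2)]
    print(read_column(stored, "id"))
```

In this sketch a query that needs only `id` decompresses only the `id` blocks, which is the disk I/O saving attributed to column storage, while reassembling a full record never requires data from another row group.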
Source
《河北工程大学学报(自然科学版)》
CAS
2014, No. 4, pp. 69-73 (5 pages)
Journal of Hebei University of Engineering: Natural Science Edition
Keywords
big data
distributed system
row-column store