Abstract
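With today's explosive data growth, the large-scale data produced in fields such as telecommunications, finance, and the Internet puts unprecedented pressure on the industry in terms of storage and querying. Against this background, current database and data warehouse systems compress and encode data, saving space while reducing the I/O needed for table queries and thereby improving performance. However, when faced with real large-scale enterprise data applications, most systems still cannot fully meet enterprise requirements in compression ratio, import time, or query performance. By re-encoding and condensing data according to certain rules, we implement HEGA-STORE, a new database system with ultra-compact encoding. The system adopts a hybrid row-column storage architecture; proposes a data import and storage plan based on mining and encoding intra-column and inter-column rules; and uses the GPU as a coprocessor to run the rule mining and encoding algorithms in parallel, improving efficiency. We developed a prototype encoding and decoding system and tested import and query separately on large-scale NetEase Yixin communication records and NetEase back-end log data, comparing the prototype against other compression and encoding algorithms and against database and data warehouse products. The comparative experiments show that, relative to comparable database and data warehouse products, the prototype achieves a very high compression ratio and also leads in import speed and full-table-scan query speed; using the GPU and CPU cooperatively for data processing further improves system performance. These results verify the practical value of the proposed ultra-compact encoding database system.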
In the big data era, business applications generate huge volumes of data, making those data extremely challenging to store and manage. One solution adopted in previous database systems is to employ encoding techniques, which can effectively reduce the size of the data and consequently improve query performance. However, existing encoding approaches still cannot achieve a good tradeoff among compression ratio, import time, and query performance. To address this problem, we propose a new encoding-based database system, HEGA-STORE, which adopts a hybrid row-oriented and column-oriented storage model. In HEGA-STORE, we design a GPU-assisted encoding scheme that combines rule-based encoding with conventional compression algorithms. By exploiting the computing power of the GPU, we improve the performance of the encoding and decoding algorithms. To evaluate HEGA-STORE, we deployed it at NetEase to support log analysis. We compare HEGA-STORE with other database systems, and the results show that HEGA-STORE provides better performance for data import and query processing. It is a highly compact encoding database for big data applications.
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Core Journals index)
2015, No. 2, pp. 362-376 (15 pages)
Journal of Computer Research and Development
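Funding
National Key Technology R&D Program of China (2013BAG06B01)
National High Technology Research and Development Program of China (863 Program) (SS2013AA040601)
National Natural Science Foundation of China (61472348)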