Abstract
Traditional data processing systems have limited data storage and processing capacity and cannot meet the demands of processing large volumes of data. To exploit the value of data and to process large data sets efficiently and with high performance, a big data analysis and processing system is proposed that is built on Spark and incorporates ideas from SIMBA, with retrieval performed through Spark SQL queries. An index management mechanism is embedded in Spark and encapsulated in RDDs to improve query efficiency, and data is stored in a segment tree to speed up retrieval. During preprocessing, the data is partitioned with a Range Partitioner strategy, and queries are answered using global filtering and local indexes. This is intended to ensure that the system maintains high throughput and low latency during query operations, improving query efficiency.
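The combination of range partitioning, global filtering, and local indexing named in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' implementation and uses neither the Spark nor the SIMBA API: the function names are hypothetical, and a sorted list searched with binary search stands in for the segment tree used as the local index in the paper.

```python
from bisect import bisect_left, bisect_right


def range_partition(keys, num_partitions):
    """Split keys into contiguous sorted ranges of roughly equal size."""
    ordered = sorted(keys)
    if not ordered:
        return []
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def build_global_index(partitions):
    """Record (min_key, max_key) per partition; used for global filtering."""
    return [(part[0], part[-1]) for part in partitions]


def range_query(partitions, global_index, low, high):
    """Return all keys k with low <= k <= high."""
    result = []
    for part, (part_min, part_max) in zip(partitions, global_index):
        # Global filter: skip partitions whose key range cannot overlap.
        if part_max < low or part_min > high:
            continue
        # Local index (assumption: sorted list + binary search,
        # standing in for the paper's segment tree).
        start = bisect_left(part, low)
        end = bisect_right(part, high)
        result.extend(part[start:end])
    return result


keys = [42, 7, 19, 88, 3, 56, 71, 25, 64, 11, 93, 30]
partitions = range_partition(keys, 4)
global_index = build_global_index(partitions)
hits = range_query(partitions, global_index, 20, 60)
print(hits)
```

In this sketch the global index lets a query skip two of the four partitions outright, so only the partitions whose key ranges overlap the query are searched locally.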
Source
《计算机应用与软件》
Peking University Core Journal
2018, Issue 2, pp. 96-101 (6 pages)
Computer Applications and Software
Funding
Key Project of Natural Science Research of Universities in Anhui Province (KJ2015A130)