Abstract
The exponential growth of data poses serious challenges for data management and analysis. The aggregate-join query is a common operation in data analysis, and MapReduce is a programming model for parallel processing of large-scale datasets, so research on MapReduce-based aggregate-join query algorithms has both academic significance and practical value. First, by summarizing and extending existing join algorithms, four MapReduce-based aggregate-join query algorithms are identified. Second, two further implementation algorithms are proposed for different application scenarios. The paper also argues that I/O cost is the main factor determining the performance of MapReduce-based aggregate-join query algorithms. Finally, extensive experiments analyze the strengths and weaknesses of the six algorithms under different query workloads, summarize the scenarios to which each is suited, and examine the relationship between each algorithm's performance and the characteristics of the data.
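As a rough illustration of what an aggregate-join query looks like in the MapReduce model, the sketch below simulates a generic reduce-side join with per-key aggregation in memory. It is not one of the six algorithms evaluated in the paper, whose details the abstract does not give. The table layouts and the names `map_phase`, `shuffle`, `reduce_join_aggregate`, and `aggregate_join` are hypothetical, and the query is assumed to be an inner join of customers and orders on a customer id, summing order amounts per customer.

```python
from collections import defaultdict


def map_phase(customers, orders):
    """Emit (join_key, tagged_value) pairs; the tag records the source table."""
    for cust_id, name in customers:
        yield cust_id, ("C", name)
    for cust_id, amount in orders:
        yield cust_id, ("O", amount)


def shuffle(pairs):
    """Group mapper output by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join_aggregate(values):
    """Join both sides for one key and return the summed amount, or None if no match."""
    has_customer = any(tag == "C" for tag, _ in values)
    amounts = [v for tag, v in values if tag == "O"]
    if not has_customer or not amounts:
        return None  # inner join: keys missing on either side are dropped
    return sum(amounts)


def aggregate_join(customers, orders):
    """Run map, shuffle, and reduce in sequence and collect the per-key totals."""
    result = {}
    for key, values in shuffle(map_phase(customers, orders)).items():
        total = reduce_join_aggregate(values)
        if total is not None:
            result[key] = total
    return result


if __name__ == "__main__":
    customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
    orders = [(1, 10), (1, 5), (2, 7), (4, 99)]
    print(aggregate_join(customers, orders))
```

In this sketch every input record crosses the shuffle before any aggregation happens, which hints at why the volume of data read, written, and transferred can dominate the cost of such queries.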
Source
Journal of Computer Research and Development (《计算机研究与发展》)
EI
CSCD
PKU Core (北大核心)
2013, Issue S1, pp. 306-311 (6 pages)
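Funding
National Natural Science Foundation of China (61202088)
Natural Science Foundation of Liaoning Province (200102059)
Fundamental Research Funds for the Central Universities (N120817001)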