Partition-Based Online Aggregation with Shared Sampling in the Cloud (cited by: 2)


Abstract: Online aggregation is an attractive sampling-based technique that answers aggregation queries with an estimate of the final result, whose confidence interval tightens over time. It has been built into a MapReduce-based cloud system for big data analytics, allowing users to monitor query progress and save money by killing the computation early once sufficient accuracy has been obtained. However, the gap between the current mechanism of the MapReduce paradigm and the requirements of online aggregation limits its performance in several ways, such as: 1) low sampling efficiency, because online aggregation in MapReduce does not account for skewed data distributions, and 2) large redundant I/O cost, caused by the independent job execution mechanism of MapReduce. In this paper, we present OLACloud, a MapReduce-based cloud system that supports online aggregation over different data distributions and large-scale concurrent query processing. We propose a content-aware repartition method with a fair-allocation block placement strategy to increase sampling efficiency while balancing storage and computation load. We also develop a shared sampling method that shares sampling opportunities among multiple queries to reduce redundant I/O cost. We implement OLACloud in Hadoop and conduct an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of OLACloud.
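The core idea in the abstract, an estimate whose confidence interval tightens as more samples arrive, can be illustrated with a minimal single-machine sketch. This is not OLACloud's implementation: the function name `online_avg`, the `report_every` parameter, and the synthetic exponential data are all assumptions made for illustration, and the interval is a plain normal-approximation (CLT) interval that assumes rows arrive in random order.

```python
import math
import random


def online_avg(stream, z=1.96, report_every=1000):
    """Yield (rows_seen, running_mean, ci_half_width) while scanning a stream.

    Hypothetical helper: assumes rows arrive in random order, so the rows
    seen so far behave like a simple random sample of the whole dataset.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            sample_var = m2 / (n - 1)
            # Normal-approximation half-width; z=1.96 gives roughly 95%.
            yield n, mean, z * math.sqrt(sample_var / n)


if __name__ == "__main__":
    rng = random.Random(0)
    # Skewed synthetic data standing in for a table column (assumption).
    data = [rng.expovariate(1.0) for _ in range(20000)]
    rng.shuffle(data)
    for n, est, hw in online_avg(data, report_every=5000):
        print(f"rows={n:6d}  AVG ~ {est:.4f} +/- {hw:.4f}")
```

A user watching this output could stop the scan as soon as the half-width falls below a tolerance they care about, which is the early-termination behavior the abstract describes.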

Authors: Yu-Xiang Wang (王宇翔), Jun-Zhou Luo (罗军舟), Ai-Bo Song (宋爱波), Fang Dong (东方)

Affiliation: School of Computer Science and Engineering, Southeast University

Source: Journal of Computer Science & Technology (计算机科学技术学报（英文版）), indexed in SCIE, EI, and CSCD; 2013, No. 6, pp. 989-1011 (23 pages)

Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2010CB328104; the National Natural Science Foundation of China under Grant Nos. 61070161, 61202449, and 61320106007; the National High Technology Research and Development 863 Program of China under Grant No. 2013AA013503; the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20110092130002; the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201; the Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grant No. 93K-9; and the Shanghai Key Laboratory of Scalable Computing and Systems of China under Grant No. 2010DS680095.

Keywords: cloud, MapReduce, partition, online aggregation, shared sampling

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]





