Dynamic Data-Partitioned Online Aggregation (动态分片在线聚集)

Abstract: To avoid the performance degradation caused by random I/O, traditional online aggregation algorithms assume that the source data are already approximately randomly distributed in the data file, so that sequential access is roughly equivalent to random sampling over the data. This assumption does not hold in many real scenarios, which leads to large errors in the query results. The authors propose a new method, dynamic data-partitioned online aggregation (DDPOA). DDPOA logically splits the data into disjoint partitions, each consisting of consecutive data items in the data file, computes an estimate on each partition independently, and then uses a linear combination of these values to estimate the final result over the full dataset. This weakens the randomization requirement on the dataset as a whole and makes the estimates more accurate. Accessing partitioned data can lower performance because of random disk I/O, so DDPOA dynamically adjusts the partitions during execution: adjacent partitions that are judged similar enough are merged into one, which improves I/O performance without losing accuracy. Experiments on a real dataset from the network security monitoring system DBroker show that DDPOA is much more accurate than traditional online aggregation, with comparable completion time and little performance overhead. On datasets that satisfy the randomization assumption, DDPOA performs as well as the traditional algorithms.
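The idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the fixed scanned fraction, and the similarity test (absolute difference of sampled means within `tol`) are all assumptions made for illustration, since the abstract does not state the actual merge criterion or the weights of the linear combination. Here each partition is weighted by its size, and each merged partition is simply re-sampled from its own prefix.

```python
from statistics import mean


def partition_bounds(n, k):
    """Split indices 0..n-1 into k contiguous, disjoint [start, end) ranges (assumes k <= n)."""
    return [(i * n // k, (i + 1) * n // k) for i in range(k)]


def sample_mean(data, start, end, fraction):
    """Mean of the leading `fraction` of one partition, read sequentially."""
    m = max(1, int((end - start) * fraction))
    return mean(data[start:start + m])


def estimate_sum(data, bounds, fraction):
    """Estimate SUM(data) as a size-weighted linear combination of per-partition means."""
    return sum((end - start) * sample_mean(data, start, end, fraction)
               for start, end in bounds)


def merge_similar(data, bounds, fraction, tol):
    """Merge adjacent partitions whose sampled means differ by at most `tol`.

    The similarity test is a hypothetical stand-in for the paper's criterion.
    """
    merged = [bounds[0]]
    for start, end in bounds[1:]:
        prev_start, prev_end = merged[-1]
        a = sample_mean(data, prev_start, prev_end, fraction)
        b = sample_mean(data, start, end, fraction)
        if abs(a - b) <= tol:
            merged[-1] = (prev_start, end)  # fewer partitions, fewer seeks
        else:
            merged.append((start, end))
    return merged


# A file whose values are clustered rather than randomly placed.
data = [1.0] * 500 + [9.0] * 500

single = estimate_sum(data, partition_bounds(len(data), 1), 0.1)
bounds = partition_bounds(len(data), 4)
partitioned = estimate_sum(data, bounds, 0.1)
merged_bounds = merge_similar(data, bounds, 0.1, tol=0.5)

print(single, partitioned, merged_bounds)
```

On this clustered input, scanning the head of the whole file sees only the small values, while per-partition scanning sees both clusters; merging then collapses the four partitions into two without changing the estimate.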

Authors: An Mingyuan (安明远), Sun Xiuming (孙秀明), Sun Ninghui (孙凝晖)

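Affiliations: Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences; Institute of Electronics, Chinese Academy of Sciences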

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2010, No. 11, pp. 1928-1935 (8 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

Funding: National High Technology Research and Development Program of China ("863" Program), grant 2006AA01A102

Keywords: database; approximate query; online aggregation; sampling; dynamic partition

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
