Journal Articles
4 articles found
Research Progress on Human-in-the-Loop Data Preparation Techniques (Cited: 7)
Authors: FAN Ju, CHEN Yueguo, DU Xiaoyong. Big Data, 2019, Issue 6, pp. 1-18 (18 pages).
With the rapid development of data analysis technology, data preparation has increasingly become a bottleneck. Against the background of real data analysis scenarios, this paper analyzes the two core challenges of data preparation: high labor cost and long time cycles. On this basis, it surveys research progress on human-in-the-loop data preparation techniques. Interactive data preparation techniques target end users: they predict user intent through interaction and save preparation time via effective prediction algorithms. Crowdsourcing-based data preparation techniques recruit the massive user base of the Internet as crowd workers to extend computing capability, supporting basic data preparation tasks, and study how to perform quality control and cost optimization for crowdsourcing. Finally, the paper summarizes human-in-the-loop data preparation and discusses future challenges.
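The crowdsourced quality control mentioned in this abstract is commonly achieved by assigning each task to several workers and aggregating their answers; a minimal sketch using majority voting (a generic illustration, not a method taken from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate redundant crowd answers for one task by majority voting."""
    counts = Counter(answers)
    label, _ = counts.most_common(1)[0]
    return label

# Three workers judge whether two records refer to the same entity.
votes = ["match", "match", "no-match"]
print(majority_vote(votes))  # match
```

Real systems typically weight votes by estimated worker accuracy, but simple majority voting is the usual baseline.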
Keywords: data governance, data preparation, crowdsourcing, interaction mechanism
Efficient query processing framework for big data warehouse: an almost join-free approach (Cited: 3)
Authors: Huiju WANG, Xiongpai QIN, Xuan ZHOU, Furong LI, Zuoyan QIN, Qing ZHU, Shan WANG. Frontiers of Computer Science (SCIE, EI, CSCD), 2015, Issue 2, pp. 224-236 (13 pages).
The rapidly increasing scale of data warehouses is challenging today's data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema: it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users' demands. This model is space-economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prohibits a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, while numerous join results could actually be reused by different queries. In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the post-processing phase, so that data warehouse queries can be transformed into massively parallelized filter-aggregation operations on the fact table. In contrast to conventional query processing models, our approach is efficient, scalable and stable despite the large number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.
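The core transformation the abstract describes is replacing a runtime star-join with a single filter-aggregation scan over a fact table whose dimension attributes were materialized during pre-processing. A minimal sketch (table and column names are illustrative assumptions, not from the paper):

```python
# Pre-processing has already denormalized dimension attributes
# (region, year) into the fact rows, so no join is needed at query time.
fact = [
    {"region": "Asia",   "year": 2014, "revenue": 120.0},
    {"region": "Asia",   "year": 2015, "revenue": 150.0},
    {"region": "Europe", "year": 2015, "revenue": 90.0},
]

def filter_aggregate(rows, pred, measure):
    """One embarrassingly parallel pass: filter each row, then aggregate."""
    return sum(r[measure] for r in rows if pred(r))

# "Total 2015 revenue" needs no dimension-table join, just a scan.
total = filter_aggregate(fact, lambda r: r["year"] == 2015, "revenue")
print(total)  # 240.0
```

Because each row is filtered independently, the scan partitions cleanly across MapReduce workers, which is what makes the approach scale.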
Keywords: data warehouse, large scale, TAMP, join-free, multi-version schema
MiNT-OLAP cluster: minimizing network transmission cost in OLAP cluster for main memory analytical database (Cited: 1)
Authors: Min JIAO, Yansong ZHANG, Zhanwei WANG, Shan WANG. Frontiers of Computer Science (SCIE, EI, CSCD), 2012, Issue 6, pp. 668-676 (9 pages).
Powerful storage, high performance and scalability are the most important issues for analytical databases. These three factors interact with one another: powerful storage needs less scalability but higher performance; high performance means less consumption of indexes and other materializations for storage, and fewer processing nodes; larger scale relieves the stress on powerful storage and on the high-performance processing engine. Some analytical databases (ParAccel, Teradata) bind their performance to advanced hardware support, some (Asterdata, Greenplum) rely on the highly scalable framework of MapReduce, and some (MonetDB, Sybase IQ, Vertica) emphasize performance in the processing engine and storage engine. All these approaches can be integrated into a storage-performance-scalability (S-P-S) model, and future large-scale analytical processing can be built on moderate clusters to minimize dependency on expensive hardware. Most importantly, a simple software framework is fundamental to keeping pace with the development of hardware technologies. In this paper, we propose a schema-aware on-line analytical processing (OLAP) model with deep optimization based on native features of the star or snowflake schema. The OLAP model divides the whole process into several stages, each stage piping its output to the next; we minimize the size of the output data in each stage, whether in centralized or clustered processing. We extend this mechanism to cluster processing using two major techniques: one uses NetMemory as a broadcast-protocol-based dimension-mirror synchronizing buffer; the other is a predicate-vector based DDTA-OLAP cluster model that minimizes the data dependency of the star-join using bitmap vectors.
Our OLAP model aims to minimize network transmission cost (MiNT in short) for OLAP clusters and supports a scalable but simple distributed storage model for large-scale clustered processing. Finally, the experimental results show the speedup and scalability performance.
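The predicate-vector idea the abstract relies on can be sketched compactly: each dimension predicate is evaluated once into a bitmap indexed by the dimension's surrogate key, and the fact-table scan then probes those bitmaps instead of joining, so no dimension tuples are shipped over the network. The names and toy data below are illustrative assumptions, not taken from the paper:

```python
# Dimension tables keyed by surrogate key.
dim_date = {0: 2011, 1: 2012, 2: 2012}   # key -> year
dim_cust = {0: "CN", 1: "US"}            # key -> nation

# Predicate vectors: one bit per dimension row, computed once.
pv_date = [year == 2012 for _, year in sorted(dim_date.items())]
pv_cust = [nation == "CN" for _, nation in sorted(dim_cust.items())]

fact = [  # (date_key, cust_key, amount)
    (0, 0, 10), (1, 0, 20), (2, 1, 30), (2, 0, 40),
]

# The star-join degenerates into cheap bitmap probes during the scan.
result = sum(a for dk, ck, a in fact if pv_date[dk] and pv_cust[ck])
print(result)  # 60
```

Broadcasting the small bitmaps (rather than dimension tuples or join results) is what keeps network transmission cost low in a clustered setting.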
Keywords: OLAP cluster, MiNT, NetMemory, schema-aware OLAP
HC-Store: putting MapReduce's foot in two camps
Authors: Huiju WANG, Furong LI, Xuan ZHOU, Yu CAO, Xiongpai QIN, Jidong CHEN, Shan WANG. Frontiers of Computer Science (SCIE, EI, CSCD), 2014, Issue 6, pp. 859-871 (13 pages).
MapReduce is a popular framework for large-scale data analysis. As data access is critical to MapReduce's performance, some recent work has applied different storage models, such as column-store or PAX-store, to MapReduce platforms. However, the data access patterns of different queries vary widely, and no single storage model achieves optimal performance alone. In this paper, we study how MapReduce can benefit from the presence of two different column-store models: pure column-store and PAX-store. We propose a hybrid storage system called hybrid column-store (HC-Store). Based on the characteristics of the incoming MapReduce tasks, our storage model determines whether to access the underlying pure column-store or PAX-store. We studied the properties of the different storage models and created a cost model to decide the data access strategy at runtime. We have implemented HC-Store on top of Hadoop. Our experimental results show that HC-Store outperforms both PAX-store and column-store, especially when confronted with diverse workloads.
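A runtime cost model of the kind this abstract describes can be sketched as a comparison of estimated I/O costs per storage layout. The formulas and constants below are illustrative assumptions in the spirit of HC-Store, not the paper's actual model:

```python
def pick_store(cols_accessed, total_cols, seek_cost=10.0, col_io=1.0):
    """Choose a storage model for a task from a toy I/O cost estimate.

    A pure column-store reads only the needed columns but pays one seek
    per column; PAX reads whole pages (all columns) with a single seek.
    """
    cost_column = cols_accessed * (seek_cost + col_io)
    cost_pax = seek_cost + total_cols * col_io
    return "column-store" if cost_column < cost_pax else "PAX-store"

# Narrow projections favor column-store; wide ones favor PAX.
print(pick_store(cols_accessed=2, total_cols=50))   # column-store
print(pick_store(cols_accessed=20, total_cols=50))  # PAX-store
```

The decision flips with the fraction of columns a task touches, which is exactly why a hybrid of the two layouts can beat either one across a diverse workload.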
Keywords: MapReduce, Hadoop, HC-Store, cost model, column-store, PAX-store