期刊文献+
共找到2篇文章
< 1 >
每页显示 20 50 100
CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing 被引量:3
1
作者 An-Zhen Zhang Jian-Zhong Li +3 位作者 Hong Gao Yu-Biao Chen Heng-Zhao Ma Mohamed Jaward Bah 《Journal of Computer Science & Technology》 SCIE EI CSCD 2018年第2期366-379,共14页
Recently there is an increasing need for interactive human-driven analysis on large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before a long wait of the final accurate q... Recently there is an increasing need for interactive human-driven analysis on large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before a long wait of the final accurate query result, has drawn significant research attention. However, the direct processing of OLA on duplicate data will lead to incorrect query answers, since sampling from duplicate records leads to an over representation of the duplicate data in the sample. This violates the prerequisite of uniform distributions in most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, Crow~ dOLA retrieves block-level samples continuously from the dataset, and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After cleaning the sample, an unbiased estimator is provided to address the error bias that is introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy. 展开更多
关键词 online aggregation entity resolution crowdsourcing cloud computing
原文传递
Partition-Based Online Aggregation with Shared Sampling in the Cloud 被引量:2
2
作者 王宇翔 罗军舟 +1 位作者 宋爱波 东方 《Journal of Computer Science & Technology》 SCIE EI CSCD 2013年第6期989-1011,共23页
Online aggregation is an attractive sampling-based technology to response aggregation queries by an estimate to the final result, with the confidence interval becoming tighter over time. It has been built into a MapRe... Online aggregation is an attractive sampling-based technology to response aggregation queries by an estimate to the final result, with the confidence interval becoming tighter over time. It has been built into a MapReduce-based cloud system for big data analytics, which allows users to monitor the query progress, and save money by killing the computation early once sufficient accuracy has been obtained. However, there are several limitations that restrict the performance of online aggregation generated from the gap between the current mechanism of MapHeduce paradigm and the requirements of online aggregation, such as: 1) the low sampling efficiency due to the lack of consideration of skewed data distribution for online aggregation in MapReduce, and 2) the large redundant I/O cost of online aggregation caused by the independent job execution mechanism of MapReduce. In this paper, we present OLACloud, a MapReduce-based cloud system to well support online aggregation for different data distributions and large-scale concurrent query processing. We propose a content-aware repartition method with a fair-allocation block placement strategy to increase the sampling efficiency and guarantee the storage and computation load balancing simultaneously. We also develop a shared sampling method to share the sampling opportunities among multiple queries to reduce redundant I/O cost. We also implement OLACloud in Hadoop, and conduct an extensive experimental study on the TPC-H benchmark for skewed data distribution. Our results demonstrate the efficiency and effectiveness of OLACloud. 展开更多
关键词 CLOUD MAPREDUCE PARTITION online aggregation shared sampling
原文传递
上一页 1 下一页 到第
使用帮助 返回顶部