Parallel Random Sampling of Massive Data on the Hadoop Platform (cited by: 11)
1
Authors: Wan Wan (宛婉), Zhou Guoxiang (周国祥). Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 20, pp. 115-118 (4 pages)
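In today's era of "information explosion", massive data poses new challenges for data mining. As data mining moves to cloud computing platforms for parallelization, this paper studies parallel random sampling as a way to further reduce the volume of data to be processed. It proposes a MapReduce parallel sampling algorithm that cleans dirty data and performs equal-probability sampling in a single scan. The algorithm is implemented on the Hadoop platform and compared with an ordinary random sampling method; the comparison shows very high time efficiency, indicating that the approach is effective. The work lays a foundation for future research on sampling in data mining and for advancing data mining on massive data.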
Keywords: cloud computing, Hadoop, MapReduce, parallel computing, data mining, random sampling
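The abstract above does not give the paper's algorithm, so the following is only a minimal sketch of one well-known way to combine cleaning and equal-probability sampling in a single pass in a map/reduce style: tag each clean record with an independent uniform random key, keep the k smallest keys per partition, then keep the k smallest overall. The names `is_clean`, `map_sample`, and `reduce_sample` and the cleaning rule are hypothetical, and plain Python lists stand in for Hadoop input splits.

```python
import heapq
import random


def is_clean(record):
    # Hypothetical cleaning rule (an assumption): drop missing or blank records.
    return record is not None and record.strip() != ""


def map_sample(partition, k, rng):
    # Single scan of one partition: skip dirty records, tag each clean record
    # with a uniform random key, and keep only the k smallest-keyed pairs.
    tagged = ((rng.random(), rec) for rec in partition if is_clean(rec))
    return heapq.nsmallest(k, tagged)


def reduce_sample(partials, k):
    # Merge the per-partition candidates and keep the k smallest keys overall.
    # Because the keys are i.i.d. uniform, every clean record is equally
    # likely to end up in the final sample.
    merged = heapq.nsmallest(k, (pair for part in partials for pair in part))
    return [rec for _, rec in merged]


if __name__ == "__main__":
    rng = random.Random(42)
    partitions = [["a", "", "b"], [None, "c", "d", "e"]]
    partials = [map_sample(p, 3, rng) for p in partitions]
    print(reduce_sample(partials, 3))
```

Each mapper emits at most k candidates, so the reducer sees at most k records per partition regardless of input size.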
HEDC++: An Extended Histogram Estimator for Data in the Cloud
2
Authors: Shi Yingjie (史英杰), Meng Xiaofeng (孟小峰), Fusheng Wang, Gan Yantao (干艳桃). Journal of Computer Science & Technology, SCIE, EI, CSCD, 2013, No. 6, pp. 973-988 (16 pages)
With the increasing popularity of cloud-based data management, improving the performance of queries in the cloud is an urgent issue. Summaries of data distribution and statistical information have been commonly used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could be used to support query optimization and efficient utilization of computing resources in the cloud. Histograms could provide helpful reference information for generating optimal query plans, and generate basic statistics useful for guaranteeing the load balance of query processing in the cloud. Since it is too expensive to construct an exact histogram on massive data, building an approximate histogram is a more feasible solution. This problem, however, is challenging to solve in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC++, an extended histogram estimator for data in the cloud, which provides efficient approximation approaches for both equi-width and equi-depth histograms. We design the histogram estimate workflow based on an extended MapReduce framework, and propose novel sampling mechanisms to balance sampling efficiency and estimate accuracy. We experimentally validate our techniques on Hadoop, and the results demonstrate that HEDC++ can provide promising histogram estimates for massive data in the cloud.
Keywords: histogram estimate, sampling, cloud computing, MapReduce
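To make the two histogram types named in the abstract concrete, here is a minimal sketch of estimating each from a random sample. It is not HEDC++ itself: the function names, the simple scale-up by `total / len(sample)`, and the assumption that all sample values lie in `[lo, hi]` are illustrative choices only.

```python
def equi_width_estimate(sample, lo, hi, buckets, total):
    # Equi-width: buckets of equal value range. Count sample values per
    # bucket, then scale counts up to the full data size `total`.
    # Assumes every value in `sample` lies within [lo, hi].
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for x in sample:
        i = min(int((x - lo) / width), buckets - 1)  # clamp x == hi
        counts[i] += 1
    scale = total / len(sample)
    return [c * scale for c in counts]


def equi_depth_bounds(sample, buckets):
    # Equi-depth: buckets holding roughly equal numbers of records.
    # The bucket boundaries are estimated by the sample quantiles.
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // buckets] for i in range(1, buckets)]


if __name__ == "__main__":
    sample = list(range(100))  # stand-in for a random sample of a larger set
    print(equi_width_estimate(sample, 0, 100, 4, 1000))
    print(equi_depth_bounds(sample, 4))
```

The equi-width estimate fixes the boundaries and approximates the counts, while the equi-depth estimate fixes the counts and approximates the boundaries.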