Abstract
When classifying massive data sets, time and space complexity become the bottleneck of traditional algorithms. Building on an analysis of the traditional BP-AdaBoost algorithm, we propose a MapReduce parallelization of it for cloud computing platforms. The Map function handles the computation of each weak classifier's prediction error ε_t and the relabeling step; the Reduce function merges the intermediate results from the Map function into an average error, which the next round of MapReduce computation uses. Deployed on a Hadoop cluster, the improved algorithm achieves efficient parallel strong classification of massive data. Three comparative experiments on the cluster verify the feasibility of the algorithm: it can handle massive data, reduces the time complexity, and shows good speedup and accuracy.
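The division of labor between Map and Reduce in one boosting round can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: a threshold stump stands in for the BP neural-network weak classifier, Hadoop is simulated with ordinary function calls over in-memory splits, and the names map_split, reduce_errors, and boosting_round are hypothetical. The reducer here merges weighted error mass into a global ε_t, and the weight update follows the standard AdaBoost formula, which the abstract does not spell out.

```python
import math
from functools import reduce

def map_split(split, classifier):
    """Map step (sketch): weighted error mass and total weight of one data split."""
    err = sum(w for x, y, w in split if classifier(x) != y)
    total = sum(w for _, _, w in split)
    return err, total

def reduce_errors(a, b):
    """Reduce step (sketch): merge per-split (error mass, weight mass) pairs."""
    return a[0] + b[0], a[1] + b[1]

def boosting_round(splits, classifier):
    """One simulated MapReduce round; returns eps_t, alpha_t, and reweighted splits."""
    err, total = reduce(reduce_errors, (map_split(s, classifier) for s in splits))
    eps = err / total
    # Standard AdaBoost classifier weight (assumption); requires 0 < eps < 1.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Raise the weight of misclassified samples, lower the rest.
    updated = [
        [(x, y, w * math.exp(alpha if classifier(x) != y else -alpha))
         for x, y, w in s]
        for s in splits
    ]
    # Renormalize so the weights again sum to 1 for the next round.
    z = sum(w for s in updated for _, _, w in s)
    updated = [[(x, y, w / z) for x, y, w in s] for s in updated]
    return eps, alpha, updated

# Toy data: (feature, label, weight) triples spread over two splits.
stump = lambda x: 1 if x > 0 else -1  # hypothetical stand-in for a BP network
splits = [
    [(-2.0, -1, 0.25), (-1.0, -1, 0.25)],
    [(1.0, 1, 0.25), (2.0, -1, 0.25)],
]
eps, alpha, new_splits = boosting_round(splits, stump)
```

In this toy run the stump misclassifies only the last sample, and after renormalization that sample carries half of the total weight, so the next weak classifier is pushed to get it right.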
Source
《计算机应用与软件》
CSCD
PKU Core Journal (北大核心)
2014, No. 8, pp. 261-264 (4 pages)
Computer Applications and Software
Funding
National Natural Science Foundation of China (31271615)