Abstract
In recent years, parallel algorithms for machine learning and data mining on big data have become an important research topic in the big data field. Spark provides a programming interface called SparkR that lets data analysts in general application domains use the familiar R language to carry out data analysis and computation on the Spark platform. This paper presents the design and implementation of several widely used parallel classification algorithms based on SparkR: Multinomial Naive Bayes, support vector machine (SVM), and Logistic Regression. For SVM and Logistic Regression, an iterative computation mode based on parallel local optimization is adopted on top of the conventional parallelization strategy to further speed up training. Experimental results show that the SparkR-based parallel classification algorithms run about 8 times faster than a Hadoop MapReduce implementation, without losing scalability.
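The abstract does not spell out the local iteration scheme, so the following is only a minimal sketch of one common reading of it: each data partition runs several gradient steps on its own data, and the driver then averages the resulting local models. It is written in plain Python rather than SparkR, the partitions are simulated as in-memory lists, and every name in it (`local_train`, `train`, `make_data`, the step counts and learning rate) is a hypothetical choice for illustration, not the paper's implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def local_train(weights, partition, local_steps, lr):
    # Run several full-batch gradient steps on ONE partition,
    # starting from the current global weights (the "local iteration").
    w = list(weights)
    for _ in range(local_steps):
        grad = [0.0] * len(w)
        for x, y in partition:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(partition) for wi, g in zip(w, grad)]
    return w


def train(partitions, dim, rounds=20, local_steps=5, lr=0.5):
    # Each round: train locally on every partition (map step),
    # then average the local models into a new global model (reduce step).
    w = [0.0] * dim
    for _ in range(rounds):
        local_models = [local_train(w, p, local_steps, lr) for p in partitions]
        w = [sum(ws) / len(local_models) for ws in zip(*local_models)]
    return w


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0


def make_data(n, seed=0):
    # Synthetic, linearly separable data: label is 1 when x1 + x2 > 0.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([1.0, x1, x2], 1 if x1 + x2 > 0 else 0))
    return data


data = make_data(400)
partitions = [data[i::4] for i in range(4)]  # four simulated partitions
weights = train(partitions, dim=3)
accuracy = sum(predict(weights, x) == y for x, y in data) / len(data)
print("training accuracy:", accuracy)
```

The point of the design is that the global model is synchronized once per round instead of once per gradient step, so a distributed run would need fewer communication rounds for the same number of gradient updates. Plain averaging is used here because the simulated partitions have equal size.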
Source
Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》)
CSCD
Peking University Core Journals (北大核心)
2015, No. 11, pp. 1281-1294 (14 pages)
Funding
Jiangsu Province Science and Technology Support Program, No. BE2014131
Keywords
SparkR
classification algorithm
parallelization
local iteration
in-memory computation