
Parallelization of Classification Algorithms Based on SparkR (cited by: 14)
Abstract In recent years, parallel algorithms for big-data machine learning and data mining have become an important research topic in the big data field. Spark provides a programming interface called SparkR, which lets data analysts in general application domains use the familiar R language to carry out data analysis and computation on the Spark platform. This paper designs and implements several widely used parallel machine learning classification algorithms based on SparkR, including multinomial naive Bayes, support vector machine (SVM), and logistic regression. For SVM and logistic regression, an iterative computation mode with parallel local optimization is adopted on top of the conventional parallelization strategy to further speed up training. Experimental results show that the SparkR-based parallel classification algorithms run about 8 times faster than the Hadoop MapReduce implementations, without losing scalability.
Source Journal of Frontiers of Computer Science and Technology (计算机科学与探索), CSCD, Peking University Core Journal, 2015, No. 11, pp. 1281-1294 (14 pages)
Funding Jiangsu Province Science and Technology Support Program, No. BE2014131
Keywords SparkR; classification algorithm; parallelization; local iteration; in-memory computation
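
The abstract names a "parallel local optimization" iteration mode for SVM and logistic regression but gives no details. The sketch below is one common reading of that idea, not the paper's actual algorithm: each data partition runs a few gradient steps on its own data, and the resulting models are averaged once per round. It uses plain Python rather than SparkR, simulates the partitions sequentially, and every name in it (`local_train`, `train`, `make_data`, the step counts and learning rate) is a hypothetical choice made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_train(weights, partition, local_steps, lr):
    """Run a few gradient-descent steps using one partition's data only."""
    w = list(weights)
    for _ in range(local_steps):
        grad = [0.0] * len(w)
        for x, y in partition:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(partition) for wi, g in zip(w, grad)]
    return w


def train(partitions, dim, rounds=20, local_steps=5, lr=0.5):
    """Each round: optimize locally on every partition, then average the models."""
    w = [0.0] * dim
    for _ in range(rounds):
        # On a cluster this step would run in parallel, one task per partition.
        local_models = [local_train(w, p, local_steps, lr) for p in partitions]
        w = [sum(m[j] for m in local_models) / len(local_models) for j in range(dim)]
    return w


def accuracy(w, data):
    hits = sum((sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5) == (y == 1)
               for x, y in data)
    return hits / len(data)


def make_data(n, seed=0):
    # Hypothetical toy data: label is 1 when x1 + x2 > 0; the last feature is a bias term.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([x1, x2, 1.0], 1 if x1 + x2 > 0 else 0))
    return data


data = make_data(400)
partitions = [data[i::4] for i in range(4)]  # four simulated partitions
model = train(partitions, dim=3)
```

The point of the local steps is to cut the number of synchronization rounds: with `local_steps=5`, the models are combined once per five gradient updates instead of after every update, which is the kind of saving an in-memory platform such as Spark can exploit.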