Performance Comparison and Analysis of Classification Algorithms Based on Spark Platform
Abstract: With big data and machine learning technologies developing rapidly, the MLlib machine learning library on the Spark platform is used to implement three machine learning algorithms: a feedforward artificial neural network, a support vector machine (SVM), and a random forest. The running performance and classification performance of the three algorithms on the big data platform are then analyzed and evaluated. The experimental results show that, as the number of nodes increases, the time consumed by all three algorithms on the big data platform gradually decreases. When the dataset is smaller than 100 MB, the neural network and the SVM achieve the higher speedup; when the dataset is larger than 1 GB, the random forest achieves a better speedup than the other two algorithms. The neural network shows its lowest scalability on the 100 MB dataset, and the SVM shows its lowest scalability on the 500 MB dataset. When the dataset is larger than 1 GB, the random forest shows better scale growth (sizeup) than the other two algorithms. Comparing the time efficiency and accuracy of the three classifiers, the SVM consumes the least time but has the lowest classification accuracy. The neural network consumes the most time, and its classification accuracy is lower than that of the random forest. The random forest has the highest classification accuracy, but its running time is longer than that of the SVM. The ensemble classification algorithm therefore shows good time performance and good classification accuracy on the big data platform.
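The abstract compares the algorithms by speedup, scalability, and scale growth but does not state the formulas. The sketch below assumes the definitions commonly used for parallel algorithms (speedup: same data, more nodes; scaleup: data and nodes grown by the same factor; sizeup: same nodes, more data). The paper's exact definitions may differ, the function names are illustrative, and the timings in the usage part are hypothetical placeholders, not results from the paper.

```python
def _check_positive(*times):
    """Reject non-positive running times, which would make the ratios meaningless."""
    for t in times:
        if t <= 0:
            raise ValueError("running times must be positive")


def speedup(t_one_node, t_p_nodes):
    """Assumed definition: time on 1 node / time on p nodes, same dataset."""
    _check_positive(t_one_node, t_p_nodes)
    return t_one_node / t_p_nodes


def scaleup(t_base, t_scaled):
    """Assumed definition: time for data D on 1 node / time for p*D on p nodes.

    A value close to 1.0 means the algorithm scales well.
    """
    _check_positive(t_base, t_scaled)
    return t_base / t_scaled


def sizeup(t_base_data, t_enlarged_data):
    """Assumed definition: time for m*D / time for D, same number of nodes.

    A value below m means time grows more slowly than the data.
    """
    _check_positive(t_base_data, t_enlarged_data)
    return t_enlarged_data / t_base_data


if __name__ == "__main__":
    # HYPOTHETICAL timings in seconds, keyed by node count (not from the paper).
    timings = {1: 120.0, 2: 66.0, 4: 40.0}
    for nodes, seconds in timings.items():
        print(f"{nodes} node(s): speedup = {speedup(timings[1], seconds):.2f}")
```

Under these assumed definitions, a speedup that grows with the node count corresponds to the abstract's observation that running time falls as nodes are added.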
Authors: ZHAO Lei (赵蕾); XIA Ji'an (夏吉安); WU Yang (吴洋); CUI Hui (崔辉) (School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing 210023)
Source: Computer & Digital Engineering (《计算机与数字工程》), 2024, No. 3, pp. 688-691, 704 (5 pages)
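Funding: Supported by the 2020 China University Industry-University-Research Innovation Fund (No. 2020HYB02005), the 2022 Jiangsu Province Industry-University-Research Cooperation Project (No. BY2022560), and the 2020 Jiangsu Province Industrial Software Engineering Technology Research Project (No. ZK20-04-12).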
Keywords: big data; Hadoop framework; Spark framework; machine learning; performance evaluation