Performance Comparison and Analysis of Classification Algorithms Based on Spark Platform


Abstract: In view of the rapid development of big data and machine learning technology, three machine learning algorithms, the feedforward artificial neural network, the support vector machine (SVM), and the random forest, are implemented with the MLlib machine learning library on the Spark platform, and their running and classification performance on the big data platform is analyzed and evaluated. The experimental results show that the time consumed by all three algorithms decreases steadily as the number of nodes increases. When the dataset is smaller than 100 MB, the neural network and SVM algorithms achieve the higher speedup; when the dataset is larger than 1 GB, the random forest algorithm achieves a better speedup than the other two. The neural network algorithm has its lowest scaleup at a dataset size of 100 MB, and the SVM algorithm has its lowest scaleup at 500 MB. For datasets larger than 1 GB, the random forest algorithm shows better sizeup than the other two algorithms. Comparing the time efficiency and accuracy of the three classifiers, the SVM algorithm consumes the least time but has the lowest classification accuracy. The neural network algorithm consumes the most time, and its classification accuracy is lower than that of the random forest algorithm. The random forest algorithm has the highest classification accuracy, but its running time is longer than that of the SVM algorithm. The ensemble classification algorithm thus shows good time performance and classification accuracy on the big data platform.
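The abstract reports speedup (acceleration ratio), scaleup (scalability), and sizeup (scale growth) without stating the formulas. The sketch below uses the definitions commonly applied when evaluating parallel algorithms; treating these as the paper's exact definitions is an assumption. The function names are hypothetical, and the timings in the usage section are made-up illustrative values, not results from the paper.

```python
def _check(*times: float) -> None:
    """Reject non-positive running times, which would make the ratios meaningless."""
    for t in times:
        if t <= 0:
            raise ValueError("running times must be positive")


def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same dataset: time on 1 node divided by time on m nodes (ideal value: m)."""
    _check(t_one_node, t_m_nodes)
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """Time for the base dataset on 1 node divided by time for an m-times larger
    dataset on m nodes (ideal value: 1.0)."""
    _check(t_base, t_scaled)
    return t_base / t_scaled


def sizeup(t_base_data: float, t_m_times_data: float) -> float:
    """Fixed node count: time for an m-times larger dataset divided by time for
    the base dataset (smaller growth is better)."""
    _check(t_base_data, t_m_times_data)
    return t_m_times_data / t_base_data


if __name__ == "__main__":
    # Illustrative timings in seconds (assumed, not measured).
    print("speedup:", speedup(120.0, 40.0))   # 1 node vs. 4 nodes, same data
    print("scaleup:", scaleup(120.0, 150.0))  # 1 node/base data vs. 4 nodes/4x data
    print("sizeup: ", sizeup(40.0, 100.0))    # 4 nodes, base data vs. 4x data
```

Under these definitions, a speedup close to the node count and a scaleup close to 1.0 indicate that an algorithm parallelizes well, which is the sense in which the abstract compares the three classifiers.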

Authors: ZHAO Lei (赵蕾); XIA Ji'an (夏吉安); WU Yang (吴洋); CUI Hui (崔辉) (School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing 210023)

Affiliation: School of Computer and Software, Nanjing Vocational University of Industry Technology (南京工业职业技术大学计算机与软件学院)

Source: Computer & Digital Engineering (《计算机与数字工程》), 2024, No. 3, pp. 688-691, 704 (5 pages)

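Funding: Supported by the 2020 China University Industry-University-Research Innovation Fund project (No. 2020HYB02005), the 2022 Jiangsu Province Industry-University-Research Cooperation project (No. BY2022560), and the 2020 Jiangsu Province Industrial Software Engineering Technology Research project (No. ZK20-04-12).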

Keywords: big data; Hadoop framework; Spark framework; machine learning; performance evaluation

Classification: TP39 [Automation and Computer Technology: Computer Application Technology]; S3 [Agricultural Science: Agronomy]

