Abstract
To study the performance differences and similarities between Apache Spark and Apache Flink in batch processing of big data, this paper analyzes the differences between the Spark and Flink engines and compares in detail how the two frameworks run machine learning algorithms. The algorithms compared are the support vector machine (SVM) algorithm, the linear regression (LR) algorithm, and the distributed information-theoretic feature selection (FS-DIT) algorithm. SVM and LR are built into both platforms, while FS-DIT was redesigned to fit the characteristics of each framework. Experimental results for the three algorithms show that Spark outperforms Flink, with a lower overall running time. The performance of Spark's two current machine learning libraries, MLlib and ML, is also analyzed. The study offers guidance for batch-processing applications on both the older and the newer platform.
Author
MA Li (马黎) (Computer School of Wuhan University, Wuhan 430072, China; Editorial Department of the Journal of Shangqiu Polytechnic, Shangqiu 476000, China)
Source
《中国电子科学研究院学报》
Peking University Core Journal (北大核心)
2018, No. 2, pp. 191-195, 213 (6 pages in total)
Journal of China Academy of Electronics and Information Technology
Funding
Key Scientific Research Project of Colleges and Universities, Education Department of Henan Province (16B120003)