
Performance Analysis of Batch Processing for Big Data on Spark and Flink
Cited by: 6
Abstract: To study the performance differences and similarities between Apache Spark and Apache Flink in batch processing of big data, this paper analyzes the differences between the Spark and Flink engines and compares in detail how the two frameworks run machine learning algorithms. The algorithms compared are support vector machine (SVM), linear regression (LR), and distributed information-theoretic feature selection (FS-DIT). SVM and LR are built into both platforms, while FS-DIT was redesigned to suit the characteristics of each framework. Experimental results for the three algorithms show that Spark outperforms Flink, with a lower overall running time. The performance of Spark's two current machine learning libraries, MLlib and ML, is also analyzed. The study offers some guidance for batch-processing applications on the two platforms, one older and one newer.
Author: MA Li (马黎) (Computer School of Wuhan University, Wuhan 430072, China; Editorial Department of the Journal of Shangqiu Polytechnic, Shangqiu 476000, China)
Source: Journal of China Academy of Electronics and Information Technology (《中国电子科学研究院学报》, Peking University Core Journal), 2018, No. 2, pp. 191-195, 213 (6 pages)
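Funding: Key Scientific Research Project for Higher Education Institutions, Henan Provincial Department of Education (16B120003)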
Keywords: batch processing of big data; performance differences; Apache Spark; Apache Flink; machine learning algorithms
