Impact of Linux kernel parameters on Spark workloads (Linux内核参数对Spark负载性能影响的研究)

Cited by: 3


Abstract: Spark performance is an increasingly active research topic, but most tuning strategies target the application layer rather than the system layer. As the first layer of software above the hardware, the operating system plays a fundamental role in how well hardware performance is realized. The Linux kernel exposes a rich set of parameters as an interface for performance tuning, yet in practice these parameters are underused: most users keep the system defaults instead of adjusting them to their environment. Our experiments show that the defaults are not always the best choice and are sometimes the worst. We define the concept of an "influence ratio" and, based on it, propose a method for understanding how a parameter affects Spark applications by analyzing the execution of kernel functions. Given Spark's in-memory computing model, we analyze how kernel parameters related to two memory-centric features, transparent huge pages and NUMA, affect the performance of several typical Spark workloads, and draw some conclusions. We hope the analysis and conclusions offer a useful reference for setting kernel parameters sensibly on Spark platforms.
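The abstract does not name the individual kernel parameters it studies. As a rough illustration only, the sketch below reads the current transparent huge page and automatic NUMA balancing settings from their standard locations on mainline Linux. Treating these three files as the relevant knobs is an assumption, and the helper names `active_choice` and `read_setting` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Standard sysfs/procfs locations on mainline Linux. Whether these are the
# exact parameters examined in the paper is an assumption.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
THP_DEFRAG = Path("/sys/kernel/mm/transparent_hugepage/defrag")
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def active_choice(text: str) -> Optional[str]:
    """Return the bracketed option from a line such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_setting(path: Path) -> Optional[str]:
    """Read a kernel setting, or return None if it is missing or unreadable."""
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    # THP files mark the active value with brackets; plain sysctls do not.
    return active_choice(text) or text


if __name__ == "__main__":
    settings = [
        ("thp.enabled", THP_ENABLED),
        ("thp.defrag", THP_DEFRAG),
        ("numa_balancing", NUMA_BALANCING),
    ]
    for name, path in settings:
        print(f"{name}: {read_setting(path)}")
```

Recording these values on each node before a benchmark run makes it clear whether a measurement was taken under the distribution defaults or under a tuned configuration.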

Authors: Wang Li (王利), Wang Jing (王晶), Zhang Weigong (张伟功), Qiu Keni (邱柯妮), Lu Kezhong (陆克中)

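Affiliations: Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University; College of Information Engineering, Capital Normal University; Beijing Engineering Research Center of High Reliable Embedded System, Capital Normal University; College of Computer Science and Software Engineering, Shenzhen University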

Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the PKU Core list (北大核心), 2017, No. 7, pp. 1219-1226 (8 pages)

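Funding: National Natural Science Foundation of China (61472260, 61402302, 61502321); Beijing Innovation Team Program (IDHT20150507); Beijing Science and Technology Program (KM201610028016); Guangdong Province-Ministry Industry-University-Research Project (2013B090500055); Shenzhen Basic Research Discipline Layout Project (JCYJ20150529164656096); National 863 Program (2015AA015305)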

Keywords: big data; Spark; Linux; huge page; NUMA

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

