期刊文献+

Linux内核参数对Spark负载性能影响的研究 被引量:3

Impact of Linux kernel parameters on Spark workloads
下载PDF
导出
摘要 关于Spark性能的研究目前正在成为热点,但调优策略多位于应用层,而不是系统层。操作系统作为硬件之上的第一层软件,对硬件性能发挥起着根本作用。Linux内核提供了丰富的参数作为优化性能的接口,但实际中,这些参数的作用并没有得到充分发挥。人们更多是采用系统默认值,而不是根据具体环境进行调整。然而本文实验发现,系统默认值并不一定是最好的选择,有时甚至是最坏的。定义了"影响比"这一概念,并基于此概念提出了一种通过分析内核函数的执行情况来认识参数对Spark应用影响的方法。针对Spark内存计算的特点,从大页、NUMA这两个与使用内存紧密相关的方面分析了相关内核参数对几种典型Spark负载的性能影响,并由此得出一些结论。希望本文的分析和结论可以为Spark平台合理设置内核参数提供一些参考。 Research on the performance of Spark becomes a hot topic, however, optimization strategies are mostly used on the application level instead of system level. As the first software above hard- ware, the operating system plays a fundamental role in the performance of hardware. The Linux kernel provides abundant parameters as the interface to optimize the performance of the system. However, in practice, kernel parameters have not fully played their roles. Most people use their default values rather than change them to fit the specific environment. However, our experiments prove that the default values are not always the best choice, and sometimes it is even the worst. We define the concept of "influ- ence ratio", and put forward a method based on the concept to understand the influence of parameters on Spark applications by analyzing the kernel functions. According to the features of the memory computing of Spark, we analyze the influence of Linux kernel parameters on several typical Spark workloads from the aspects of Transparent Huge Page and NUMA, which closely relates to the use of memory, and then give some conclusions. We hope that the analysis and conclusions can provide some experience of tuning kernel parameters reasonably for the Spark platform.
出处 《计算机工程与科学》 CSCD 北大核心 2017年第7期1219-1226,共8页 Computer Engineering & Science
基金 国家自然科学基金(61472260 61402302 61502321) 北京市创新团队计划(IDHT20150507) 北京市科技计划(KM201610028016) 广东省省部产学研项目(2013B090500055) 深圳市基础研究学科布局项目(JCYJ20150529164656096) 国家863计划(2015AA015305)
关键词 大数据 SPARK LINUX 大页 NUMA big data spark Linux huge page NUMA
  • 相关文献

参考文献1

二级参考文献17

  • 1White T. Hadoop: The definitive guide[J]. O'reilly Media Inc Gravenstein Highway North,2010,215(11):1-4.
  • 2Lakshman A,Malik P. Cassandra..A decentralized structured storage system[J]. Acre Sigops Operating Systems Review, 2010,44(2) :35-40.
  • 3Zaharia M,Chowdhury M,Franklin M J,et al. Spark:Cluster computing with working sets[C]//Proc of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010:1765- 1773.
  • 4Seo S, Jang I, Woo K, et al. HPMR: Prefetching and pre- shuffling in shared MapReduce computation envlronment[C] //Proc of the 2009 IEEE International Conference on Cluster Computing, 2009 : 1-8.
  • 5Jiang D,Ooi B C, Shi L, et al. The performance of MapRe- duce:An in-depth study[J]. Proceedings of the VLDB En- dowment, 2010,3 (12) : 472-483.
  • 6Dittrich J. Hadoopq-q- :Making a yellow elephant run like a cheetah (without it even noticing)[J]. Proceedings of the VLDB Endowment, 2010,3 (12) : 518-529.
  • 7Shivnath B. Towards automatic optimization of MapReduce programs[C]//Proc of the 1st ACM Symposium on Cloud Computing, 2010 : 137-142.
  • 8Herodotou H,Lim H, Luo G, et al. Starfish: A self-tuning system for big data analytics[C]//Proc of the 5th Cidr Conf, 2011 : 261-272.
  • 9Shi Ju-wei,Zhou Jia, Lu Jia-heng, et al. MRTuner:A toolkit to enable holistic optimization for MapReduce )obs[C]//Proc of the VLDB Endowment, 2014,7(13) : 1319-1330.
  • 10Aaron D, Andrew O. Optimizing shuffle performance in spark [R]. CA: Berkeley-Department of Electrical Engineering and Computer Sciences, University of California, 2033.

共引文献21

同被引文献8

引证文献3

二级引证文献3

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部