Node-level Memory Access Optimization on Intel Knights Corner

Cited by: 2
Abstract: Traditional programming optimization (TPO) has limited effect on Intel Knights Corner (KNC), so we propose memory access optimization (MAO) for KNC. Applying MAO to a version of Diffusion 3D that had already undergone TPO improved its performance by a further 39.1%. The paper makes two contributions: 1) it proposes MAO and argues that MAO is indispensable on KNC, with TPO+MAO being the path to Ninja performance, that is, the best optimized performance; 2) it finds that, for stencil code, intrinsic-based MAO is more efficient than compiler-based MAO. These findings should inform the optimization of large-scale applications on KNC.
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 11, pp. 37-42 (6 pages)
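Funding: National High Technology Research and Development Program of China (863 Program), project "Research on Key Technologies for Application Service Optimization in High-Performance Computing Environments"; Japan Society for the Promotion of Science RONPAKU Fellowship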
Keywords: traditional programming optimization (TPO), Intel Knights Corner (KNC), memory access optimization (MAO), Ninja performance
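
For readers unfamiliar with the kind of stencil code the abstract refers to, the following is a minimal sketch of one time step of a 7-point 3D diffusion stencil in plain C. It is an illustration only, not the paper's Diffusion 3D code: the function name `diffusion3d_step`, the helper `idx`, the x-fastest memory layout, and the two coefficients `cc` and `cn` are all assumptions made for this example. The abstract does not describe what MAO consists of, so the sketch shows only a scalar baseline kernel of the sort that either a compiler or hand-written intrinsics would then be used to optimize.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an nx*ny*nz grid with x varying fastest
 * (an assumed layout, chosen for this sketch). */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny) {
    return (z * ny + y) * nx + x;
}

/* One explicit time step of a 7-point diffusion stencil.
 * Interior points are updated from the centre value and its six face
 * neighbours; boundary points are copied unchanged.
 * cc weights the centre point and cn weights each neighbour; both are
 * hypothetical parameters, not values taken from the paper. */
void diffusion3d_step(const float *in, float *out,
                      size_t nx, size_t ny, size_t nz,
                      float cc, float cn) {
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    const size_t plane = nx * ny; /* stride between consecutive z planes */

    for (size_t z = 0; z < nz; ++z) {
        for (size_t y = 0; y < ny; ++y) {
            for (size_t x = 0; x < nx; ++x) {
                const size_t c = idx(x, y, z, nx, ny);
                const int boundary =
                    x == 0 || x == nx - 1 ||
                    y == 0 || y == ny - 1 ||
                    z == 0 || z == nz - 1;
                if (boundary) {
                    out[c] = in[c];
                    continue;
                }
                /* Unit-stride access along x; strided access along y and z. */
                out[c] = cc * in[c]
                       + cn * (in[c - 1]     + in[c + 1]
                             + in[c - nx]    + in[c + nx]
                             + in[c - plane] + in[c + plane]);
            }
        }
    }
}
```

Each output point reads seven input values, three of them contiguous in memory and four at strides of `nx` or `nx * ny`, and performs only a handful of arithmetic operations on them. Because so little computation is done per value loaded, kernels of this shape tend to be limited by memory access rather than arithmetic.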