Modeling and Evaluating the Intel IMCI vgather Instruction Using Stencils (cited by: 1)
Abstract: IMCI (Initial Many-Core Instructions), the instruction set of the Intel Xeon Phi coprocessor, introduces a hardware-implemented vgather instruction intended to help 512-bit SIMD registers access data at non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the key performance bottlenecks for applications on the Xeon Phi coprocessor. A performance model of vgather can therefore help users understand the performance characteristics of the Xeon Phi coprocessor in depth. Unlike existing approaches that collect statistics by embedding assembly code in program segments, our method uses profiling tools such as PAPI to collect hardware counter results directly as the experimental data for the model. The model is built from the number of Address Generation Interlock (AGI) events, which is taken as the number of vgather instructions, and the average vgather latency estimated from the number of VPU_DATA_READ events. It predicts the total latency caused by vgather in Xeon Phi application code. To validate the prediction, we apply the model to a 3D 7-point (3D7P) stencil code; the model predicts that vgather accounts for about 40% of total kernel time. We then compare this prediction with the kernel time of a vgather-free version implemented with intrinsics, and the comparison shows that the prediction is accurate. Our contribution is a hardware-counter-based performance model of vgather on the Xeon Phi coprocessor, validated with stencils. Based on a comparison with vgather on other platforms, we believe the model can also be applied to Intel CPUs that provide vgather.
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal, 2016, No. 9, pp. 1741-1747 (7 pages)
Funding: National 863 Program of China (2014AA01A302); Japan Society for the Promotion of Science (JSPS) RONPAKU Fellowship
Keywords: performance modeling; vgather; Xeon Phi; hardware performance counters
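The abstract describes the model only in words: a count of AGI events standing in for the number of vgather instructions, combined with an average vgather latency estimated from VPU_DATA_READ events. The sketch below is one plausible reading of that description, not the paper's actual formula. It assumes the total latency is the simple product of the two quantities, and it takes the average latency as an input because the abstract does not say how it is derived from VPU_DATA_READ. All function names are hypothetical.

```python
# Illustrative sketch only. The product form and every name here are
# assumptions; the abstract does not state the model's exact formula.

def predict_vgather_cycles(agi_events: int, avg_latency_cycles: float) -> float:
    """Predicted total cycles spent in vgather.

    agi_events: AGI event count read from a hardware counter (e.g. via PAPI),
        used as a proxy for the number of vgather instructions.
    avg_latency_cycles: average latency per vgather, supplied by the caller
        (the paper estimates it from VPU_DATA_READ; that step is not shown).
    """
    if agi_events < 0 or avg_latency_cycles < 0:
        raise ValueError("counts and latencies must be non-negative")
    return agi_events * avg_latency_cycles


def predicted_share(vgather_cycles: float, total_kernel_cycles: float) -> float:
    """Fraction of total kernel cycles that the model attributes to vgather."""
    if total_kernel_cycles <= 0:
        raise ValueError("total kernel cycles must be positive")
    return vgather_cycles / total_kernel_cycles


def measured_share(time_with_vgather: float, time_without_vgather: float) -> float:
    """Fraction of kernel time saved by a vgather-free (intrinsics) version.

    Comparing this with predicted_share() mirrors the validation step
    described in the abstract.
    """
    if time_with_vgather <= 0:
        raise ValueError("baseline time must be positive")
    return (time_with_vgather - time_without_vgather) / time_with_vgather
```

With made-up inputs (not measurements from the paper) of 1,000,000 AGI events, 20 cycles per vgather, and 50,000,000 total kernel cycles, `predicted_share(predict_vgather_cycles(1_000_000, 20.0), 50_000_000)` gives a share of 0.4, which would then be checked against `measured_share` computed from the timings of the two kernel versions.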