Modeling and evaluating Intel IMCI vgather instruction using stencils (cited by: 1)

Abstract: IMCI (Initial Many Core Instructions), the instruction set of the Intel Xeon Phi coprocessor, introduces a hardware-implemented vgather instruction that helps 512-bit SIMD registers load data from non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the key performance bottlenecks for applications on Xeon Phi. A performance model of vgather can therefore help users understand the performance characteristics of the coprocessor in depth. Unlike existing approaches that gather statistics by embedding assembly code in program segments, this work uses profiling tools such as PAPI to collect hardware counter results directly as the experimental data for the model. The model is built from two quantities: the number of Address Generation Interlock (AGI) events, which is profiled as the count of vgather instructions, and the average vgather latency, which is estimated from the number of VPU_DATA_READ events. From these it predicts the total latency caused by vgather in Xeon Phi application code. To validate its accuracy, the model is applied to a 3D 7-point (3D7P) stencil code, where it predicts that vgather accounts for about 40% of total kernel time. This prediction is checked against the kernel time of a vgather-free version implemented with intrinsics, and the comparison shows that the prediction is accurate. The contribution is a performance model for vgather on Xeon Phi, built from hardware counter statistics and validated with stencils. Based on a comparison with vgather on other platforms, the authors consider the model also applicable to Intel CPUs that provide vgather.
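The abstract names a 3D 7-point stencil and a model that combines a vgather count with an average latency, but it gives neither the kernel source nor the exact formula. The following is a minimal sketch under stated assumptions, not the authors' code: the function names, the coefficients `c0` and `c1`, and the product form "count times average latency" are all hypothetical and chosen only to make the idea concrete.

```c
#include <assert.h>
#include <stddef.h>

/* Flatten (x, y, z) into a linear index; x is the fastest-varying dimension. */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny) {
    return (z * ny + y) * nx + x;
}

/* One sweep of a 3D 7-point stencil over the interior points:
 * out = c0 * center + c1 * (sum of the six face neighbors).
 * Boundary points of `out` are left untouched.
 * c0 and c1 are illustrative coefficients, not taken from the paper. */
void stencil_3d7p(const double *in, double *out,
                  size_t nx, size_t ny, size_t nz,
                  double c0, double c1) {
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    for (size_t z = 1; z + 1 < nz; ++z)
        for (size_t y = 1; y + 1 < ny; ++y)
            for (size_t x = 1; x + 1 < nx; ++x) {
                size_t c = idx(x, y, z, nx, ny);
                out[c] = c0 * in[c]
                       + c1 * (in[c - 1] + in[c + 1]              /* x neighbors */
                             + in[c - nx] + in[c + nx]            /* y neighbors */
                             + in[c - nx * ny] + in[c + nx * ny]); /* z neighbors */
            }
}

/* Assumed model form: total vgather cycles = vgather count * average latency.
 * Per the abstract, the count comes from AGI events and the average latency
 * is estimated from VPU_DATA_READ events; how that estimate is computed is
 * not specified there, so it is passed in as a plain number here. */
double vgather_total_cycles(unsigned long long agi_events,
                            double avg_latency_cycles) {
    return (double)agi_events * avg_latency_cycles;
}

/* Share of kernel time attributed to vgather (0.4 would correspond to 40%). */
double vgather_fraction(double vgather_cycles, double kernel_cycles) {
    assert(kernel_cycles > 0.0);
    return vgather_cycles / kernel_cycles;
}
```

In use, the two counter values would be read with a profiling tool such as PAPI around the stencil sweep, and the resulting fraction compared against the time saved by a vgather-free variant, which is the validation step the abstract describes.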

Authors: 林新华, 王一超, 秦强, 李硕, 文敏华, 松岡聡

Affiliations: High Performance Computing Center, Shanghai Jiao Tong University; Tokyo Institute of Technology; Intel Corporation

Source: Computer Engineering & Science (《计算机工程与科学》), indexed in CSCD and the Peking University Core Journals list, 2016, No. 9, pp. 1741-1747 (7 pages)

Funding: National 863 Program of China (2014AA01A302); Japan Society for the Promotion of Science RONPAKU Fellowship

Keywords: performance modeling; vgather; Xeon Phi; hardware performance counters

Classification: TP303 [Automation and Computer Technology: Computer System Architecture]
