Abstract
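The IMCI instruction set of the Intel Xeon Phi coprocessor introduces a hardware-implemented vgather instruction, intended to help 512-bit SIMD registers access data at non-contiguous memory addresses. However, experimental results show that vgather is very likely to become one of the key performance bottlenecks for applications on the Xeon Phi coprocessor. A performance model of vgather can therefore help users understand the performance characteristics of the Xeon Phi coprocessor in depth. Unlike existing approaches, which gather statistics by embedding assembly code in program segments, this paper uses performance analysis tools such as PAPI to collect hardware counter results directly as the experimental data for the model. The performance model is built from the number of AGI events and the average latency caused by vgather, which is estimated from the number of VPU_DATA_READ events. The model can predict the total latency caused by vgather in Xeon Phi application code. Finally, to verify the accuracy of its predictions, the model is applied to a 3D 7-point stencil application code; the prediction shows that vgather accounts for about 40% of total computation time. This result is then compared against the computation time measured after vgather is removed by using intrinsics, and the comparison shows that the model's prediction is accurate. Based on these results, a performance model of vgather on the Xeon Phi coprocessor has been constructed from hardware counter statistics. In addition, by comparison with vgather on other platforms, the model is considered applicable to Intel CPU platforms that also provide vgather.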
Vgather is a hardware-implemented vector instruction introduced by the Intel Initial Many-Core Instructions (IMCI) for Xeon Phi. It is intended to help SIMD registers access data at non-contiguous memory locations. However, experimental results show that it can also be one of the key performance bottlenecks on Xeon Phi. We model the performance of vgather by using the profiling tool PAPI to collect the results of hardware performance counters directly. Address Generation Interlock (AGI) events are profiled as the number of vgather instructions, and the average latency of vgather is estimated from VPU_DATA_READ events; from these two quantities we model the total latency of vgather instructions. 3D 7-point stencils are used to evaluate our model, and the results show that vgather accounts for nearly 40% of total kernel time. We implement a vgather-free version with intrinsics to validate the model. Our contributions include modeling the Intel IMCI vgather instruction with hardware counters and validating the model with stencils. The model may also be applicable to CPUs.
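The abstract describes the model only in outline: the AGI event count stands in for the number of vgather instructions, an average vgather latency is estimated from VPU_DATA_READ events, and the two together give the total vgather latency. The sketch below illustrates that arithmetic under a simplifying assumption that the total is a plain product of count and average latency. The class name, field names, and numbers are hypothetical placeholders, not values or code from the paper, and the step that derives the average latency from VPU_DATA_READ is not given in the abstract, so it is taken here as an input.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical container for counter-derived inputs (names are assumptions)."""
    agi_events: int             # AGI event count, used as a proxy for the number of vgather instructions
    avg_vgather_latency: float  # cycles per vgather; assumed to be estimated elsewhere from VPU_DATA_READ
    kernel_cycles: float        # total cycles spent in the measured kernel


def vgather_total_latency(sample: CounterSample) -> float:
    """Total vgather latency in cycles, assuming count times average latency."""
    return sample.agi_events * sample.avg_vgather_latency


def vgather_share(sample: CounterSample) -> float:
    """Fraction of kernel time attributed to vgather under the same assumption."""
    if sample.kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return vgather_total_latency(sample) / sample.kernel_cycles


if __name__ == "__main__":
    # Made-up illustrative numbers, not measurements from the paper.
    sample = CounterSample(agi_events=2_000_000,
                           avg_vgather_latency=5.0,
                           kernel_cycles=40_000_000.0)
    print(vgather_total_latency(sample))
    print(vgather_share(sample))
```

A predicted share of this kind is what the abstract compares against the time saved by the vgather-free intrinsics version of the stencil.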
Source
《计算机工程与科学》
CSCD
Peking University Core Journal (北大核心)
2016, No. 9, pp. 1741-1747 (7 pages)
Computer Engineering & Science
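Funding
National High Technology Research and Development Program of China (863 Program) (2014AA01A302)
Supported by the Japan Society for the Promotion of Science (JSPS) RONPAKU Fellowship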