Timing Method and Evaluation Metrics for CPU Performance Variation Detections (处理器性能波动检测的计时方法及评价指标)

Abstract: Performance variation in supercomputers typically appears as software running at inconsistent speeds on the same hardware, or at different speeds on identically configured hardware. Among the many sources of performance variation, CPU performance variation is one of the most insidious and harmful: even a tiny variation can sharply degrade the overall performance of a supercomputer. Research on CPU performance variation detection currently faces two challenges.

First, existing profiling tools such as PAPI struggle to identify tiny processor performance variations. Such variations can be as small as the nanosecond level, so a timing method needs high precision and sensitivity to detect them accurately. However, researchers have found that when tools like PAPI and LIKWID are used for timing measurements in real applications, they incur large overheads and fluctuations that can reach tens of thousands of cycles, which makes it difficult to capture nanosecond-level run time changes.

Second, existing methods struggle to objectively evaluate the performance variation detection capabilities of different tools, and quantitative evaluation metrics are lacking. A single performance variation detection consists of thousands of timing results. The distribution of these measurements reflects variations in the run time of the tested code, fluctuations in the overhead of the timing method itself, and the impact of the timing operation on the tested code. However, current methods cannot determine whether the timing results truly reflect the distribution of the code's run time.

To address the first problem, this study first focused on PAPI, currently the most commonly used performance measurement tool. By simulating the cache environment of real applications, we measured and analyzed PAPI's timing fluctuations under different cache states for the first time. The experimental results showed that when measuring the run time of an unchanging computation, PAPI's measurements exhibited significant long-tail deviations. Performance counter analysis indicated that the main causes of PAPI's timing fluctuations were timing overhead, operating system noise, out-of-order execution, and cache misses. This study then designed a serialized barrier timing method, based on the memory barrier and serializing instructions of the x86 and Armv8 instruction sets, to suppress timing fluctuations. In comparative experiments, the amplitude of timing fluctuations of the serialized barrier timing method was significantly lower than that of PAPI.

To address the second problem, this study combined experiments and modeling to analyze, qualitatively and quantitatively, the sources of timing fluctuations and their impact on measured values. For the first time, this paper proposed cross-platform precision and sensitivity indicators for timing methods, along with evaluation methods, aimed at detecting processor performance variation. In this framing, the shorter the time that can be accurately measured, the higher the precision; the smaller the amplitude of performance variation that can be accurately distinguished, the higher the sensitivity. The two indicators quantitatively evaluate a timing method's ability to measure minute time fluctuations, thereby providing a basis for the detection and determination of performance variation.

According to our evaluations on the Intel Xeon 6248 and Huawei Kunpeng 920-6426 processors, the serialized barrier timing method was 2.2 to 30.2 times more precise and 1.9 to 44.8 times more sensitive than PAPI, and it was able to detect nanosecond-level performance variation.
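The abstract does not list the instruction sequences behind the serialized barrier timing method. As a rough illustration of the general idea only, the sketch below fences a timestamp read so that out-of-order execution cannot move it across the measured region. The function names `serialized_ticks` and `timing_spread` are hypothetical. The specific choices (LFENCE around RDTSC on x86, DSB and ISB around a read of `CNTVCT_EL0` on Armv8, and a coarse `timespec_get` fallback elsewhere) are assumptions and not the authors' implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read a timestamp with fences on both sides so the
 * read is ordered against the surrounding instructions. */
static inline uint64_t serialized_ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* Assumed sequence: LFENCE waits for earlier instructions to complete,
     * RDTSC reads the time-stamp counter, and the second LFENCE keeps later
     * instructions from starting before the read. */
    __asm__ __volatile__("lfence\n\trdtsc\n\tlfence"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t t;
    /* Assumed sequence: DSB completes outstanding memory accesses, ISB
     * flushes the pipeline, then the virtual counter is read. */
    __asm__ __volatile__("dsb sy\n\tisb\n\tmrs %0, cntvct_el0\n\tisb"
                         : "=r"(t) : : "memory");
    return t;
#else
    /* Coarse fallback for other targets: wall-clock nanoseconds, no fences. */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: time an empty region n times (n >= 1) and report the
 * smallest and largest delta, a crude view of the method's own fluctuation. */
static void timing_spread(size_t n, uint64_t *min_out, uint64_t *max_out) {
    uint64_t lo = UINT64_MAX, hi = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t0 = serialized_ticks();
        /* The code under test would go here. */
        uint64_t t1 = serialized_ticks();
        uint64_t d = t1 - t0;
        if (d < lo) lo = d;
        if (d > hi) hi = d;
    }
    *min_out = lo;
    *max_out = hi;
}
```

Calling `timing_spread(1000, &min, &max)` and comparing `max - min` across timing methods gives a first impression of how much of a measured distribution comes from the timer itself. The units are counter ticks, which are not CPU cycles on every platform.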

Authors: LIAO Qiu-Cheng (廖秋承); ZUO Si-Cheng (左思成); WANG Yi-Chao (王一超); LIN Xin-Hua (林新华) (Center for High Performance Computing, Shanghai Jiao Tong University, Shanghai 200240)

Affiliation: Center for High Performance Computing, Shanghai Jiao Tong University

Source: Chinese Journal of Computers (《计算机学报》), 2024, No. 2, pp. 456-472 (17 pages); indexed in EI, CSCD, and the Peking University Core Journals list

Funding: Supported by the National Natural Science Foundation of China (62072300).

Keywords: high performance computing; microarchitecture; performance variation; performance analysis; performance evaluation

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]
