Analytical Tile Size Model for Matrix Multiplication for Sunway Heterogeneous Many-core Architecture


Abstract: Compiler optimization of matrix multiplication eases the program-optimization difficulty caused by the complex architecture and memory hierarchy of Sunway heterogeneous many-core processors, and in that process the loop tile size strongly affects how well the optimization works. This paper proposes an analytical tile size model for matrix multiplication on the latest-generation Sunway SW26010-Pro heterogeneous many-core processor, aiming to support the computation decomposition step of compiler optimization for matrix multiplication. By analyzing the storage space and the data transfer process on the Sunway processor, the model determines the optimal loop tile size and predicts both the data transfer time and the program execution time. Tests show that the model obtains the optimal tile size under the storage space constraint and that the average accuracy of the execution time prediction reaches 96.87%.
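The general idea of choosing a tile size under a local-memory limit can be illustrated with a minimal Python sketch. This is not the paper's model: the footprint formula, the traffic formula (each C tile stays resident while tiles of A and B are streamed), the exhaustive search, and every name and number below (`local_bytes`, `candidates`, `bandwidth_bytes_per_s`, the 64 KiB capacity) are illustrative assumptions, not values taken from the paper or from SW26010-Pro documentation.

```python
from itertools import product


def footprint_bytes(tm, tn, tk, elem_bytes=8):
    """Local-memory bytes needed to hold one tile each of A, B and C."""
    return (tm * tk + tk * tn + tm * tn) * elem_bytes


def transfer_elems(M, N, K, tm, tn, tk):
    """Elements moved between main and local memory (assumed scheme):
    each C tile is loaded and stored once, and stays resident while
    the k loop streams the matching tiles of A and B."""
    tiles_ij = (M // tm) * (N // tn)
    per_ij = 2 * tm * tn + (K // tk) * (tm * tk + tk * tn)
    return tiles_ij * per_ij


def transfer_seconds(elems, bandwidth_bytes_per_s, elem_bytes=8):
    """Naive time estimate: bytes moved divided by a hypothetical bandwidth."""
    return elems * elem_bytes / bandwidth_bytes_per_s


def best_tile(M, N, K, local_bytes, candidates, elem_bytes=8):
    """Exhaustively pick the feasible tile with the least modelled traffic.

    Returns (traffic_in_elements, (tm, tn, tk)), or None if nothing fits.
    """
    best = None
    for tm, tn, tk in product(candidates, repeat=3):
        if M % tm or N % tn or K % tk:
            continue  # keep the sketch simple: exact divisors only
        if footprint_bytes(tm, tn, tk, elem_bytes) > local_bytes:
            continue  # tile set does not fit in local memory
        cost = transfer_elems(M, N, K, tm, tn, tk)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best


if __name__ == "__main__":
    # 64 KiB is a placeholder capacity, not a hardware figure.
    print(best_tile(512, 512, 512, 64 * 1024, [8, 16, 32, 64, 128]))
```

Under this traffic formula the modelled volume depends only on `tm` and `tn`, so `tk` matters only through the footprint; a more realistic model would also account for per-transfer startup cost and overlap of transfer with computation.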

Authors: TAO Xiaohan; PANG Jianmin; ZHU Yu; WANG Boyang; XU Jinlong (Information Engineering University, Zhengzhou 450001, China; Zhengzhou University, Zhengzhou 450001, China)

Affiliations: Information Engineering University; Zhengzhou University

Source: Journal of Information Engineering University, 2023, No. 1, pp. 65-71 (7 pages)

Funding: National Natural Science Foundation of China (61702546).

Keywords: heterogeneous many-core architecture; matrix multiplication; tile size; analytical model

Classification: TP314 [Automation and Computer Technology: Computer Software and Theory]


