Design Tradeoffs of High Performance DSPs for General-Purpose HPC (面向通用HPC的高性能DSP设计权衡)

Cited by: 4


Abstract: GPUs, which deliver TFLOPS-level computing capability, are used in high-performance computing (HPC) to accelerate parallel computation. However, the low peak-performance utilization and low power efficiency of GPUs have become bottlenecks to further improvement of system performance. To address this problem, the authors investigate introducing high-performance DSPs into general-purpose HPC. To support general-purpose HPC efficiently, the paper proposes an architectural framework for high-performance DSPs and, by mapping the GotoBLAS library onto that architecture, builds a performance model of GEMM on it. The authors study the main factors that affect GEMM efficiency, including performance, memory hierarchy, core size, and the number of cores. Several guiding conclusions are summarized for building DSPs that are efficient for general-purpose HPC. Experimental results show that near-peak performance can be achieved on a TFLOPS DSP at as little hardware cost as possible.

Authors: Zhang Kai (张凯), Chen Shuming (陈书明), Wang Yaohua (王耀华), Ning Xi (宁希)

Affiliation: School of Computer Science, National University of Defense Technology (国防科学技术大学计算机学院)

Source: Chinese Journal of Computers (《计算机学报》), 2013, No. 4, pp. 790-798 (9 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

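Funding: Supported by the National Natural Science Foundation of China (60906014, 61070036); the Joint Doctoral Supervisor Group Research Fund for High Performance Computing, National University of Defense Technology; and the Doctoral Program Foundation of the Ministry of Education (20094307110009).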

Keywords: high performance computing (HPC); matrix multiplication (GEMM); digital signal processor (DSP); model; design tradeoffs

Classification number: TP302 [Automation and Computer Technology: Computer System Architecture]


References (23)

1. Esmaeilzadeh Hadi, Blem Emily, Amant Renee St, Sankaralingam Karthikeyan, Burger Doug. Dark silicon and the end of multicore scaling//Proceedings of the ACM/IEEE International Symposium on Computer Architecture. San Jose, USA, 2011: 365-376.
2. Texas Instruments Incorporated. TMS320C66x multicore DSPs for high-performance computing. USA: TI SPRT619, 2011.
3. Igual Francisco D, Ali Murtaza, Friedmann Arnon, Stotzer Eric, Wentz Timothy, van de Geijn Robert. Unleashing DSPs for general-purpose HPC. USA: TI FLAME Working Note #61, Feb., 2012.
4. Woh Mark, Seo Sangwon, Mahlke Scott, Mudge Trevor, Chakrabarti Chaitali, Flautner Krisztian. AnySP: Anytime anywhere anyway signal processing//Proceedings of the ACM/IEEE International Symposium on Computer Architecture. Austin, USA, 2009: 128-139.
5. Kagstrom B, Ling P, Van Loan C. GEMM-based Level 3 BLAS: High performance model implementations and performance evaluation benchmark. ACM Transactions on Mathematical Software, 1998, 24(3): 268-302.
6. Volkov V, Demmel J. Benchmarking GPUs to tune dense linear algebra//Proceedings of the ACM/IEEE Supercomputing. Austin, USA, 2008: 1-11.
7. Lin Colin Y, So Hayden K-H, Leong Philip H W. A model for peak matrix performance on FPGAs//Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines. Salt Lake City, USA, 2011: 251.
8. Nath R et al. An improved MAGMA GEMM for Fermi GPUs. USA: NVIDIA Technical Report: LAPACK WN #227, 2010.
9. Tan Guangming, Li Linchuan, Triechler Sean, Phillips Everett, Bao Yungang, Sun Ninghui. Fast implementation of DGEMM on Fermi GPU//Proceedings of the ACM/IEEE Supercomputing. Seattle, USA, 2011: 1-11.
10. Li Jiajia, Li Xingjian, Tan Guangming, Chen Mingyu, Sun Ninghui. An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs//Proceedings of the ACM/IEEE Supercomputing. Salt Lake City, USA, 2012: 377-386.


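Citing literature (4)

1. Deng Dingsheng. Research on an improved energy-saving data access technique in high performance computing [J]. Computer Science (计算机科学), 2015, 42(2): 191-197. Cited by: 1
2. Jiang Hongxu, Liu Tingshan, Li Huiyong, Zhang Ping, Duan Miyi. An efficient SRIO-based data transmission method for FPGA+DSP heterogeneous video processing systems [J]. Chinese Journal of Computers (计算机学报), 2015, 38(6): 1119-1130. Cited by: 22
3. Zhou Yuxuan, Yang Xu, Qin Chuanyi, Yang Zhiwei, Zhu Yifeng, Duan Jin. HDM network architecture and hybrid data distribution strategy [J]. Journal of Computer Research and Development (计算机研究与发展), 2020, 57(9): 1911-1927.
4. Yang Bing, Mao Zhen, Zhou Yongliang, Yu Guoliang, Zhang Cheng. Research on an airborne reconfigurable information processing system based on CPU+FPGA [J]. 仪器与设备 (Instruments and Equipment), 2022, 10(2): 126-135.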

