Optimization of Linpack for the Loongson 3B Processor (cited by 3)
Abstract: HPL (High-Performance Linpack) is the Linpack benchmark package widely used in high-performance computing. Based on the architectural features of the Loongson 3B processor, a matrix blocking strategy is designed for matrix multiplication, the core of Linpack, and the Loongson 3B cache lock mechanism is used to lock frequently accessed data blocks in the cache, which significantly reduces the cache miss rate. An efficient prefetching algorithm is also designed for the processor's memory access acceleration unit, so that computation time hides memory access time. In addition, hot-spot functions called by Linpack, such as dtrsm and row swapping, are optimized, and the Linpack parameters are tuned through parameter training. Experimental results show that on the Loongson 3B processor, measured Linpack performance reaches about 60% of theoretical peak for both a single node (4 cores) and two nodes (8 cores), and that the optimized Linpack runs about 10 times faster than the unoptimized version.
Source: Journal of Shenzhen University (Science and Engineering), 2014, No. 3, pp. 286-292 (7 pages). Indexed in EI, CAS, and the Peking University Core Journals list.
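Funding: National High Technology Research and Development Program of China (2012AA01A30904); Guangdong Province Academician Workstation Construction Project (2012B090500020).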
Keywords: computer architecture; Loongson 3B processor; linear system package; matrix multiplication; data prefetching
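
The abstract names matrix blocking, keeping hot data blocks resident in cache, and prefetching as the main ideas behind the matrix multiplication optimization. The following is a minimal, portable sketch of those ideas in C and is not the authors' implementation. The block size `BLOCK`, the function names, and the packing buffer are assumptions made for illustration. The Loongson 3B cache lock and its memory access acceleration unit are hardware-specific and are not used here: the packed copy of a block of B only stands in for a locked block, and the GCC/Clang builtin `__builtin_prefetch` stands in for the hardware prefetch path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block size; the paper derives its own blocking from the
   Loongson 3B cache hierarchy, which is not reproduced here. */
#define BLOCK 64

/* Software prefetch hint where the compiler supports it, otherwise a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop: C += A * B for n x n row-major matrices. */
void dgemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* Blocked variant: C += A * B. Each BLOCK x BLOCK tile of B is copied into a
   small contiguous buffer and reused for every row of A, so the tile stays
   hot in cache (a software analogue of locking it there). */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    double bpack[BLOCK * BLOCK];
    assert(n > 0);

    for (size_t kk = 0; kk < n; kk += BLOCK) {
        size_t kb = min_size(BLOCK, n - kk);
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t jb = min_size(BLOCK, n - jj);

            /* Pack the tile B[kk..kk+kb) x [jj..jj+jb) row by row. */
            for (size_t k = 0; k < kb; k++)
                memcpy(&bpack[k * BLOCK], &B[(kk + k) * n + jj],
                       jb * sizeof(double));

            for (size_t i = 0; i < n; i++) {
                /* Hint that the next row segment of A will be read soon,
                   so its load overlaps with the arithmetic below. */
                if (i + 1 < n)
                    PREFETCH_READ(&A[(i + 1) * n + kk]);

                for (size_t k = 0; k < kb; k++) {
                    double a = A[i * n + kk + k];
                    for (size_t j = 0; j < jb; j++)
                        C[i * n + jj + j] += a * bpack[k * BLOCK + j];
                }
            }
        }
    }
}
```

Both functions accumulate into `C`, so the caller zeroes `C` first when a plain product is wanted. The edge handling with `min_size` lets `n` be any positive size, not only a multiple of `BLOCK`.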
