
Heterogeneous Collaborative Design of HPL for Domestic Heterogeneous Systems

Orchestrating HPL between CPU and China accelerator
Abstract: HPL is the Linpack benchmark package widely used in high-performance computing. In the traditional HPL algorithm, the matrix to be solved is distributed block-cyclically across all processors. Because the low-level matrix multiplication routine of the domestic accelerator (China Accelerator) is available only through a customized interface, the traditional HPL algorithm is not suited to heterogeneous CPU + China Accelerator systems. We therefore build dPEM (delicate Partition and Encapsulation on Matrix) on top of the customized interface to provide a general HPL test configuration environment. To make full use of the domestic heterogeneous system, we also design OA4MM (Orchestrating Algorithm for Matrix Multiplication), a scheduling algorithm for cooperative matrix multiplication across the CPU and the accelerator. Experiments confirm the effectiveness of dPEM and the efficiency of OA4MM, which improves performance by nearly 10% over the traditional heterogeneous HPL scheduling algorithm.
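The block-cyclic distribution that the abstract attributes to traditional HPL can be illustrated with a minimal sketch. It assumes a standard two-dimensional block-cyclic layout over a P x Q process grid with square blocks of size NB; the function names and parameters are hypothetical, and the sketch does not represent dPEM or OA4MM, whose details are not given in the abstract.

```python
def block_cyclic_owner(i, j, nb, p, q):
    """Return the (process_row, process_col) that owns global element (i, j).

    Assumes a 2D block-cyclic layout: nb x nb blocks dealt out
    round-robin over a p x q process grid (illustrative names).
    """
    return (i // nb) % p, (j // nb) % q


def local_blocks(prow, pcol, n_blocks, p, q):
    """List the global block coordinates held by process (prow, pcol)
    for a matrix made of n_blocks x n_blocks blocks."""
    return [(bi, bj)
            for bi in range(prow, n_blocks, p)
            for bj in range(pcol, n_blocks, q)]


if __name__ == "__main__":
    # Element (5, 2) with 2 x 2 blocks on a 2 x 2 process grid.
    print(block_cyclic_owner(5, 2, nb=2, p=2, q=2))
    # Blocks held by process (0, 1) in a 4 x 4 block matrix.
    print(local_blocks(0, 1, n_blocks=4, p=2, q=2))
```

Dealing blocks out cyclically keeps the work balanced as the factorization proceeds and the active part of the matrix shrinks, which is why the layout is the usual starting point for HPL.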
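Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, PKU Core Journal, 2018, No. 1, pp. 10-14 (5 pages)
Funding: National Key R&D Program of China (2017YFB0202104); National Natural Science Foundation of China (61602495, 61402039, 11401580, 11665012); Open Project of the State Key Laboratory for Novel Software Technology, Nanjing University (KFKT2016B25); Pre-research Program of the National University of Defense Technology (ZK16-03-06); State Key Laboratory Special Fund (Y62612A87S); Open Fund of the Key Laboratory of Spectral Imaging Technology, Chinese Academy of Sciences (LIST201602D)
Keywords: HPL; China accelerator; delicate partition and encapsulation on matrix (dPEM); orchestrating algorithm for matrix multiplication (OA4MM)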
