Matrix Multiplication Implementation Based on Array Types of Parray

Cited by: 1
Abstract: This paper introduces Parray, a programming interface designed for the architecture of heterogeneous (GPU-accelerated) clusters. Parray uses array types to separate the physical storage of data from its logical structure, and it uses a unified thread-array type to express the creation of the various kinds of processes and threads and the transfer of control flow between them. A matrix multiplication example demonstrates the characteristics of Parray programming: the program evolves from a single-CPU-thread program into a multi-CPU-thread program and then into a GPU-thread program, and each step requires only changes to the array types and minor modifications to the code. The paper also presents a high-performance GPU matrix multiplication (GEMM) implemented in Parray; on a single node of Tianhe-1A its measured performance is comparable to that of CUBLAS 4.0. Because the code operates directly on the logical structure of the data, the same GEMM code works on arrays with different physical storage layouts.
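The idea that one GEMM routine can serve arrays with different physical layouts can be illustrated with a minimal sketch. This is not Parray syntax and not the authors' implementation: `ArrayView`, `make_view`, `gemm`, and `to_nested` are hypothetical names introduced here, and plain Python lists stand in for CPU or GPU memory. The sketch only assumes that a "type" records the logical shape plus the strides that map logical indices to physical offsets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArrayView:
    """Hypothetical logical 2-D view over a flat buffer (not Parray syntax)."""
    data: List[float]   # physical storage
    rows: int           # logical shape
    cols: int
    row_stride: int     # physical step when the row index grows by one
    col_stride: int     # physical step when the column index grows by one

    def offset(self, i: int, j: int) -> int:
        # Map a logical index (i, j) to a physical offset.
        return i * self.row_stride + j * self.col_stride

    def get(self, i: int, j: int) -> float:
        return self.data[self.offset(i, j)]

    def set(self, i: int, j: int, value: float) -> None:
        self.data[self.offset(i, j)] = value


def make_view(nested: List[List[float]], layout: str) -> ArrayView:
    """Store a nested list in the requested physical layout."""
    rows, cols = len(nested), len(nested[0])
    if layout == "row_major":
        view = ArrayView([0.0] * (rows * cols), rows, cols, cols, 1)
    elif layout == "col_major":
        view = ArrayView([0.0] * (rows * cols), rows, cols, 1, rows)
    else:
        raise ValueError(f"unknown layout: {layout}")
    for i in range(rows):
        for j in range(cols):
            view.set(i, j, float(nested[i][j]))
    return view


def gemm(a: ArrayView, b: ArrayView, c: ArrayView) -> None:
    """Compute c = a * b using logical indices only."""
    assert a.cols == b.rows and c.rows == a.rows and c.cols == b.cols
    for i in range(a.rows):
        for j in range(b.cols):
            acc = 0.0
            for k in range(a.cols):
                acc += a.get(i, k) * b.get(k, j)
            c.set(i, j, acc)


def to_nested(view: ArrayView) -> List[List[float]]:
    """Read the view back in logical order."""
    return [[view.get(i, j) for j in range(view.cols)]
            for i in range(view.rows)]
```

Calling `gemm` with a row-major `a`, a column-major `b`, and an output `c` in either layout yields the same logical product. Only the order of values in `c.data` differs, which is the kind of layout independence the abstract describes.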
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, No. 12, pp. 2564-2573 (10 pages).
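Funding: Supported by the National High Technology Research and Development Program of China (863 Program) (2012AA010902, 2012AA010903), the National Natural Science Foundation of China (61240045, 61170053, 61432018, 61379048), the Postdoctoral Science Foundation (2013M540821), and the Key Science and Technology Research Project of the Education Department of Henan Province (13A520065).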
Keywords: GPU-accelerated cluster; programming method; matrix multiplication; programming interface; performance optimization