Abstract
This paper studies acceleration techniques for the parallel method of moments (MoM) on CPU/GPU heterogeneous clusters. An MPI/CUDA software architecture is designed that solves the problem of performing parallel LU factorization across nodes on a CPU/GPU heterogeneous cluster. The architecture is based on a two-dimensional block-cyclic distribution of the matrix; it uses MPI for communication between compute nodes and the GPU to accelerate the matrix update step. To overcome the limit that GPU memory places on the size of the matrix that can be factorized, a "GPU memory-host memory" out-of-core algorithm is further studied. To optimize performance, a design based on CUDA streams and asynchronous communication is proposed; it overlaps GPU communication with computation, effectively hides the GPU communication time, and yields a clear speedup.
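The two-dimensional block-cyclic distribution named in the abstract can be illustrated with a minimal sketch. It assumes the conventional ScaLAPACK-style mapping with the first block placed on process (0, 0); the abstract does not give the authors' exact layout, and the function names and the parameters `nb`, `prow`, and `pcol` (block size and process-grid dimensions) are hypothetical.

```python
def block_cyclic_owner(i, j, nb, prow, pcol):
    """Return the (process row, process column) that owns global entry (i, j).

    Assumption: nb x nb blocks are dealt out cyclically over a prow x pcol
    process grid, starting at process (0, 0).
    """
    return (i // nb) % prow, (j // nb) % pcol


def local_index(i, j, nb, prow, pcol):
    """Return the (row, col) position of global entry (i, j) in its owner's local array."""
    local_row = (i // (nb * prow)) * nb + i % nb
    local_col = (j // (nb * pcol)) * nb + j % nb
    return local_row, local_col


if __name__ == "__main__":
    # Show which process owns each entry of an 8 x 8 matrix
    # with 2 x 2 blocks on a 2 x 2 process grid.
    n, nb, prow, pcol = 8, 2, 2, 2
    for i in range(n):
        print(" ".join("%d%d" % block_cyclic_owner(i, j, nb, prow, pcol)
                       for j in range(n)))
```

Because consecutive blocks go to different processes, every process keeps a share of the trailing submatrix as the factorization advances, which is the usual reason this layout is chosen for parallel LU.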
Source
Journal of Microwaves (《微波学报》)
CSCD
PKU Core (北大核心)
2014, Issue S1, pp. 51-54 (4 pages)
Keywords
method of moments (MoM)
heterogeneous platform
GPU acceleration
parallel
out-of-core
communication hiding