摘要
【目的】LDLT分解是求解很多稀疏对称线性系统的有效工具之一,尤其是对于迭代法难以收敛的问题。然而在GPU上实现LDLT分解存在困难,因为分解过程中存在数据依赖和不规则的数据访问。【方法】本文设计并实现了一个基于GPU的稀疏对称矩阵的LDLT分解,它采用Cholesky的符号分解和右视分解算法、稀疏矩阵依赖图的层次划分,以及CUDA的动态并行核调度技术,算法的所有三层循环都并行化,从而获得更高的并行度。【结果】实验结果表明,针对稀疏对称矩阵的一个典型的测试集,在GPU上实现的LDLT分解相对于UMFPACK最高加速46.2倍。【结论】LDLT分解CUDA实现策略可为高性能GPU异构平台上开展稀疏矩阵的高性能数值算法研究与实现提供借鉴。
[Objective]LDLT decomposition is an effective tool to solve many problems in sparse symmetric linear systems,especially for those problems which are hard to converge using iterative solvers.However,it is difficult to implement LDLT on the GPU for data dependency and irregular data access during the factorization.[Methods]In this paper,an effective GPU-based LDLT decomposition method of sparse symmetric matrix is designed and implemented based on Cholesky symbolic decomposition,right-looking decomposition algorithm and level partition of the dependency graph for the sparse matrix.By using controlled kernel launch for CUDA dynamic parallelism,all three loops of the algorithm are parallelized,so the proposed method can achieve higher parallelism.[Results]Experimental results show that the implementation of LDLT on GPU can achieve a maximum speedup of 46.2 compared to UMFPACK for a typical collection of sparse symmetric matrix.[Conclusions]CUDA implementation of LDLT can give reference to high performance numerical algorithm research and implementation for sparse matrix on GPU-based heterogeneous platforms.
作者
陈鑫峰
王武
Chen Xinfeng;Wang Wu(Computer Network Information Center,Chinese Academy of Sciences,Beijing 100190,China;University of Chinese Academy of Sciences,Beijing 100049,China)
出处
《数据与计算发展前沿》
CSCD
2021年第3期136-147,共12页
Frontiers of Data & Computing
基金
国家重点研发计划项目“复杂电磁环境高性能应用软件系统研制及应用示范”(2017YFB 0202502)
中国科学院“十三五”信息化专项“科研信息化应用工程”(XXH13506-405)。
关键词
LDLT分解
右视算法
GPU
动态并行
LDLT decomposition
right-looking algorithm
GPU
dynamic parallelism