Abstract
The functions in the BLAS library fall into two classes: real functions and complex functions. Matrix multiplication is the core routine of BLAS, and many other BLAS functions are implemented by calling it. Taking the characteristics of the Loongson-3A architecture into account, this paper analyzes the matrix multiplication computation and chooses to partition the matrices into blocks first and then divide the tasks among threads. This reduces the amount of data copied, and hence the data movement between cache and main memory, and increases the reuse of the copied data. The computation in each child thread is further optimized with loop unrolling, instruction scheduling, and data blocking. In multithreaded runs, the optimized ZGEMM function is twice as fast as the one in the ATLAS library.
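The blocking and data-reuse idea described in the abstract can be illustrated with a minimal C99 sketch. This is not the paper's implementation: the function names `zgemm_naive` and `zgemm_blocked_rows`, the block size `nb`, the row-major square layout, and the unroll factor of 2 are all assumptions made for illustration, and the Loongson-3A specific instruction scheduling is not shown. Each block of B is copied once into a contiguous buffer and reused for a range of rows of A; separate row ranges could, in principle, be handed to separate threads.

```c
#include <assert.h>
#include <complex.h>
#include <stdlib.h>

/* Reference triple loop: C += A * B, all n x n, row-major. */
static void zgemm_naive(int n, const double complex *A,
                        const double complex *B, double complex *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double complex sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] += sum;
        }
}

/* Blocked variant (hypothetical helper): updates rows [row0, row1) of C.
 * Each nb x nb block of B is copied once into a contiguous buffer and
 * then reused for every row in the range.
 * Returns 0 on success, -1 if the buffer cannot be allocated. */
static int zgemm_blocked_rows(int n, int nb, int row0, int row1,
                              const double complex *A,
                              const double complex *B, double complex *C)
{
    assert(nb > 0 && 0 <= row0 && row0 <= row1 && row1 <= n);
    double complex *pack = malloc((size_t)nb * nb * sizeof *pack);
    if (!pack)
        return -1;

    for (int kk = 0; kk < n; kk += nb) {
        int kb = (n - kk < nb) ? n - kk : nb;      /* rows in this block */
        for (int jj = 0; jj < n; jj += nb) {
            int jb = (n - jj < nb) ? n - jj : nb;  /* columns in this block */

            /* Copy the current block of B into contiguous storage. */
            for (int k = 0; k < kb; k++)
                for (int j = 0; j < jb; j++)
                    pack[k * jb + j] = B[(kk + k) * n + (jj + j)];

            /* Reuse the packed block for every row in the range. */
            for (int i = row0; i < row1; i++)
                for (int k = 0; k < kb; k++) {
                    double complex a = A[i * n + kk + k];
                    const double complex *brow = pack + k * jb;
                    double complex *crow = C + i * n + jj;
                    int j = 0;
                    for (; j + 1 < jb; j += 2) {   /* unrolled by 2 */
                        crow[j]     += a * brow[j];
                        crow[j + 1] += a * brow[j + 1];
                    }
                    for (; j < jb; j++)            /* leftover column */
                        crow[j] += a * brow[j];
                }
        }
    }
    free(pack);
    return 0;
}
```

Calling `zgemm_blocked_rows` once per disjoint row range covers the whole product; in a threaded setting each range would be one task, with each thread packing its own copy of the B blocks in this simplified version.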
Source
Electronic Technology (Shanghai)
2011, No. 12, pp. 1-3 (3 pages)
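Funding
National 863 Program
System Software Porting and Development for Multi-core Loongson Processors (2008AA01902)
National Science and Technology Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software
Development of Communication and Mathematics Libraries Based on Loongson-3 (2009ZX01028-002-003-005)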