Abstract
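In high-performance computing, the accumulation of rounding errors when solving large-scale, long-duration, and ill-conditioned problems can distort an algorithm's final numerical results. Results may also differ from run to run under different computing software and hardware resources. Because developers rely on these results to debug programs and check correctness, such inconsistency interferes with research work, so the reproducibility of numerical results becomes essential. Targeting the Phytium processor and building on the OpenBLAS software framework, this paper combines the reproducible method proposed in ReproBLAS, developed by Professor Demmel's team at the Berkeley National Laboratory in the United States, with the multilayer blocking technique proposed by Castaldo, and uses rounding error analysis and error-free transformations to design a multithreaded reproducible DGEMV algorithm. Numerical experiments show that the proposed algorithm achieves numerical reproducibility and that its output is identical to that of ReproBLAS, which verifies its reliability. In the same test environment, the proposed algorithm runs at least 2 times as fast as the ReproBLAS implementation. The proposed algorithm is also compared with the reproducible DGEMV function in OzBLAS, proposed by Mukunoki of RIKEN in Japan: with a single thread it runs at least 20 times as fast as the OzBLAS algorithm, and with the same number of threads it runs at least 9 times as fast. Both theoretical analysis and numerical experiments indicate that the improved algorithm outperforms existing reproducible numerical algorithms internationally.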
In high-performance computing, the accumulation of rounding errors when solving large-scale, long-duration, and ill-conditioned problems can invalidate numerical results. Developers rely on these results to debug programs and check their correctness, so the reproducibility of an algorithm's numerical results becomes very important. Based on the OpenBLAS framework, and combining Demmel's reproducible method in ReproBLAS with the multilayer blocking technique proposed by Castaldo, this paper designs a reproducible multithreaded DGEMV algorithm for the Phytium processor using rounding error analysis and error-free transformations. Numerical experiments show that the algorithm is reproducible and that its output is identical to that of ReproBLAS, which verifies its reliability. Our algorithm runs at least 2 times as fast as the ReproBLAS implementation. Compared with the DGEMV function of OzBLAS proposed by Mukunoki, our algorithm runs at least 20 times as fast with a single thread and at least 9 times as fast with the same number of threads. Theoretical analysis and numerical experiments indicate that the improved algorithm is accurate, validated, and efficient.
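The reproducibility problem and the notion of an error-free transformation mentioned in the abstract can be illustrated with a small sketch. This is not the paper's DGEMV algorithm or the ReproBLAS method. The function names `two_sum`, `chunked_sum`, and `exact_sum` are hypothetical, and the exact rational sum is a slow stand-in used only to show what an order-independent result looks like.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum, an error-free transformation.

    Returns (s, e) where s is the rounded sum a + b and e is the
    rounding error, so that s + e equals a + b exactly (barring overflow).
    """
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


def naive_sum(xs):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for x in xs:
        total += x
    return total


def chunked_sum(xs, chunks):
    """Sum xs as `chunks` partial sums, then add the partial sums.

    This mimics how a multithreaded reduction splits the work; the
    result can change with the number of chunks.
    """
    n = len(xs)
    bounds = [n * i // chunks for i in range(chunks + 1)]
    partials = [naive_sum(xs[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]
    return naive_sum(partials)


def exact_sum(xs):
    """Order-independent reference: sum exactly as rationals, round once.

    Illustration only (assumption: not the technique used in the paper).
    """
    return float(sum(Fraction(x) for x in xs))


xs = [1e16, 1.0, -1e16, 1.0]

# The same data gives different answers for different "thread counts".
print(chunked_sum(xs, 1), chunked_sum(xs, 2))

# The exact sum does not depend on the order of the inputs.
print(exact_sum(xs), exact_sum(list(reversed(xs))))

# TwoSum recovers the part of 1.0 that is lost when it is added to 1e16.
print(two_sum(1e16, 1.0))
```

A reproducible BLAS routine has to give the same bits regardless of how the reduction is split across threads, without paying the cost of exact rational arithmetic. Error-free transformations such as TwoSum are one building block for doing that in ordinary floating-point arithmetic.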
Authors
陈磊
唐滔
漆海俊
姜浩
何康
CHEN Lei; TANG Tao; QI Hai-jun; JIANG Hao; HE Kang (College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
Source
《计算机科学》
CSCD
PKU Core Journal
2022, No. 10, pp. 27-35 (9 pages)
Computer Science
Funding
National Key Research and Development Program of China (2020YFA0709803)
173 Program (2020-JCJQ-ZD-029)
Science Challenge Project (TZ2016002).