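Abstract: The general-purpose mathematical library PETSc (Portable, Extensible Toolkit for Scientific Computation) is a foundational module for high-performance computing and one of the basic algorithm libraries of the supercomputer computing environment; its performance directly affects the efficiency of the high-performance numerical applications that call it. Targeting Sunway TaihuLight, the world's first 100-petaflops heterogeneous supercomputer, two typical PETSc examples were selected according to practical research needs: ex5 (solving a linear system on a single node) and ex19 (solving the 2D driven cavity problem on multiple nodes). Analysis of the runs showed that the hotspots are mainly seven core functions of the PETSc library. For these seven core functions (mainly vector and matrix operations), heterogeneous parallel algorithms were proposed and implemented, together with performance optimizations tailored to the machine's heterogeneous architecture. In experiments on the supercomputer, the parallel core functions reach a maximum speedup of 16.4 on a single node with 4 master cores and 256 slave cores. In the multi-node case, with an input size of 16 384, 8192 nodes achieve a speedup of 32 relative to 256 nodes, and the speedup grows almost linearly with the number of heterogeneous processors, which indicates that the parallel algorithms for the PETSc core functions scale well on Sunway TaihuLight.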
Funding: This work is sponsored by the Sichuan Science and Technology Program (2020YFS0355 and 2020YFG0479).
Abstract: In recent years, research on and applications of graph structures have attracted growing attention. Breadth-first search (BFS), the most typical graph algorithm, is widely used in many fields. However, traditional serial BFS often performs poorly, especially when traversing large-scale graphs, and large-scale graphs are very common in scientific research. At the same time, supercomputer performance has advanced greatly: Sunway TaihuLight (SW), a supercomputer system developed in China, has topped the TOP500 list three consecutive times. Applying this computing power to large-scale graph traversal can greatly improve traversal efficiency. This paper describes how to implement BFS on Sunway TaihuLight and reports the results achieved. The implementation uses MPI and athread, the thread library of the SW platform; combined with several graph partitioning methods, these techniques improve traversal performance by a factor of several tens.
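As a rough illustration of how a partitioned, level-synchronous BFS proceeds, here is a minimal single-process sketch. It is not the paper's implementation: the abstract names MPI and athread but gives no code, so the ranks are simulated with plain loops, and the 1D block partition and the names `owner` and `partitioned_bfs` are assumptions made for this example.

```python
from collections import defaultdict


def owner(v, num_vertices, num_parts):
    """Simulated rank that owns vertex v under a 1D block partition (assumed scheme)."""
    block = (num_vertices + num_parts - 1) // num_parts
    return v // block


def partitioned_bfs(adj, source, num_parts):
    """Level-synchronous BFS; returns each vertex's level, or -1 if unreachable."""
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    # Each simulated rank keeps only the frontier vertices it owns.
    frontiers = [[] for _ in range(num_parts)]
    frontiers[owner(source, n, num_parts)].append(source)
    depth = 0
    while any(frontiers):
        # Phase 1: every rank expands its local frontier and buckets
        # the discovered neighbours by the rank that owns them.
        outbox = defaultdict(list)
        for rank in range(num_parts):
            for u in frontiers[rank]:
                for v in adj[u]:
                    outbox[owner(v, n, num_parts)].append(v)
        # Phase 2: stand-in for the all-to-all exchange; each owner
        # marks vertices it has not seen and builds its next frontier.
        depth += 1
        frontiers = [[] for _ in range(num_parts)]
        for rank, received in outbox.items():
            for v in received:
                if level[v] == -1:
                    level[v] = depth
                    frontiers[rank].append(v)
    return level
```

In a real distributed version, the `outbox` exchange would become a collective communication step and the inner expansion loop is the part that a many-core thread library could parallelize; both mappings are illustrative guesses here, not details taken from the paper.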
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA015306); the Science and Technology Plan of Beijing Municipality (No. Z161100000216147); the National Natural Science Foundation of China (No. 61762074); the Youth Foundation Program of Qinghai University (No. 2016-QGY-5); and the National Natural Science Foundation of Qinghai Province (No. 2019-ZJ7034).
Abstract: The Weighted Essentially Non-Oscillatory (WENO) scheme is a numerical method for solving hyperbolic conservation laws, and it is well suited to high-density fluid interface instability problems with strong intermittency. These problems have large and complex flow structures. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies that optimize application performance for the particular system's architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents the design and performance optimization of a heterogeneous parallel algorithm for a high-order WENO scheme on Sunway TaihuLight. We analyze the characteristics of the kernel functions and propose an appropriate heterogeneous parallel model. We also determine the best division strategy for the computing tasks and implement the parallel algorithm on Sunway TaihuLight. Using memory access optimization, data dependency elimination, and vectorization, our parallel algorithm achieves up to 172× speedup on a single node and an additional 58× speedup on 64 nodes, with nearly linear scalability.
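To make the kind of kernel involved more concrete, the following sketch computes one interface value with the classic fifth-order WENO reconstruction of Jiang and Shu. The abstract only says "high-order WENO" and does not state the order or variant, so this choice, the function name `weno5_left`, and the value of `eps` are assumptions for illustration only.

```python
def weno5_left(v, eps=1e-6):
    """Left-biased WENO5 value at x_{i+1/2} from cell averages v = [v_{i-2}, ..., v_{i+2}]."""
    vm2, vm1, v0, vp1, vp2 = v

    # Third-order candidate reconstructions on the three sub-stencils.
    q0 = (2.0 * vm2 - 7.0 * vm1 + 11.0 * v0) / 6.0
    q1 = (-vm1 + 5.0 * v0 + 2.0 * vp1) / 6.0
    q2 = (2.0 * v0 + 5.0 * vp1 - vp2) / 6.0

    # Smoothness indicators: large where a sub-stencil crosses a discontinuity.
    b0 = 13.0 / 12.0 * (vm2 - 2.0 * vm1 + v0) ** 2 + 0.25 * (vm2 - 4.0 * vm1 + 3.0 * v0) ** 2
    b1 = 13.0 / 12.0 * (vm1 - 2.0 * v0 + vp1) ** 2 + 0.25 * (vm1 - vp1) ** 2
    b2 = 13.0 / 12.0 * (v0 - 2.0 * vp1 + vp2) ** 2 + 0.25 * (3.0 * v0 - 4.0 * vp1 + vp2) ** 2

    # Nonlinear weights built from the linear weights (1/10, 6/10, 3/10).
    a0 = 0.1 / (eps + b0) ** 2
    a1 = 0.6 / (eps + b1) ** 2
    a2 = 0.3 / (eps + b2) ** 2
    total = a0 + a1 + a2

    return (a0 * q0 + a1 * q1 + a2 * q2) / total
```

Each interface value depends only on a five-cell window, so the reconstruction can be evaluated independently per cell; this is the sort of property that makes such kernels candidates for the many-core and vectorized execution the abstract describes.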
Funding: Partly supported by the Supercomputer Application Project Trail Funding from Wuxi Jiangnan Institute of Computing Technology (BB2340000016); the Strategic Priority Research Program of Chinese Academy of Sciences (XDC01040100); the National Natural Science Foundation of China (21688102, 21803066); the Anhui Initiative in Quantum Information Technologies (AHY090400); the National Key Research and Development Program of China (2016YFA0200604); the Fundamental Research Funds for Central Universities (WK2340000091); the Chinese Academy of Sciences Pioneer Hundred Talents Program (KJ2340000031); and the Research Start-Up Grants (KY2340000094) and the Academic Leading Talents Training Program (KY2340000103) from University of Science and Technology of China.
Abstract: High performance computing (HPC) is a powerful tool for accelerating Kohn–Sham density functional theory (KS-DFT) calculations on modern heterogeneous supercomputers. Here, we describe a massively parallel implementation of the discontinuous Galerkin density functional theory (DGDFT) method on the Sunway TaihuLight supercomputer. The DGDFT method uses adaptive local basis (ALB) functions, generated on the fly during the self-consistent field (SCF) iteration, to solve the KS equations with a precision comparable to that of a plane-wave basis set. In particular, the DGDFT method adopts a two-level parallelization strategy that handles various types of data distribution, task scheduling, and data communication schemes, and combines it with the master–slave multi-thread heterogeneous parallelism of the SW26010 processor, enabling large-scale HPC KS-DFT calculations on the Sunway TaihuLight supercomputer. We show that the DGDFT method can scale up to 8,519,680 processing cores (131,072 core groups) on Sunway TaihuLight for studying the electronic structures of two-dimensional (2D) metallic graphene systems that contain tens of thousands of carbon atoms.
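The two core counts quoted above are consistent with each other. The abstract does not spell out the breakdown, but assuming the usual SW26010 layout of one management core plus 64 computing cores per core group, the arithmetic is:

```latex
\[
131{,}072 \ \text{core groups} \times (1 + 64) \ \text{cores per core group}
  = 131{,}072 \times 65
  = 8{,}519{,}680 \ \text{cores}
\]
```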
Funding: Supported by the National Key Research and Development Plan of China under Grant No. 2016YFB0200603; the National Natural Science Foundation of China under Grant No. 91530323; and the Beijing Natural Science Foundation of China under Grant No. JQ18001.
Abstract: With the advent of the big data era, the amount of sampled data and the dimensionality of data features are growing rapidly. Fast and efficient clustering of unlabeled samples based on feature similarity is therefore highly desirable. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention. To achieve high-performance k-means computations on modern multi-core and many-core systems, we propose a matrix-based fused framework that performs its computations on a distance matrix and, at the same time, improves memory reuse by fusing the distance-matrix computation with the nearest-centroid reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, which powers Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint, and a register blocking strategy to increase data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Block-size tuning and performance modeling are also discussed. Experiments on both randomly generated and real-world datasets show that our parallel implementation of k-means on SW26010 sustains a double-precision performance of over 348.1 Gflops, which is 46.9% of the peak performance and 84% of the theoretical performance upper bound on a single core group, and achieves nearly ideal scalability to the whole SW26010 processor with its four core groups. Performance comparisons with the previous state of the art on both CPU and GPU are also provided to show the superiority of our optimized k-means kernel.
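The idea of fusing the distance-matrix computation with the nearest-centroid reduction can be sketched in a few lines. This is only an interpretation of the abstract, not the authors' kernel: the function name `assign_fused`, the `block_size` parameter, and the pure-Python tiling are assumptions, and none of the SW26010-specific optimizations (register blocking, double buffering, instruction reordering) are modelled.

```python
def assign_fused(samples, centroids, block_size=2):
    """Return the index of the nearest centroid for every sample.

    Samples are processed in blocks; each block's tile of the distance
    matrix is reduced to labels immediately and then discarded, so the
    full (samples x centroids) matrix is never stored.
    """
    # ||c||^2 for each centroid, computed once and reused by every block.
    c_norms = [sum(x * x for x in c) for c in centroids]
    labels = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        # Tile of the distance matrix for this block. The ||s||^2 term is
        # omitted because it is constant per row and does not change the argmin.
        tile = [
            [c_norms[j] - 2.0 * sum(a * b for a, b in zip(s, c))
             for j, c in enumerate(centroids)]
            for s in block
        ]
        # Fused reduction: collapse each row of the tile to its argmin right away.
        for row in tile:
            labels.append(min(range(len(row)), key=row.__getitem__))
    return labels
```

The dot products inside the tile are what a matrix-based formulation would compute as a matrix product; keeping the tile small and reducing it at once is the memory-reuse benefit the abstract attributes to fusion.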