Funding: Supported by the Scientific Research Program Funded by Shaanxi Provincial Education Department (20JY058).
Abstract: To study and optimize the performance of general-purpose computing on graphics processing units (GPGPU) based on a single instruction multiple threads (SIMT) processor for neural network applications, this work contributes a self-developed SIMT processor named Pomelo and the associated assembly program. The parallel mechanism of the SIMT computing mode and the self-developed Pomelo processor are briefly introduced. A common convolutional neural network (CNN) is built to verify the compatibility and functionality of the Pomelo processor. A CNN computing flow with task-level and hardware-level optimization is adopted on the Pomelo processor. A specific algorithm for organizing a Z-shaped memory structure is developed, which aims to reduce memory access in mass data computing tasks. With the combined adaptation and optimization strategy above, the experimental results demonstrate that reducing memory access in the SIMT computing mode plays a crucial role in improving performance. A 6.52× performance gain is achieved in the case of 4 processing elements.
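The abstract does not define the Z-shaped memory structure. One common layout with that name is Z-order (Morton order), in which the bits of a 2D coordinate are interleaved so that neighbouring elements of a tile sit close together in linear memory. The sketch below illustrates that idea only, on the assumption that it is related to what the authors did; the names `spread_bits` and `z_index` are hypothetical and not taken from the paper.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "Z-shaped" is read here as Z-order (Morton) indexing.
// Spread the low 16 bits of v so that input bit i lands on output bit 2*i.
constexpr std::uint32_t spread_bits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Linear offset of element (x, y): x bits occupy even positions,
// y bits occupy odd positions, which traces a Z pattern over each 2x2 block.
inline std::uint32_t z_index(std::uint32_t x, std::uint32_t y) {
    assert(x < 65536u && y < 65536u);  // each coordinate must fit in 16 bits
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

With this indexing, the four elements of any aligned 2x2 block occupy four consecutive offsets, which is the kind of locality that can cut memory accesses in convolution-style workloads.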
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2021YFB0300600; the National Natural Science Foundation of China under Grant Nos. 92270206, T2125013, 62032023, 61972377, T2293702, and 12274360; the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No. YSBR-005; the Network Information Project of Chinese Academy of Sciences under Grant No. CASWX2021SF-0103; and the Key Research Program of Chinese Academy of Sciences under Grant No. ZDBSSSW-WHC002.
Abstract: The growing demand for semiconductor device simulation poses a major challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed-precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream execution, kernel fusion, and redundant computation removal are proposed to further increase utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a system of 10 million silicon atoms, attaining a peak performance of 34.8 PFLOPS (21.2% of the peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
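Of the system-level strategies listed above, kernel fusion is the easiest to show in isolation. The sketch below is a CPU-side analogue in plain C++, not the authors' accelerator code: it contrasts two passes that communicate through a temporary buffer with a single fused pass that computes the same result. The function names and the scale-then-add operation are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two passes over memory. The intermediate buffer tmp is written
// in the first loop and read back in the second.
std::vector<double> scale_then_add_unfused(const std::vector<double>& x,
                                           const std::vector<double>& y,
                                           double alpha) {
    assert(x.size() == y.size());
    std::vector<double> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = alpha * x[i];
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = tmp[i] + y[i];
    return out;
}

// Fused: one pass, no intermediate buffer, so each element of x and y is
// read once and each element of out is written once.
std::vector<double> scale_then_add_fused(const std::vector<double>& x,
                                         const std::vector<double>& y,
                                         double alpha) {
    assert(x.size() == y.size());
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = alpha * x[i] + y[i];
    return out;
}
```

On an accelerator, the same transformation would also remove one kernel launch and one round trip through device memory, which is where the benefit usually comes from.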
Abstract: To effectively solve the single-source shortest path (SSSP) problem for massive road networks in geographical information systems, a new synchronization method is proposed for implementations of the parallel SSSP algorithm. It implements a spinlock in inline assembly language to keep the overhead of coordinating multiple threads small. The performance of our method is compared with the widely used Pthreads application programming interfaces and the powerful sequential solution given by DIMACS. The experimental platform is a shared-address-space workstation with two processors (i.e., eight cores) at a clock speed of 3 GHz. The problem instances for the experiments comprise a directed road network of the USA with more than 23 million vertices and 57 million edges, and its 11 subnetworks of varying sizes. This method solves the SSSP problem on the USA road network in 1231 ms, while Pthreads takes 1808 ms and the DIMACS sequential solution takes 4856 ms. It achieves a speedup of 3.95 over the sequential solution, which is 47% faster than Pthreads with its speedup of 2.69. The larger the instance, the better our method performs.
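The abstract describes a spinlock written in inline assembly but does not show it. As a portable stand-in, the sketch below builds the same kind of busy-waiting lock on `std::atomic_flag` from the C++ standard library instead of inline assembly. The class name `SpinLock` is hypothetical, and this is a minimal illustration of the technique rather than the authors' implementation.

```cpp
#include <atomic>

// Minimal test-and-set spinlock (assumption: std::atomic_flag replaces the
// paper's inline assembly; the locking idea is the same).
class SpinLock {
public:
    // Busy-wait until the flag was previously clear, then own the lock.
    void lock() noexcept {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin: another thread holds the lock
        }
    }

    // Single attempt: returns true if the lock was acquired.
    bool try_lock() noexcept {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    // Release the lock so that a spinning thread can proceed.
    void unlock() noexcept {
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```

Because the class provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<SpinLock>`. A spinlock of this kind avoids the sleep and wake-up cost of a mutex, which is plausibly why it helps when critical sections are very short, as in per-vertex distance updates.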