Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2021YFB0300600; the National Natural Science Foundation of China under Grant Nos. 92270206, T2125013, 62032023, 61972377, T2293702, and 12274360; the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No. YSBR-005; the Network Information Project of Chinese Academy of Sciences under Grant No. CASWX2021SF-0103; and the Key Research Program of Chinese Academy of Sciences under Grant No. ZDBSSSW-WHC002.
Abstract: The growing demand for semiconductor device simulation poses a major challenge for large-scale electronic structure calculations. Among various methods, the linearly scaling three-dimensional fragment (LS3DF) method exhibits excellent scalability in large-scale simulations. Based on algorithmic and system-level optimizations, we propose a highly scalable and highly efficient implementation of LS3DF on a domestic heterogeneous supercomputer equipped with accelerators. In terms of algorithmic optimizations, the original all-band conjugate gradient algorithm is refined to achieve faster convergence, and mixed-precision computing is adopted to increase overall efficiency. In terms of system-level optimizations, the original two-layer parallel structure is replaced by a coarse-grained parallel method. Optimization strategies such as multi-stream execution, kernel fusion, and redundant computation removal are proposed to further increase utilization of the computational power provided by the heterogeneous machines. As a result, our optimized LS3DF can scale to a system of 10 million silicon atoms, attaining a peak performance of 34.8 PFLOPS (21.2% of the peak). All the improvements can be adapted to next-generation supercomputers for larger simulations.
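The efficiency figure quoted above can be turned into a quick back-of-the-envelope check. The short sketch below assumes that "21.2% of the peak" refers to the theoretical peak of the machine partition used in the run; the abstract does not state this explicitly. The function name `implied_peak_pflops` is hypothetical and is used only for illustration.

```python
def implied_peak_pflops(sustained_pflops: float, fraction_of_peak: float) -> float:
    """Theoretical peak implied by a sustained rate and its share of peak."""
    if not 0.0 < fraction_of_peak <= 1.0:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return sustained_pflops / fraction_of_peak


# Figures taken from the abstract: 34.8 PFLOPS at 21.2% of peak.
# Assumption: the percentage is relative to the partition actually used.
peak = implied_peak_pflops(34.8, 0.212)
print(f"implied theoretical peak: {peak:.1f} PFLOPS")
```

Under that assumption, the two reported numbers imply a theoretical peak of roughly 164 PFLOPS for the resources used.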
Funding: The National Key Research and Development Program of China under Grant Nos. 2018YFB0204400, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300, and XDC01030000; the National Natural Science Foundation of China under Grant Nos. 6197237 and 61702483; and the CAS QYZDJ-SSW-JSC035 Funding.
Abstract: Genomic sequence alignment is the most critical and time-consuming step in genomic analysis. Alignment algorithms generally follow a seed-and-extend model. Acceleration of the extension phase for sequence alignment (e.g., the Smith-Waterman algorithm) has been well explored in computing-centric architectures on field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and graphics processing units (GPUs). Compared with the extension phase, the seeding phase is more critical and essential. However, the seeding phase is memory-bound, i.e., it is limited by fine-grained random memory access and limited parallelism on conventional systems. In this paper, we argue that the processing-in-memory (PIM) concept could be a viable solution to address these problems. This paper describes "PIM-Align", an application-driven near-data processing architecture for sequence alignment. In order to achieve memory-capacity-proportional performance by taking advantage of 3D-stacked dynamic random access memory (DRAM) technology, we propose a lightweight message mechanism between different memory partitions, and a specialized hardware prefetcher for the memory access patterns of sequence alignment. Our evaluation shows that the proposed architecture can achieve 20x and 1820x speedup when compared with the best available ASIC implementation and the software running on a 32-thread CPU, respectively.
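To make the seed-and-extend model concrete, here is a minimal software sketch in Python. It is not the PIM-Align design, which is a hardware architecture; it is a toy illustration under assumed parameters (k-mer length `k`, window margin `flank`, and the scoring values are all hypothetical choices). Seeding is done with a hash index of k-mers, whose lookups are the kind of fine-grained, scattered memory accesses the abstract describes, and extension uses a Smith-Waterman local alignment score with a linear gap penalty.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seeds(read, index, k):
    """Seeding: one index lookup per read k-mer (scattered memory accesses)."""
    seeds = []
    for read_pos in range(len(read) - k + 1):
        for ref_pos in index.get(read[read_pos:read_pos + k], []):
            seeds.append((read_pos, ref_pos))
    return seeds


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Extension: best local alignment score, linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def align(read, reference, k=4, flank=2):
    """Seed, then extend each candidate window; return (score, window_start)."""
    index = build_kmer_index(reference, k)
    best = (0, -1)
    seen = set()
    for read_pos, ref_pos in find_seeds(read, index, k):
        start = max(0, ref_pos - read_pos - flank)
        if start in seen:  # several seeds often point at the same window
            continue
        seen.add(start)
        window = reference[start:start + len(read) + 2 * flank]
        score = smith_waterman_score(read, window)
        if score > best[0]:
            best = (score, start)
    return best


reference = "ACGTTGCAAGGCTTAACCGGTT"
read = "AAGGCTTA"
print(align(read, reference))
```

In this toy, every seed lookup touches an unrelated hash bucket, while the extension step works on a small contiguous window. That contrast is one way to see why the seeding phase tends to be limited by memory access rather than by arithmetic.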