This paper aims to solve large-scale and complex isogeometric topology optimization problems that consume significant computational resources. A novel isogeometric topology optimization method with a hybrid CPU/GPU parallel strategy is proposed, and the hybrid parallel strategies for stiffness matrix assembly, equation solving, sensitivity analysis, and design variable update are discussed in detail. To ensure high CPU/GPU computing efficiency, a workload balancing strategy is presented for optimally distributing the workload between the CPU and the GPU. To illustrate the advantages of the proposed method, three benchmark examples are tested to verify the hybrid parallel strategy. The results show that the hybrid method is faster than both the serial CPU and the parallel GPU implementations, with speedups of up to two orders of magnitude.
With the advancement of Artificial Intelligence (AI) technologies and the accumulation of big Earth data, Deep Learning (DL) has become an important method for discovering patterns and understanding Earth science processes in the past several years. While successful in many Earth science areas, AI/DL applications are often challenging for computing devices. In recent years, Graphics Processing Unit (GPU) devices have been leveraged to speed up AI/DL applications, yet computational performance still poses a major barrier for DL-based Earth science applications. To address these computational challenges, we selected five existing sample Earth science AI applications, revised the DL-based models/algorithms, and tested the performance of multiple GPU computing platforms in supporting the applications. Application software packages, performance comparisons across different platforms, and other results are summarized. This article can help readers understand how various AI/ML Earth science applications can be supported by GPU computing, help researchers in the Earth science domain better adopt GPU computing (such as Supermicro, GPU clusters, and cloud computing-based platforms) for their AI/ML applications, and help them optimize their science applications to better leverage the computing device.
Conventional gradient-based full waveform inversion (FWI) is a local optimization, which is highly dependent on the initial model and prone to becoming trapped in local minima. Globally optimal FWI that can overcome this limitation is particularly attractive, but is currently limited by its enormous computational cost. In this paper, we propose a globally optimal FWI framework based on GPU parallel computing, which greatly improves efficiency and is expected to make globally optimal FWI more widely used. In this framework, we simplify and recombine the model parameters and optimize the model iteratively. Each iteration contains hundreds of individuals, each individual is independent of the others, and each individual involves forward modeling and a cost-function calculation. The framework is suitable for a variety of globally optimal algorithms, and we test it with the particle swarm optimization algorithm as an example. Both the synthetic and field examples achieve good results, indicating the effectiveness of the framework.
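The parallel structure described above — hundreds of independent individuals per iteration, each performing forward modeling and a cost-function evaluation — maps directly onto a GPU grid with one thread (or block) per individual. The following CUDA sketch is illustrative only: the misfit function and all names are assumptions, standing in for the far more expensive wave-equation forward modeling of real FWI.

```cuda
#include <cuda_runtime.h>

// Toy per-individual misfit: sum of squared differences between a candidate
// model and observed data (a stand-in for forward modeling + cost function).
__device__ float evalMisfit(const float *model, const float *observed, int nParams) {
    float cost = 0.0f;
    for (int i = 0; i < nParams; ++i) {
        float r = model[i] - observed[i];
        cost += r * r;
    }
    return cost;
}

// One thread evaluates one individual of the population; individuals are
// independent, so no synchronization between them is needed.
__global__ void evaluatePopulation(const float *models,   // nIndividuals x nParams
                                   const float *observed, // nParams
                                   float *costs,          // nIndividuals
                                   int nIndividuals, int nParams) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nIndividuals)
        costs[i] = evalMisfit(models + (size_t)i * nParams, observed, nParams);
}
```

Because the individuals never communicate within an iteration, the same kernel structure serves any population-based global optimizer (particle swarm, genetic algorithm, or multi-chain simulated annealing).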
The particle-in-cell (PIC) method has benefited greatly from GPU-accelerated heterogeneous systems. However, the performance of PIC on the GPU (graphics processing unit) is constrained by the interpolation operations in the weighting process. To address this problem, a fast weighting method for PIC simulation on GPU-accelerated systems was proposed to avoid atomic memory operations during the weighting process. The method was implemented by taking advantage of the GPU's thread synchronization mechanism and by dividing the problem space properly. Moreover, software-managed shared memory on the GPU was employed to buffer the intermediate data. The experimental results show that the method achieves speedups of up to 3.5 times compared with previous works, and runs 20.08 times faster on one NVIDIA Tesla M2090 GPU than on a single core of an Intel Xeon X5670 CPU.
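To illustrate the general idea of avoiding global atomic additions by giving each thread exclusive ownership of part of the grid, the sketch below performs 1-D charge deposition with particles pre-sorted by cell and nearest-grid-point weighting. These simplifications and all names are assumptions for illustration; the published method partitions the problem space differently and additionally buffers intermediate data in shared memory.

```cuda
#include <cuda_runtime.h>

// Sketch: atomics-free charge deposition for a 1-D PIC grid.
// Assumptions (not from the paper): particles are pre-sorted by cell, with
// cellStart/cellCount giving each cell's particle range, and nearest-grid-point
// weighting is used so a particle contributes only to its own cell.
__global__ void depositChargeNGP(const float *weight,    // per-particle charge weight
                                 const int   *cellStart, // first particle index of each cell
                                 const int   *cellCount, // number of particles in each cell
                                 float       *rho,       // charge density, one value per cell
                                 int nCells) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nCells) return;

    // Each thread owns exactly one cell, so it is the only writer of rho[c]
    // and no atomic operations are needed.
    float acc = 0.0f;
    int begin = cellStart[c];
    int end   = begin + cellCount[c];
    for (int p = begin; p < end; ++p)
        acc += weight[p];
    rho[c] = acc;
}
```

Higher-order weighting, where one particle contributes to several neighbouring nodes, requires the extra block-level synchronization and shared-memory buffering described in the abstract, but the ownership principle that removes the atomics is the same.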
Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. The users, who are domain experts rather than computer experts, prefer programming models closer to their domains (e.g., physics and biology) rather than MPI and OpenMP. This has led to the development of domain-specific programming frameworks that provide domain-specific programming interfaces but abstract away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the 5th fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.
This paper presents a parallel method for simulating real-time 3D deformable objects using a volume-preserving mass-spring system on tetrahedral meshes. In general, the conventional mass-spring system is manipulated as a force-driven method because it is fast, simple to implement, and its parameters can be controlled. However, the springs in a traditional mass-spring system can be excessively elongated, which causes severe stability and robustness issues that lead to poor shape restoration, simulation blow-up, and large volume loss of the deformable object. In addition, the traditional approach, which uses a serial central processing unit (CPU) process to solve the system in every frame, cannot handle complex deformable-object structures in real time. Therefore, first-order implicit constraint enforcement for a mass-spring model is utilized to achieve accurate visual realism of deformable objects while keeping the constraint error small. In this paper, we applied the distance constraint and volume conservation constraints to each tetrahedral element to improve the stability of deformable-object simulation using the mass-spring system, so that the objects behave like their real-world counterparts. To reduce the computational complexity while ensuring stable simulation, we applied a method that utilizes the OpenGL compute shader, a part of the OpenGL Shading Language (GLSL) that executes on the graphics processing unit (GPU), to solve the numerical problems effectively. We applied the proposed methods to experimental volumetric models and compared the volume percentages of all objects. The average volume percentages of all models during the simulation using the plain mass-spring system, the distance constraint, and the volume constraint method were 68.21%, 89.64%, and 98.70%, respectively. The proposed approaches successfully improve the stability of the mass-spring system, and the performance comparison from our experimental tests also shows that the GPU-based method is faster than the CPU-based implementation in all cases.
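The abstract enforces distance and volume constraints with a first-order implicit scheme inside an OpenGL compute shader. As a loose, much simpler stand-in that only conveys the per-constraint parallelism, the CUDA sketch below projects per-edge distance constraints in a Jacobi fashion; the technique shown (position-based projection) is not the paper's method, and all names are assumptions.

```cuda
#include <cuda_runtime.h>

// Illustrative position-based distance-constraint projection (Jacobi style).
struct Edge { int i, j; float rest; };

__global__ void accumulateCorrections(const float3 *pos, const Edge *edges,
                                      float3 *corr, int *count, int nEdges,
                                      float stiffness) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nEdges) return;

    float3 pi = pos[edges[e].i], pj = pos[edges[e].j];
    float3 d  = make_float3(pj.x - pi.x, pj.y - pi.y, pj.z - pi.z);
    float len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z) + 1e-8f;
    float s   = 0.5f * stiffness * (len - edges[e].rest) / len;  // half correction per endpoint

    // Scatter the corrections; atomics resolve vertices shared by several edges.
    atomicAdd(&corr[edges[e].i].x,  s * d.x);
    atomicAdd(&corr[edges[e].i].y,  s * d.y);
    atomicAdd(&corr[edges[e].i].z,  s * d.z);
    atomicAdd(&corr[edges[e].j].x, -s * d.x);
    atomicAdd(&corr[edges[e].j].y, -s * d.y);
    atomicAdd(&corr[edges[e].j].z, -s * d.z);
    atomicAdd(&count[edges[e].i], 1);
    atomicAdd(&count[edges[e].j], 1);
}

__global__ void applyCorrections(float3 *pos, const float3 *corr,
                                 const int *count, int nVerts) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= nVerts || count[v] == 0) return;
    pos[v].x += corr[v].x / count[v];   // average the accumulated corrections
    pos[v].y += corr[v].y / count[v];
    pos[v].z += corr[v].z / count[v];
}
```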
High-performance computational models are required to make real-time or faster-than-real-time numerical predictions of adverse space weather events and their influence on the geospace environment. The main objective of this article is to explore the application of programmable graphics processing units (GPUs) to numerical space weather modeling for the study of the solar wind background, which is a crucial part of numerical space weather modeling. GPU programming is realized for our Solar-Interplanetary-CESE MHD model (SIP-CESE MHD model) by numerically studying the solar corona and interplanetary solar wind. The global solar wind structures are obtained by the established GPU model with magnetic field synoptic data as input. Meanwhile, time-dependent solar surface boundary conditions derived from the method of characteristics and the mass flux limit are incorporated to couple the observations and the three-dimensional (3D) MHD model. The simulated evolution of the global structures for Carrington rotations 2058 and 2062 is compared with solar observations and solar wind measurements from spacecraft near the Earth. The MHD model is also validated by comparison with the standard potential field source surface (PFSS) model. Comparisons show that the MHD results are in good overall agreement with coronal and interplanetary structures, including the size and distribution of coronal holes, the position and shape of the streamer belts, and the transitions of the solar wind speeds and magnetic field polarities.
Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high-performance solutions. The evolution of computer architectures (multi-core and many-core) towards higher numbers of cores only confirms that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high-performance computing (HPC) because of its low cost and massive parallel processing power. Super-computing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; there are several technical restrictions that must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive-parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, the Potts model, and cellular automata simulations. These examples represent well the kind of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive-parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems, and achieve speedups of up to two orders of magnitude compared to sequential implementations.
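Of the four example problems, the n-body simulation is the most direct illustration of data parallelism: each body's acceleration is computed independently from a read-only snapshot of all positions. A minimal CUDA kernel (illustrative only; a tuned version would tile positions through shared memory) looks like this:

```cuda
#include <cuda_runtime.h>

// Minimal all-pairs gravitational n-body step: one thread per body.
// This sketch only shows the data-parallel structure of the problem.
__global__ void computeAccelerations(const float4 *pos,  // xyz = position, w = mass
                                     float3 *acc, int n, float softening2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 ai = make_float3(0.0f, 0.0f, 0.0f);
    float4 pi = pos[i];
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float3 r  = make_float3(pj.x - pi.x, pj.y - pi.y, pj.z - pi.z);
        float dist2   = r.x * r.x + r.y * r.y + r.z * r.z + softening2;
        float invDist = rsqrtf(dist2);
        float s = pj.w * invDist * invDist * invDist;  // m_j / |r|^3
        ai.x += s * r.x; ai.y += s * r.y; ai.z += s * r.z;
    }
    acc[i] = ai;
}
```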
In this paper, stochastic global optimization algorithms, specifically the genetic algorithm and simulated annealing, are used for the problem of calibrating a dynamic option pricing model under stochastic volatility to market prices by adopting a hybrid programming approach. The performance of this dynamic option pricing model under the obtained optimal parameters is also discussed. To enhance model throughput and reduce latency, a heterogeneous hybrid programming approach on the GPU was adopted, which emphasized a data-parallel implementation of the dynamic option pricing model on a GPU-based system. Kernel offloading of the compute-intensive segments of the pricing algorithms to the GPU was done in OpenCL. The GPU approach was found to significantly reduce latency, running up to 541 times faster than a parallel implementation on the CPU and reducing the computation time from 46.24 minutes to 5.12 seconds.
To achieve real-time control of tokamak plasmas, the equilibrium reconstruction has to be completed sufficiently quickly. For an EAST tokamak experiment, real-time equilibrium reconstruction is generally required to provide results within 1 ms. A graphics processing unit (GPU) parallel Grad–Shafranov (G-S) solver is developed in the P-EFIT code, which is built with the CUDA architecture to take advantage of massively parallel GPU cores and significantly accelerate the computation. Optimization and implementation of numerical algorithms for a block tri-diagonal linear system are presented. The solver can complete a calculation within 16 μs for a 65×65 grid and within 27 μs for a 129×129 grid, so P-EFIT can meet the timing requirements of real-time plasma control with both grid sizes.
Conventionally, the multiple reference frame (MRF) method and the sliding mesh (SM) method are used in the simulation of stirred tanks; however, both methods have limitations. In this study, a hybrid immersed-boundary (IB) technique is developed in a finite difference context for the numerical simulation of stirred tanks. IBs based on Lagrangian markers and on solid volume fractions are used for moving and stationary boundaries, respectively, to achieve optimal efficiency and accuracy. To cope with the high computational cost of simulating stirred tanks, the technique is implemented on computers with a hybrid architecture in which central processing units (CPUs) and graphics processing units (GPUs) are used together. The accuracy and efficiency of the present technique are first demonstrated in a relatively simple case, and then the technique is applied to the simulation of turbulent flow in a Rushton stirred tank with large eddy simulation (LES). Finally, the proposed methodology is coupled with the discrete element method (DEM) to accomplish particle-resolved simulation of solid suspensions in small stirred tanks. This demonstrates that the proposed methodology is a promising tool for simulating turbulent flow in stirred tanks with complex geometries.
A personal desktop platform with a teraflops peak performance across thousands of cores is realized at the price of a conventional workstation by using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The implementation techniques of CUDA kernels, the double-layered thread hierarchy, and the use of the various memory hierarchies are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that speedups of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
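The "double-layered thread hierarchy" refers to CUDA's grid-of-blocks/block-of-threads organization, which maps naturally onto a 2-D structured mesh. The solver itself is written in CUDA Fortran; the CUDA C sketch below shows only the index mapping, with a placeholder point-wise update standing in for the actual flux computation (all names are assumptions).

```cuda
#include <cuda_runtime.h>

// Map a 2-D structured mesh onto CUDA's two-level thread hierarchy:
// each block covers a 16x16 tile of cells, each thread updates one cell.
__global__ void updateCell(const float *q, float *qNew, int ni, int nj, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= ni - 1 || j >= nj - 1) return;  // skip boundary cells

    int idx = j * ni + i;
    // Placeholder update: a Laplacian-like smoothing standing in for the
    // actual Euler/Navier-Stokes flux balance.
    float residual = q[idx - 1] + q[idx + 1] + q[idx - ni] + q[idx + ni] - 4.0f * q[idx];
    qNew[idx] = q[idx] + dt * residual;
}

void launchUpdate(const float *dQ, float *dQNew, int ni, int nj, float dt) {
    dim3 block(16, 16);                          // thread layer
    dim3 grid((ni + 15) / 16, (nj + 15) / 16);   // block layer
    updateCell<<<grid, block>>>(dQ, dQNew, ni, nj, dt);
}
```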
Die filling is a critical stage during powder compaction that can significantly affect product quality and efficiency. In this paper, a forced feeder is introduced in an attempt to improve the filling performance of a lab-scale die filling system. The die filling process is analysed with a graphics processing unit (GPU) enhanced discrete element method (DEM). Various stirrer designs are assessed for a wide range of process settings (i.e., stirrer speed and filling speed) to explore their influence on the die filling performance of a free-flowing powder. Numerical results show that die filling with the novel helical-ribbon (i.e., type D) stirrer design exhibits the highest filling ratio, implying that it is the most robust stirrer design for the feeder configuration considered. Furthermore, the die filling performance with the type D stirrer design is a function of the stirrer speed and the filling speed. A positive variation of the filling ratio (ηf > 0%) can be ensured over the whole range of filling speeds by adjusting (i.e., increasing) the stirrer speed. The approach used in this study can not only help in understanding how the stirrer design affects the die filling performance but also guide the optimization of the feeder system and process settings.
The lattice Boltzmann method (LBM) can gain a great amount of performance benefit by taking advantage of graphics processing unit (GPU) computing, and thus GPU- or multi-GPU-based LBM can be considered a promising and competent candidate in the study of large-scale fluid flows. However, multi-GPU-based lattice Boltzmann algorithms have not been studied extensively, especially for simulations of flow in complex geometries. In this paper, through coupling with the message passing interface (MPI) technique, we present an implementation of multi-GPU-based LBM for fluid flow through porous media, as well as some optimization strategies based on the data structure and layout, which can appreciably reduce memory access and completely hide the communication time. The performance of the algorithm is then tested on a one-node cluster equipped with four Tesla C1060 GPU cards, where up to 1732 MFLUPS is achieved for Poiseuille flow and a nearly linear speedup with the number of GPUs is also observed.
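A typical data-layout optimization of the kind mentioned above is storing the particle distribution functions in a structure-of-arrays layout so that the threads of a warp read consecutive addresses. The sketch below (D2Q9 lattice, simplified zero-velocity equilibrium, assumed names) illustrates the layout rather than the paper's exact kernels.

```cuda
#include <cuda_runtime.h>

#define Q 9  // D2Q9 lattice: nine discrete velocities per node

// Structure-of-arrays layout: f[k * nNodes + node] keeps the k-th distribution
// of all nodes contiguous, so a warp reading one k direction is fully coalesced.
__global__ void collideBGK(float *f, const int *isSolid, int nNodes, float omega) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nNodes || isSolid[n]) return;      // skip solid nodes of the porous medium

    // D2Q9 weights and a simplified BGK relaxation toward a zero-velocity
    // equilibrium (the velocity-dependent terms are omitted for brevity).
    const float w[Q] = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                        1.f/36, 1.f/36, 1.f/36, 1.f/36};
    float rho = 0.0f;
    for (int k = 0; k < Q; ++k) rho += f[k * nNodes + n];

    for (int k = 0; k < Q; ++k) {
        float feq = w[k] * rho;
        float fk  = f[k * nNodes + n];
        f[k * nNodes + n] = fk - omega * (fk - feq);
    }
}
```

The index expression f[k * nNodes + n] is what makes the accesses coalesced: for a fixed direction k, consecutive threads n touch consecutive memory words.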
A moisture advection scheme is an essential module of a numerical weather/climate model, representing the horizontal transport of water vapor. The Piecewise Rational Method (PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System (GRAPES) solves the moisture flux advection equation based on the PRM. Computation of the scalar advection involves boundary exchange, and its high bandwidth requirements make it complicated and time-consuming in GRAPES. Recently, Graphics Processing Units (GPUs) have been widely used to solve scientific and engineering computing problems owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator (OpenACC). Herein, we present a PRM scalar advection scheme accelerated with the Message Passing Interface (MPI) and OpenACC to fully exploit the power of GPUs over a cluster with multiple Central Processing Units (CPUs) and GPUs, together with optimizations such as minimizing data transfer, memory coalescing, exposing more parallelism, and overlapping computation with data transfers. Results show that a speedup of about 3.5 times is obtained for the entire model running at medium resolution with double precision, comparing the scheme's elapsed time on a node with two GPUs (NVIDIA P100) against two 16-core CPUs (Intel Gold 6142). Further, results obtained from experiments with a higher-resolution model on multiple GPUs show excellent scalability.
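The overlap of computation with data transfers is achieved in the paper with OpenACC directives; the same pattern can be written explicitly with CUDA streams, as in the hedged sketch below, where the copy of one chunk proceeds while the kernel for another chunk runs. All function and variable names are illustrative, and the host buffer is assumed to be pinned so the copies are truly asynchronous.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the PRM advection update of one chunk.
__global__ void advectChunk(float *q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] *= 0.5f;  // placeholder work
}

// Pipeline chunks across two CUDA streams so copies and kernels overlap.
// hostQ is assumed to be allocated with cudaMallocHost (pinned memory).
void advectPipelined(float *hostQ, float *devQ, int nTotal, int chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int off = 0, c = 0; off < nTotal; off += chunk, c ^= 1) {
        int n = (nTotal - off < chunk) ? (nTotal - off) : chunk;
        // Asynchronous copy of this chunk on stream c ...
        cudaMemcpyAsync(devQ + off, hostQ + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        // ... followed by its kernel on the same stream, so the other stream's
        // copy of the next chunk can proceed concurrently with this kernel.
        advectChunk<<<(n + 255) / 256, 256, 0, s[c]>>>(devQ + off, n);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```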
The Moving Particle Semi-implicit (MPS) method performs well in simulating violent free-surface flow and hence has become popular in the area of fluid flow simulation. However, the implementations of searching for neighbouring particles and solving the large sparse matrix equations (Poisson-type equation) are very time-consuming. In order to utilize the tremendous parallel computation power of Graphics Processing Units (GPUs), this study has developed a GPU-based MPS model employing the Compute Unified Device Architecture (CUDA) on an NVIDIA GTX 280. The efficient neighbouring-particle search is done through an indirect method, and the Poisson-type pressure equation is solved by the Bi-Conjugate Gradient (BiCG) method. Four different optimization levels of the present general parallel GPU-based MPS model are demonstrated. In addition, the detailed optimization of the GPU code is also discussed. A benchmark problem of dam-breaking flow is simulated using both the present GPU-based MPS code and the original CPU-based MPS code. The comparisons between them show that the GPU-based MPS model outperforms the traditional CPU model by a factor of 26.
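An "indirect" neighbour search places particles in a uniform background grid so that each particle only examines its own cell and the adjacent cells instead of every other particle. Below is a hedged 2-D sketch of the query step, assuming the particles have already been sorted by cell and cellStart/cellEnd arrays have been built (names and the 2-D setting are assumptions).

```cuda
#include <cuda_runtime.h>

// Count neighbours within radius h using a uniform background grid (2-D).
// Assumes particles are sorted by cell id and cellStart/cellEnd index that ordering.
__global__ void countNeighbours(const float2 *pos, const int *cellStart,
                                const int *cellEnd, int *nNeighbours,
                                int nParticles, int nCellsX, int nCellsY,
                                float h, float cellSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    float2 pi = pos[i];
    int cx = (int)(pi.x / cellSize);
    int cy = (int)(pi.y / cellSize);
    int count = 0;

    // Visit only the 3x3 block of cells around the particle's own cell.
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= nCellsX || ny >= nCellsY) continue;
            int c = ny * nCellsX + nx;
            for (int j = cellStart[c]; j < cellEnd[c]; ++j) {
                float2 d = make_float2(pos[j].x - pi.x, pos[j].y - pi.y);
                if (j != i && d.x * d.x + d.y * d.y < h * h) ++count;
            }
        }
    nNeighbours[i] = count;
}
```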
The geometric multigrid method (GMG) is one of the most efficient solution techniques for discrete algebraic systems arising from elliptic partial differential equations. GMG utilizes a hierarchy of grids or discretizations and reduces the error at a number of frequencies simultaneously. Graphics processing units (GPUs) have recently burst onto the scientific computing scene as a technology that has yielded substantial performance and energy-efficiency improvements. A central challenge in implementing GMG on GPUs, though, is that computational work on coarse levels cannot fully utilize the capacity of a GPU. In this work, we perform numerical studies of GMG on CPU–GPU heterogeneous computers. Furthermore, we compare our implementation with an efficient CPU implementation of GMG and with the most popular fast Poisson solver, the Fast Fourier Transform, in the cuFFT library developed by NVIDIA.
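The coarse-level underutilization mentioned above is easy to see from a typical smoothing kernel: the number of active threads equals the number of grid points, which shrinks by a factor of four (in 2-D) at every coarsening step, so the lowest levels launch far too few threads to occupy the device. Below is a hedged weighted-Jacobi smoother for the 2-D five-point Poisson stencil (names assumed).

```cuda
#include <cuda_runtime.h>

// One weighted-Jacobi sweep for the 5-point Poisson stencil on an n x n grid.
// On fine levels this launches millions of threads; on coarse levels only a
// few hundred, which is why coarse-grid work underutilizes the GPU.
__global__ void jacobiSweep(const float *u, float *uNew, const float *f,
                            int n, float h2, float omega) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= n - 1 || j >= n - 1) return;

    int idx = j * n + i;
    float gs = 0.25f * (u[idx - 1] + u[idx + 1] + u[idx - n] + u[idx + n] + h2 * f[idx]);
    uNew[idx] = (1.0f - omega) * u[idx] + omega * gs;  // weighted Jacobi update
}
```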
Heterogeneous systems with both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are frequently used to accelerate short-ranged Molecular Dynamics (MD) simulations. The most time-consuming task in short-ranged MD simulations is the computation of particle-to-particle interactions. Beyond a certain distance, these interactions decrease to zero. To minimize the distance-examination operations, previous works have tiled interactions by employing the spatial attribute, which increases memory access and GPU computations, hence decreasing performance. Other studies ignore the spatial attribute and construct an all-versus-all interaction matrix, which has poor scalability. This paper presents an improved algorithm. The algorithm first bins particles into voxels according to their spatial attributes, and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrices. Only the sub-matrices between neighboring voxels are computed on the GPU. Therefore, the algorithm reduces the distance-examination operations and limits additional memory access and GPU computations. This paper also adopts a multi-level programming model to implement the algorithm on multiple nodes of Tianhe-1A. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication-overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload-balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16x on one NVIDIA Tesla M2050 GPU compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41x faster on 256 compute nodes of Tianhe-1A (with two CPUs and one GPU per node) than on 256 GPU-excluded nodes.
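The binning step that precedes the voxel-versus-voxel tiling can be sketched as a counting pass plus a scatter pass, as below. The fixed voxel grid, the two-pass construction, and all names are assumptions for illustration rather than the paper's implementation.

```cuda
#include <cuda_runtime.h>

// Pass 1: compute each particle's voxel id and count voxel occupancy.
__global__ void countPerVoxel(const float4 *pos, int *voxelOf, int *voxelCount,
                              int nParticles, float voxelSize,
                              int nvx, int nvy, int nvz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int vx = (int)(pos[i].x / voxelSize);
    int vy = (int)(pos[i].y / voxelSize);
    int vz = (int)(pos[i].z / voxelSize);
    int v  = (vz * nvy + vy) * nvx + vx;
    voxelOf[i] = v;
    atomicAdd(&voxelCount[v], 1);
}

// Pass 2 (after an exclusive prefix sum of voxelCount): scatter particle
// indices into their voxel's slot range. voxelSlot is a working copy of the
// prefix-summed offsets, because atomicAdd consumes it.
__global__ void scatterToVoxels(const int *voxelOf, int *voxelSlot,
                                int *sortedIdx, int nParticles) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;
    int slot = atomicAdd(&voxelSlot[voxelOf[i]], 1);  // claim the next free slot
    sortedIdx[slot] = i;
}
```

Forces are then evaluated only for voxel pairs that are spatial neighbors, i.e., only the near-diagonal tiles of the all-versus-all matrix are ever formed.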
Google PageRank is a prevalent algorithm for ranking the significance of nodes or websites in a network, and a quantum counterpart of the PageRank algorithm has recently been proposed, suggesting a higher ranking accuracy than Google PageRank. The quantum PageRank algorithm is essentially based on quantum stochastic walks and can be expressed using the Lindblad master equation, which, however, requires solving Kronecker products of dimension O(N^4) and demands prohibitively large memory and time when the number of nodes N in a network increases beyond 150. Here, we present an efficient solver for quantum PageRank that uses the Runge-Kutta method to reduce the matrix dimension to O(N^2) and employs TensorFlow to conduct GPU parallel computing. We demonstrate its performance in solving quantum stochastic walks on Erdős–Rényi graphs using an RTX 2060 GPU. The test on a graph of 6000 nodes requires 5.5 GB of memory and 223 s of time, and that on a graph of 1000 nodes requires 226 MB and 3.6 s. Compared with QSWalk, a currently prevalent Mathematica solver, our solver for the same graph of 1000 nodes reduces the required memory and time to only 0.2% and 0.05%, respectively. We apply the solver to quantum PageRank for the USA major airline network with up to 922 nodes and to a quantum stochastic walk on a glued tree of 2186 nodes. This efficient solver for large-scale quantum PageRank and quantum stochastic walks will greatly facilitate studies of quantum information in real-life applications.
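The memory saving comes from never forming the O(N^4)-sized superoperator: each Runge-Kutta stage applies the Lindblad right-hand side to the N×N density matrix directly with ordinary matrix products, and the stages are then combined element-wise. The paper's solver does this with TensorFlow; the fragment below sketches only the element-wise combination step in CUDA, with all names assumed.

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>

// Combine the four Runge-Kutta stage derivatives of the vectorized density
// matrix rho (length N*N): rho += dt/6 * (k1 + 2*k2 + 2*k3 + k4).
// Each stage k_i is the Lindblad right-hand side evaluated with ordinary
// N x N matrix products (e.g., via cuBLAS), never the N^2 x N^2 superoperator.
__global__ void rk4Combine(cuFloatComplex *rho,
                           const cuFloatComplex *k1, const cuFloatComplex *k2,
                           const cuFloatComplex *k3, const cuFloatComplex *k4,
                           float dt, int nElems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nElems) return;
    float c = dt / 6.0f;
    cuFloatComplex s = cuCaddf(cuCaddf(k1[i], k4[i]),
                               cuCmulf(make_cuFloatComplex(2.0f, 0.0f),
                                       cuCaddf(k2[i], k3[i])));
    rho[i] = cuCaddf(rho[i], cuCmulf(make_cuFloatComplex(c, 0.0f), s));
}
```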