Abstract: The present research attempted a Large-Eddy Simulation (LES) of airflow over a steep, three-dimensional isolated hill using the latest multi-core, multi-CPU systems. It was found that 1) turbulence simulations with approximately 50 million grid points are feasible, and 2) this system achieved a computation speed exceeding that of parallel computation on a single CPU of one of the latest supercomputers. LES was also conducted on multi-GPU systems. These simulations revealed that 1) a multi-GPU environment using NVIDIA® Tesla M2090 or M2075 cards could simulate turbulence in a model with approximately 50 million grid points, and 2) the computation speed achieved by the multi-GPU environments exceeded that of parallel computation using four to six CPUs of one of the latest supercomputers.
Funding: Supported by the National Key R&D Program of China (2017YFC0602204-01) and NSFC (Grant Nos. 41530321 and 41104083).
Abstract: Reverse time migration (RTM) is an indispensable but computationally intensive seismic exploration technique. Graphics processing units (GPUs) by NVIDIA® offer an option for parallel computation and speed improvements in such high-density processes. With increasing seismic imaging space, the problems associated with multi-GPU techniques need to be addressed. We propose an efficient scheme for multi-GPU programming based on features of the Compute Unified Device Architecture (CUDA) and GPU hardware, including concurrent kernel execution, CUDA streams, and peer-to-peer (P2P) communication between different GPUs. In addition, by adjusting the computing time for imaging during RTM, the data communication times between GPUs become negligible, so the overall computational efficiency improves linearly as the number of GPUs increases. We introduce the multi-GPU scheme using acoustic wave propagation and then describe the implementation of RTM in tilted transversely isotropic (TTI) media. Next, we compare the multi-GPU and unified memory schemes. The results suggest that the proposed multi-GPU scheme is superior and that, with an increasing number of GPUs, the computational efficiency improves linearly.
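The abstract names the three CUDA ingredients of the scheme: concurrent kernel execution, CUDA streams, and P2P communication between GPUs. The sketch below is a minimal, illustrative use of those runtime calls on a toy 1-D field split across two GPUs; the kernel, grid size, and halo width are placeholder assumptions, not the authors' RTM code.

```cuda
// Minimal multi-GPU sketch: two GPUs each advance half of a toy 1-D field
// concurrently on their own CUDA streams, then exchange ghost cells via
// peer-to-peer (P2P) device-to-device copies. Illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

#define N    1024   // interior points per GPU (toy size, assumed)
#define HALO 4      // ghost-cell width exchanged between neighbouring GPUs

// Toy "propagation" kernel standing in for one RTM time step.
__global__ void step(float* f, int total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= HALO && i < total - HALO)
        f[i] = 0.5f * (f[i - 1] + f[i + 1]);
}

int main() {
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu < 2) { printf("needs at least 2 GPUs\n"); return 0; }

    const int total = N + 2 * HALO;          // interior + two ghost regions
    float* d[2];
    cudaStream_t s[2];
    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&d[g], total * sizeof(float));
        cudaMemset(d[g], 0, total * sizeof(float));
        cudaStreamCreate(&s[g]);
        int peer = 1 - g, can = 0;           // enable direct P2P access if possible
        cudaDeviceCanAccessPeer(&can, g, peer);
        if (can) cudaDeviceEnablePeerAccess(peer, 0);
    }

    for (int t = 0; t < 100; ++t) {
        // One time step runs concurrently on both GPUs, one stream each.
        for (int g = 0; g < 2; ++g) {
            cudaSetDevice(g);
            step<<<(total + 255) / 256, 256, 0, s[g]>>>(d[g], total);
        }
        for (int g = 0; g < 2; ++g) { cudaSetDevice(g); cudaStreamSynchronize(s[g]); }

        // Halo exchange: each GPU's boundary interior points fill the
        // neighbour's ghost cells via direct device-to-device (P2P) copies.
        cudaSetDevice(0);
        cudaMemcpyPeerAsync(d[1], 1, d[0] + N, 0, HALO * sizeof(float), s[0]);
        cudaSetDevice(1);
        cudaMemcpyPeerAsync(d[0] + N + HALO, 0, d[1] + HALO, 1, HALO * sizeof(float), s[1]);
        for (int g = 0; g < 2; ++g) { cudaSetDevice(g); cudaStreamSynchronize(s[g]); }
    }

    for (int g = 0; g < 2; ++g) {
        cudaSetDevice(g);
        cudaStreamDestroy(s[g]);
        cudaFree(d[g]);
    }
    printf("done\n");
    return 0;
}
```

Compiled with nvcc, this requires a machine with at least two GPUs; where peer access cannot be enabled, cudaMemcpyPeerAsync still works but stages the transfer through host memory instead of going device-to-device.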
Funding: Supported in part by the National Natural Science Foundation of China under Grant Nos. 62025208, U21A20473, U21A20513, 62076154, and 62302512, and by the State Administration of Science, Technology, and Industry for National Defense of China under Grant No. WDZC20235250118.
Abstract: Deep learning has become the cornerstone of artificial intelligence, playing an increasingly important role in production and daily life. However, as the complexity of the problems being solved increases, deep learning models become increasingly intricate, resulting in a proliferation of large language models with an astonishing number of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to the significant challenge of training "big models". This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP, comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches, and discusses the main techniques for achieving load balance in both intra-node and inter-node training. Furthermore, the main techniques for optimizing computation, storage, and communication are presented, and potential research directions are discussed.
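For context, the sketch below enumerates the forward timetable of a synchronous (GPipe-style) pipeline schedule, one of the schedule families such a survey compares; the stage count, micro-batch count, and one-stage-per-GPU mapping are illustrative assumptions, not values taken from the paper.

```cuda
// Minimal sketch of the forward phase of a synchronous (GPipe-style)
// pipeline schedule. Stage count S, micro-batch count M, and the
// one-stage-per-GPU mapping are illustrative assumptions.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int S = 4;   // pipeline stages (model partitions), assumed
    const int M = 8;   // micro-batches per mini-batch, assumed

    // Assign one stage per GPU where possible (round-robin otherwise).
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu < 1) ngpu = 1;
    for (int s = 0; s < S; ++s)
        printf("stage %d -> GPU %d\n", s, s % ngpu);

    // Forward timetable: at clock t, stage s processes micro-batch m = t - s
    // if that micro-batch exists; otherwise the stage is idle (a "bubble").
    const int clocks = M + S - 1;
    for (int t = 0; t < clocks; ++t) {
        printf("clock %2d:", t);
        for (int s = 0; s < S; ++s) {
            int m = t - s;
            if (m >= 0 && m < M) printf("  F%-2d", m);
            else                 printf("  .  ");   // idle slot (pipeline bubble)
        }
        printf("\n");
    }

    // Of the S*(M+S-1) slots printed above, S*M are busy, so the idle
    // ("bubble") fraction of this timetable is (S-1)/(M+S-1).
    printf("bubble fraction = %.2f\n", (double)(S - 1) / (M + S - 1));
    return 0;
}
```

Synchronous schedules of this kind avoid weight staleness but pay the bubble cost above; asynchronous schedules of the sort compared in such reviews trade some update consistency (or extra weight copies) for fewer idle slots.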
Abstract: The parallel implementation of MUPHY, a concurrent multiscale code for large-scale hemodynamic simulations in anatomically realistic geometries, for multi-GPU platforms is presented. Performance tests show excellent results, with a nearly linear parallel speed-up on up to 32 GPUs and a more than tenfold GPU/CPU acceleration across the range of GPUs. The basic MUPHY scheme combines a hydrokinetic (Lattice Boltzmann) representation of the blood plasma with a Particle Dynamics treatment of suspended biological bodies, such as red blood cells. To the best of our knowledge, this represents the first effort toward laying down general design principles for multiscale/multiphysics parallel Particle Dynamics applications in non-ideal geometries. This makes the present multi-GPU version of MUPHY one of the first examples of a high-performance parallel code for multiscale/multiphysics biofluidic applications in realistically complex geometries.