The network switches in the data plane of Software Defined Networking (SDN) rely on an elementary process in which an enormous number of packets, amounting to large volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates data processing: instead of processing individual packets repeatedly, the corresponding actions are performed on the corresponding flows of packets. In this paper, we first address the limitations of a typical packet classification algorithm, Tuple Space Search (TSS). We then present a set of scenarios for parallelizing it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, with an average throughput of 4.2 million packets per second (Mpps). That is, the hybrid cluster, built by integrating the Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and OpenMP programming models, classified 0.24 million more packets per second than the GPU cluster scheme. Such a packet classifier satisfies the processing speed required in programmable network systems that would be used to communicate big medical data.
The Slingshot interconnect designed by HPE/Cray is becoming more relevant in high-performance computing with its deployment on the upcoming exascale systems. In particular, it is the interconnect powering Frontier, the first exascale and highest-ranked supercomputer in the world. It offers various features such as adaptive routing, congestion control, and isolated workloads. The deployment of newer interconnects sparks interest in performance, scalability, and potential bottlenecks, as these are critical to scalability across nodes on such systems. In this paper, we delve into the challenges the Slingshot interconnect poses for current state-of-the-art Message Passing Interface (MPI) libraries. In particular, we look at scalability when using Slingshot across nodes. We present a comprehensive evaluation using various MPI and communication libraries, including Cray MPICH, Open-MPI + UCX, RCCL, and MVAPICH2, on CPUs and GPUs on the Spock system, an early-access cluster deployed with Slingshot-10, AMD MI100 GPUs, and AMD Epyc Rome CPUs to emulate the Frontier system. We also evaluate preliminary CPU-based support of MPI libraries on the Slingshot-11 interconnect.
A moisture advection scheme is an essential module of a numerical weather/climate model, representing the horizontal transport of water vapor. The Piecewise Rational Method (PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System (GRAPES) solves the moisture flux advection equation based on PRM. Computation of the scalar advection involves boundary exchange and has high bandwidth requirements, which makes it complicated and time-consuming in GRAPES. Recently, Graphics Processing Units (GPUs) have been widely used to solve scientific and engineering computing problems owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator (OpenACC). Herein, we present an accelerated PRM scalar advection scheme with Message Passing Interface (MPI) and OpenACC to fully exploit the power of GPUs on a cluster with multiple Central Processing Units (CPUs) and GPUs, together with optimizations such as minimizing data transfer, memory coalescing, exposing more parallelism, and overlapping computation with data transfers. Results show that a speedup of about 3.5 times is obtained for the entire model running at medium resolution with double precision, when comparing the scheme's elapsed time on a node with two GPUs (NVIDIA P100) and two 16-core CPUs (Intel Gold 6142). Further, experiments with a higher-resolution model on multiple GPUs show excellent scalability.
A parallel computing algorithm for a nonhydrostatic model on one or multiple Graphics Processing Units (GPUs) for the simulation of internal solitary waves is presented and discussed. The computational efficiency of the GPU scheme is analyzed through a series of numerical experiments, including an ideal case and field-scale simulations, performed on a workstation and a supercomputer system. The results show that the speedup of the GPU-based parallel computing scheme, relative to the implementation on a single CPU core, increases with the number of computational grid cells, and that for problems with a relatively large number of grid cells the speedup increases quasi-linearly with the number of GPUs, for up to 32 GPUs.
MODerate resolution atmospheric TRANsmission (MODTRAN) is a commercial remote sensing (RS) software package that has been widely used to simulate the radiative transfer of electromagnetic radiation through the Earth's atmosphere and the radiation observed by a remote sensor. However, when very large RS datasets must be processed in simulation applications at a global scale, it is extremely time-consuming to run MODTRAN on a modern workstation. In these circumstances, parallel cluster computing becomes vital for speeding up the task. This paper presents PMODTRAN, an implementation of a parallel task-scheduling algorithm based on MODTRAN. PMODTRAN reduced the processing time of the test cases used here from over 4.4 months on a workstation to less than a week on a local computer cluster. In addition, PMODTRAN can distribute tasks at different levels of granularity and has some extra features, such as dynamic load balancing and parameter checking.
This paper proposes a new approach for implementing fast multicast on multistage interconnection networks (MINs) with multi-head worms. For an MIN with n stages of k×k switches, a single multi-head worm can cover an arbitrary set of destinations with a single communication start-up. Compared with schemes using unicast messages, this approach reduces multicast latency significantly, and it performs better than multi-destination worms.
A wide variety of algorithms have been developed to monitor aerosol burden from satellite images. Still, few solutions currently allow for real-time and efficient retrieval of aerosol optical thickness (AOT), mainly because of the extremely large volume of computation needed to solve the atmospheric radiative transfer equations numerically. Building on the SYNergy of Terra and Aqua MODIS (SYNTAM) AOT retrieval algorithm, we present in this paper a novel method to retrieve AOT from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite images. The method uses block partitioning and collective communication to balance the load and to reduce the overhead of inter-processor communication. Experiments were carried out to retrieve AOT at 0.44, 0.55, and 0.67 μm from MODIS/Terra and MODIS/Aqua data, using the parallel SYNTAM algorithm on the IBM System Cluster 1600 deployed at the China Meteorological Administration (CMA). Results showed that the parallel implementation greatly reduces computation time while maintaining high parallel efficiency. AOT derived by the parallel algorithm was validated against measurements from ground-based sun-photometers; in all cases, the relative error was within 20%, which demonstrated that the parallel algorithm is suitable for applications such as air quality monitoring and climate modeling.
Funding: Financially supported by the National Natural Science Foundation of China (Grant Nos. 12072217 and 42077254) and the Natural Science Foundation of Hunan Province, China (Grant No. 2022JJ30567).
Abstract: The high-resolution DEM-IMB-LBM model can accurately describe pore-scale fluid-solid interactions, but its potential for use in geotechnical engineering analysis has not been fully realised because of its prohibitive computational cost. To overcome this limitation, a message passing interface (MPI) parallel DEM-IMB-LBM framework is proposed to enhance computational efficiency. The framework uses a static domain decomposition scheme, in which the entire computation domain is decomposed into multiple subdomains according to the predefined processors. A detailed parallel strategy is employed for both contact detection and hydrodynamic force calculation. In particular, a particle ID re-numbering scheme is proposed to handle particle transitions across subdomain interfaces. Two benchmarks are conducted to validate the accuracy and overall performance of the proposed framework. Subsequently, the framework is applied to simulate multi-particle sedimentation and submarine landslides. The numerical examples demonstrate the robustness and applicability of the MPI parallel DEM-IMB-LBM framework.
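As a rough illustration of static domain decomposition (not the authors' implementation), the sketch below splits a one-dimensional range of lattice cells into contiguous subdomains, one per processor, and looks up which subdomain owns a given cell. The function names `decompose` and `owner_of` and the 1D layout are assumptions made for illustration; the framework described in the abstract may decompose in two or three dimensions and handles interface crossings with its own re-numbering scheme.

```python
def decompose(n_cells, n_procs):
    """Split cells 0..n_cells-1 into n_procs contiguous half-open ranges.

    The first (n_cells % n_procs) subdomains get one extra cell, so sizes
    differ by at most one. The split is static: it is computed once and
    never rebalanced during the run.
    """
    if n_procs <= 0 or n_cells < n_procs:
        raise ValueError("need at least one cell per processor")
    base, extra = divmod(n_cells, n_procs)
    bounds = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def owner_of(cell, bounds):
    """Return the rank whose subdomain contains the given cell index."""
    for rank, (lo, hi) in enumerate(bounds):
        if lo <= cell < hi:
            return rank
    raise ValueError("cell lies outside the computation domain")
```

For example, `decompose(10, 3)` gives the ranges `(0, 4)`, `(4, 7)`, and `(7, 10)`. A particle whose centre moves from cell 3 to cell 4 changes owner from rank 0 to rank 1, which is the kind of interface crossing a parallel particle code has to handle.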
Funding: The Deanship of Scientific Research at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. RG-12-611-43.
Abstract: The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memory systems. However, MPI implementations can contain defects that impact the reliability and performance of parallel applications. Detecting and correcting these defects is crucial, yet there is a lack of published models specifically designed for correcting MPI defects. To address this, we propose a model for detecting and correcting MPI defects (DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blocking point-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defects addressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and message mismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a dataset consisting of 40 MPI codes. The model detected defects in 37 of the 40 codes, an overall detection accuracy of 92.5%. Additionally, the execution time of the DC_MPI model ranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correcting defects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. The DC_MPI model fills an important research gap and provides a valuable tool for improving the quality of MPI-based parallel computing systems.
Abstract: Simulation has become essential in designing and developing new casting products and in improving manufacturing processes within limited time, because it lets developers model the nature of the process and arrive at ideal casting designs. Many development groups around the world are making every effort to gain an early lead in the commercial simulation market. They have already reported success stories in manufacturing by developing and providing high-performance, multipurpose simulation technologies. However, these tools are mainly run by well-trained experts on powerful desk-side computers, so it is hard to spread the concept of scientific design to newcomers in the casting field. To overcome the resulting problems in scientific casting design, we used information technologies and mature hardware infrastructure to promote an effective, scientific approach to casting design, and integrated them into a Simulation Portal on the web. The portal offers scientific casting design over the Internet, with ubiquitous access embodied in the "Anyone, Anytime, Anywhere" concept.
Funding: Project supported by the National Defense Laboratory Foundation (Grant No. 51444020103QT0601) and the Shanghai Leading Academic Discipline Project (Grant No. T0102).
Abstract: In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finite-difference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a message passing interface (MPI) programming environment. The algorithm is implemented on a cluster-based high performance computer system. Parallel computation is performed with different division methods in 2D and 3D situations. Based on an analysis of the main factors affecting the speedup rate and parallel efficiency, data communication is reduced by selecting a suitable task division scheme. A preferred scheme is recommended, giving a higher speedup rate and better efficiency. The results indicate that the unified parallel FDTD algorithm provides a solution to the numerical computation of acoustic scattering.
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 61701057) and the Chongqing Research Program of Basic Research and Frontier Technology, China (Grant No. cstc2017jcyjAX0345).
Abstract: We present a time-domain hybrid method for the fast coupling analysis of transmission lines excited by space electromagnetic fields, in which the parallel finite-difference time-domain (FDTD) method, an interpolation scheme, and Agrawal model-based transmission line (TL) equations are integrated. Specifically, the Agrawal model is employed to establish the TL equations that describe the coupling effects of space electromagnetic fields on transmission lines. Then, the excitation fields, which act as distributed sources in the TL equations, are calculated by the parallel FDTD method using the message passing interface (MPI) library and the interpolation scheme. Finally, the TL equations are discretized by the central difference scheme of FDTD and assigned to multiple processors to obtain the transient responses on the terminal loads of these lines. The significant feature of the presented method is its parallel and synchronous calculation of the space electromagnetic fields and the transient responses on the lines. Numerical simulations of an ambient wave acting on multi-conductor transmission lines (MTLs), located on a PEC ground and in a shielded cavity respectively, are carried out to verify the accuracy and efficiency of the presented method.
Funding: Supported by the National Defence Scientific Research Foundation.
Abstract: For data association in multisensor and multitarget tracking, a novel parallel algorithm is developed to improve the efficiency and real-time performance of the FGAs-based algorithm. A Cluster of Workstations (COW) with Message Passing Interface (MPI) is built, and the proposed Multi-Deme Parallel FGA (MDPFGA) is run on this platform. A series of special MDPFGAs is used to determine the static and the dynamic solutions of the generalized m-best S-D assignment problem respectively, as well as to estimate target states in track management. This assignment-based parallel algorithm is demonstrated on a simulated passive-sensor track formation and maintenance problem. While illustrating the feasibility of the proposed algorithm in multisensor multitarget tracking, the simulation results indicate that the MDPFGAs-based algorithm is more efficient and faster than the FGAs-based algorithm.
Abstract: To date, a great deal of casting analysis software has been developed to bring simulation closer to real casting processes. This includes melt flow analysis, heat transfer analysis for solidification calculation, mechanical property prediction, and microstructure prediction. These efforts have produced results that agree well with real situations, so CAE technologies have become indispensable for designing and developing new casting processes. In manufacturing, however, CAE technologies are not used as often, because the software is difficult to use or computing performance is insufficient. To introduce CAE technologies to manufacturing, high-performance analysis is essential to shorten the gap between product design time and prototyping time. Code optimization can help, but it is not enough, because codes developed by software experts are already well optimized. As an alternative route to high-performance computation, parallel computing technologies are being actively applied to CAE to shorten analysis time. In this research, SMP (Shared Memory Processing) and MPI (Message Passing Interface) parallelization methods were applied to the commercial software "Z-Cast" to calculate casting processes. During parallelization, network stabilization and core optimization were also carried out on the Microsoft Windows platform, and the performance and results were compared with those of the conventional sequential analysis codes.
Funding: Supported by the National Basic Research Program of China (No. 2010CB832706) and the State Key Laboratory of Explosion Science and Technology (No. ZDKT10-03b).
Abstract: Explosion and shock often involve large deformation, the treatment of interfaces between multiple materials, and strong discontinuities. The Eulerian method has advantages for solving these problems. In parallel computation with the Eulerian method, the physical quantities of the computational cells do not change before the disturbance reaches those cells, so computational efficiency is low with a fixed partition because of load imbalance. To solve this problem, a dynamic parallel method is used in which the computation domain expands with the disturbance. The dynamic parallel program is designed on the widely used message passing interface model. The numerical results of the dynamic parallel program agree well with those of the original parallel program and with the actual situation.
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos. 11674139, 11834005, and 11904145) and the Program for Changjiang Scholars and Innovative Research Team in Universities, China (Grant No. IRT-16R35).
Abstract: We propose an improved real-space parallel strategy for the density matrix renormalization group (DMRG) method, in which the boundaries of separate regions are adaptively distributed during DMRG sweeps. Compared with the original real-space parallel DMRG with fixed boundaries, our scheme greatly improves the parallel efficiency by shortening the waiting time between two adjacent tasks. We implement the new strategy with the message passing interface (MPI) and dynamically control the number of kept states according to the truncation error in each DMRG step. We study the performance of the new parallel strategy by calculating the ground state of a spin-cluster chain and a quantum chemical Hamiltonian of the water molecule. The maximum parallel efficiencies for these two models are 91% and 76% on 4 nodes, much higher than those of the real-space parallel DMRG with fixed boundaries.
Funding: The National High Technology Research and Development Program (863 Program) of China under contract No. 2013AA09A505.
Abstract: A parallel algorithm for a circulation numerical model based on the message passing interface (MPI) is developed using serialization and an irregular rectangle decomposition scheme. A neighboring point exchange strategy (NPES) is adopted to further enhance computational efficiency. Two experiments are conducted on an HP C7000 Blade System. The numerical results show that the parallel version with NPES (PVN) is more efficient than the original parallel version (PV). In the second experiment, the PVN achieves a parallel efficiency above 0.9 when the number of processors increases to 100, whereas the efficiency of the PV drops rapidly to 0.39. The PVN of the ocean circulation model is used in a fine-resolution regional simulation, which produces better results. Because the algorithm can be implemented generally, it is potentially applicable to many other ocean models.
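To put the efficiency figures above in context, the following is a minimal Python sketch that assumes the conventional definition of parallel efficiency, E = T_serial / (p × T_parallel). The abstract does not state which definition the authors used, and the function names are illustrative only.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel (conventional definition)."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Efficiency E = S / p, assuming the conventional definition."""
    return speedup(t_serial, t_parallel) / n_procs


def implied_speedup(efficiency, n_procs):
    """Invert E = S / p to recover the speedup implied by a reported efficiency."""
    return efficiency * n_procs


# Speedups implied by the efficiencies quoted in the abstract at 100 processors,
# under the assumed definition above.
print(round(implied_speedup(0.9, 100), 2))   # PVN
print(round(implied_speedup(0.39, 100), 2))  # PV
```

Under this assumption, an efficiency of 0.9 on 100 processors corresponds to a speedup of about 90, and an efficiency of 0.39 to a speedup of about 39.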
Abstract: Network switches in the data plane of Software Defined Networking (SDN) rely on an elementary process in which an enormous number of packets, amounting to large volumes of data, are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates data processing: instead of processing individual packets repeatedly, the corresponding actions are performed on the corresponding flows of packets. In this paper, we first address the limitations of a typical packet classification algorithm, Tuple Space Search (TSS). We then present a set of scenarios for parallelizing it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, with an average throughput of 4.2 million packets per second (Mpps). That is, the hybrid cluster, which integrates Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and the OpenMP programming model, could classify 0.24 million more packets per second than the GPU cluster scheme. Such a packet classifier satisfies the processing speed required in programmable network systems that would be used to communicate big medical data.
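As background for the algorithm named above, here is a minimal serial sketch of the general Tuple Space Search idea in Python. It is not the paper's implementation: the class name, the two-field (source and destination prefix) rule format, and the priority convention are all assumptions made for illustration. Rules are grouped by their tuple of prefix lengths, each tuple gets one hash table, and a lookup probes every table with the packet header masked to that tuple's lengths.

```python
import ipaddress


def mask(addr, length):
    """Keep the top `length` bits of a 32-bit IPv4 address given as an int."""
    if length == 0:
        return 0
    return addr & (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF


class TupleSpaceClassifier:
    """Hypothetical two-field classifier; higher priority wins."""

    def __init__(self):
        # (src_len, dst_len) -> {(src_key, dst_key): (priority, action)}
        self.tables = {}

    def add_rule(self, src_cidr, dst_cidr, priority, action):
        src = ipaddress.ip_network(src_cidr)
        dst = ipaddress.ip_network(dst_cidr)
        key = (int(src.network_address), int(dst.network_address))
        table = self.tables.setdefault((src.prefixlen, dst.prefixlen), {})
        # Keep only the highest-priority rule for an identical key.
        if key not in table or table[key][0] < priority:
            table[key] = (priority, action)

    def classify(self, src_ip, dst_ip):
        s = int(ipaddress.ip_address(src_ip))
        d = int(ipaddress.ip_address(dst_ip))
        best = None
        # One exact-match hash probe per tuple; the probes are independent.
        for (src_len, dst_len), table in self.tables.items():
            hit = table.get((mask(s, src_len), mask(d, dst_len)))
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return best[1] if best else None


clf = TupleSpaceClassifier()
clf.add_rule("10.0.0.0/8", "0.0.0.0/0", 1, "flow-A")
clf.add_rule("10.1.0.0/16", "192.168.0.0/16", 5, "flow-B")
print(clf.classify("10.1.2.3", "192.168.9.9"))
```

Because each tuple's probe is independent of the others, the loop in `classify` is one natural place where work could be spread across threads or devices. Whether the paper parallelizes over tuples, packets, or both is not stated in the abstract.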
Funding: Supported in part by the U.S. National Science Foundation under Grant Nos. 1818253, 1854828, 1931537, and 2007991, by XRAC under Grant No. NCR-130002, and by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Abstract: The Slingshot interconnect designed by HPE/Cray is becoming more relevant in high-performance computing with its deployment on the upcoming exascale systems. In particular, it is the interconnect powering Frontier, the first exascale and highest-ranked supercomputer in the world. It offers features such as adaptive routing, congestion control, and isolated workloads. The deployment of newer interconnects raises questions about performance, scalability, and potential bottlenecks, because interconnects are critical to scalability across nodes on these systems. In this paper, we delve into the challenges the Slingshot interconnect poses for current state-of-the-art MPI (message passing interface) libraries. In particular, we look at scalability when using Slingshot across nodes. We present a comprehensive evaluation using various MPI and communication libraries, including Cray MPICH, Open-MPI+UCX, RCCL, and MVAPICH2, on CPUs and GPUs on the Spock system, an early-access cluster deployed with Slingshot-10, AMD MI100 GPUs, and AMD Epyc Rome CPUs to emulate the Frontier system. We also evaluate preliminary CPU-based support of MPI libraries on the Slingshot-11 interconnect.
Funding: Supported by the decision support project of response to climate change of China, the National Natural Science Foundation of China (Nos. 41674085, 41604009, and 41621091), the Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034), and the Open Project of State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University (No. 2020-zz-03).
Abstract: A moisture advection scheme is an essential module of a numerical weather/climate model, representing the horizontal transport of water vapor. The Piecewise Rational Method (PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System (GRAPES) solves the moisture flux advection equation based on PRM. Computing the scalar advection involves boundary exchange, and its high bandwidth requirements make the computation complicated and time-consuming in GRAPES. Recently, Graphics Processing Units (GPUs) have been widely used to solve scientific and engineering computing problems, owing to advances in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator (OpenACC). Herein, we present an accelerated PRM scalar advection scheme that uses the Message Passing Interface (MPI) and OpenACC to fully exploit GPU power on a cluster with multiple Central Processing Units (CPUs) and GPUs, together with optimizations such as minimizing data transfer, memory coalescing, exposing more parallelism, and overlapping computation with data transfers. Results show a speedup of about 3.5 times for the entire model running at medium resolution with double precision, comparing the scheme's elapsed time on a node with two GPUs (NVIDIA P100) and two 16-core CPUs (Intel Gold 6142). Further, experiments with a higher-resolution model on multiple GPUs show excellent scalability.
Funding: Supported by the Natural Science Foundation of Tianjin, China (Grant No. 12JCZDJC30200), the National Natural Science Foundation of China (Grant No. 51021004), and the Fundamental Research Fund for the Central Nonprofit Research Institutes of China (Grant No. TKS100206).
Abstract: A parallel computing algorithm for a nonhydrostatic model on one or multiple Graphics Processing Units (GPUs) for the simulation of internal solitary waves is presented and discussed. The computational efficiency of the GPU scheme is analyzed through a series of numerical experiments, including an ideal case and field-scale simulations, performed on a workstation and a supercomputer system. The results show that the speedup of the developed GPU-based parallel computing scheme, relative to the implementation on a single CPU core, increases with the number of computational grid cells, and that for problems with a relatively large number of grid cells the speedup increases quasi-linearly with the number of GPUs, up to 32 GPUs.
Funding: This work was mainly supported by the National High-Technology Research and Development Program (863) [grant number 2013AA122801] and the National Science Foundation of the United States [Award No. 1251095]. It was also partially supported by the Fundamental Research Funds for the Central Universities [grant number ZYGX2015J111]; the project entitled 'Design and development of the parallelism for typical remote sensing image algorithm based on heterogeneous computing' from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences; the project entitled 'CAST Innovation Fund: the Study of Agent and Cloud Based Spatial Big Data Service Chain'; and the National Natural Science Foundation of China [grant number 51277167].
Abstract: MODerate resolution atmospheric TRANsmission (MODTRAN) is a commercial remote sensing (RS) software package that has been widely used to simulate radiative transfer of electromagnetic radiation through the Earth's atmosphere and the radiation observed by a remote sensor. However, when very large RS datasets must be processed in simulation applications at a global scale, running MODTRAN on a modern workstation is extremely time-consuming. Under these circumstances, using parallel cluster computing to speed up the process becomes vital. This paper presents PMODTRAN, an implementation of a parallel task-scheduling algorithm based on MODTRAN. PMODTRAN reduced the processing time of the test cases used here from over 4.4 months on a workstation to less than a week on a local computer cluster. In addition, PMODTRAN can distribute tasks with different levels of granularity and has some extra features, such as dynamic load balancing and parameter checking.
Abstract: This paper proposes a new approach for implementing fast multicast on multistage interconnection networks (MINs) with multi-head worms. For an MIN with n stages of k×k switches, a single multi-head worm can cover an arbitrary set of destinations with a single communication start-up. Compared with schemes using unicast messages, this approach reduces multicast latency significantly, and it also performs better than multi-destination worms.
Funding: Supported partly by the Ministry of Science and Technology of the People's Republic of China (Grant Nos. 2007CB714407 and 2008ZX10004012), the Special Funds for Basic Research in CAMS of CMA (Grant No. 2007Y001), and the State Key Laboratory of Remote Sensing Sciences (Grant No. 07S00502CX).
Abstract: A wide variety of algorithms have been developed to monitor aerosol burden from satellite images. Still, few solutions currently allow real-time and efficient retrieval of aerosol optical thickness (AOT), mainly because of the extremely large volume of computation needed for the numerical solution of atmospheric radiative transfer equations. Building on the SYNergy of Terra and Aqua MODIS (SYNTAM) AOT retrieval algorithm, we present in this paper a novel method to retrieve AOT from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite images. The method adopts a strategy of block partition and collective communication, thereby maximizing load balance and reducing the overhead of inter-processor communication. Experiments were carried out to retrieve AOT at 0.44, 0.55, and 0.67 μm from MODIS/Terra and MODIS/Aqua data, using the parallel SYNTAM algorithm on the IBM System Cluster 1600 deployed at the China Meteorological Administration (CMA). Results showed that the parallel implementation greatly reduces computation time while maintaining high parallel efficiency. AOT derived by the parallel algorithm was validated against measurements from ground-based sun-photometers; in all cases, the relative error was within 20%, which demonstrated that the parallel algorithm is suitable for applications such as air quality monitoring and climate modeling.