The high-resolution DEM-IMB-LBM model can accurately describe pore-scale fluid-solid interactions,but its potential for use in geotechnical engineering analysis has not been fully unleashed due to its prohibitive comp...The high-resolution DEM-IMB-LBM model can accurately describe pore-scale fluid-solid interactions,but its potential for use in geotechnical engineering analysis has not been fully unleashed due to its prohibitive computational costs.To overcome this limitation,a message passing interface(MPI)parallel DEM-IMB-LBM framework is proposed aimed at enhancing computation efficiency.This framework utilises a static domain decomposition scheme,with the entire computation domain being decomposed into multiple subdomains according to predefined processors.A detailed parallel strategy is employed for both contact detection and hydrodynamic force calculation.In particular,a particle ID re-numbering scheme is proposed to handle particle transitions across sub-domain interfaces.Two benchmarks are conducted to validate the accuracy and overall performance of the proposed framework.Subsequently,the framework is applied to simulate scenarios involving multi-particle sedimentation and submarine landslides.The numerical examples effectively demonstrate the robustness and applicability of the MPI parallel DEM-IMB-LBM framework.展开更多
The simulation field became essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it can help us to simulate the nature of processing, so that ...The simulation field became essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it can help us to simulate the nature of processing, so that developers can make ideal casting designs. To take the prior occupation at commercial simulation market, so many development groups in the world are doing their every effort. They already reported successful stories in manufacturing fields by developing and providing the high performance simulation technologies for multipurpose. But they all run at powerful desk-side computers by well-trained experts mainly, so that it is hard to diffuse the scientific designing concept to newcomers in casting field. To overcome upcoming problems in scientific casting designs, we utilized information technologies and full-matured hardware backbones to spread out the effective and scientific casting design mind, and they all were integrated into Simulation Portal on the web. It professes scientific casting design on the NET including ubiquitous access way represented by "Anyone, Anytime, Anywhere" concept for casting designs.展开更多
In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finitedifference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a mess...In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finitedifference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a message passing interface (MPI) programming environment. The algorithm is implemented on a cluster-based high performance computer system. Parallel computation is performed with different division methods in 2D and 3D situations. Based on analysis of main factors affecting the speedup rate and parallel efficiency, data communication is reduced by selecting a suitable scheme of task division. A desirable scheme is recommended, giving a higher speedup rate and better efficiency. The results indicate that the unified parallel FDTD algorithm provides a solution to the numerical computation of acoustic scattering.展开更多
We present a time domain hybrid method to realize the fast coupling analysis of transmission lines excited by space electromagnetic fields, in which parallel finite-difference time-domain (FDTD) method, interpolation ...We present a time domain hybrid method to realize the fast coupling analysis of transmission lines excited by space electromagnetic fields, in which parallel finite-difference time-domain (FDTD) method, interpolation scheme, and Agrawal model-based transmission line (TL) equations are organically integrated together. Specifically, the Agrawal model is employed to establish the TL equations to describe the coupling effects of space electromagnetic fields on transmission lines. Then, the excitation fields functioning as distribution sources in TL equations are calculated by the parallel FDTD method through using the message passing interface (MPI) library scheme and interpolation scheme. Finally, the TL equations are discretized by the central difference scheme of FDTD and assigned to multiple processors to obtain the transient responses on the terminal loads of these lines. The significant feature of the presented method is embodied in its parallel and synchronous calculations of the space electromagnetic fields and transient responses on the lines. Numerical simulations of ambient wave acting on multi-conductor transmission lines (MTLs), which are located on the PEC ground and in the shielded cavity respectively, are implemented to verify the accuracy and efficiency of the presented method.展开更多
For data association in multisensor and multitarget tracking, a novel parallel algorithm is developed to improve the efficiency and real-time performance of FGAs-based algorithm. One Cluster of Workstation (COW) wit...For data association in multisensor and multitarget tracking, a novel parallel algorithm is developed to improve the efficiency and real-time performance of FGAs-based algorithm. One Cluster of Workstation (COW) with Message Passing Interface (MPI) is built. The proposed Multi-Deme Parallel FGA (MDPFGA) is run on the platform. A serial of special MDPFGAs are used to determine the static and the dynamic solutions of generalized m-best S-D assignment problem respectively, as well as target states estimation in track management. Such an assignment-based parallel algorithm is demonstrated on simulated passive sensor track formation and maintenance problem. While illustrating the feasibility of the proposed algorithm in multisensor multitarget tracking, simulation results indicate that the MDPFGAs-based algorithm has greater efficiency and speed than the FGAs-based algorithm.展开更多
A parallel algorithm of circulation numerical model based on message passing interface(MPI) is developed using serialization and an irregular rectangle decomposition scheme. Neighboring point exchange strategy(NPES...A parallel algorithm of circulation numerical model based on message passing interface(MPI) is developed using serialization and an irregular rectangle decomposition scheme. Neighboring point exchange strategy(NPES) is adopted to further enhance the computational efficiency. Two experiments are conducted on HP C7000 Blade System, the numerical results show that the parallel version with NPES(PVN) produces higher efficiency than the original parallel version(PV). The PVN achieves parallel efficiency in excess of 0.9 in the second experiment when the number of processors increases to 100, while the efficiency of PV decreases to 0.39 rapidly. The PVN of ocean circulation model is used in a fine-resolution regional simulation, which produces better results. The capability of universal implementation of this algorithm makes it applicable in many other ocean models potentially.展开更多
The Slingshot interconnect designed by HPE/Cray is becoming more relevant in high-performance computing with its deployment on the upcoming exascale systems.In particular,it is the interconnect empowering the first ex...The Slingshot interconnect designed by HPE/Cray is becoming more relevant in high-performance computing with its deployment on the upcoming exascale systems.In particular,it is the interconnect empowering the first exascale and highest-ranked supercomputer in the world,Frontier.It offers various features such as adaptive routing,congestion control,and isolated workloads.The deployment of newer interconnects sparks interest related to performance,scalability,and any potential bottlenecks as they are critical elements contributing to the scalability across nodes on these systems.In this paper,we delve into the challenges the Slingshot interconnect poses with current state-of-the-art MPI(message passing interface)libraries.In particular,we look at the scalability performance when using Slingshot across nodes.We present a comprehensive evaluation using various MPI and communication libraries including Cray MPICH,Open-MPI+UCX,RCCL,and MVAPICH2 on CPUs and GPUs on the Spock system,an early access cluster deployed with Slingshot-10,AMD MI100 GPUs and AMD Epyc Rome CPUs to emulate the Frontier system.We also evaluate preliminary CPU-based support of MPI libraries on the Slingshot-11 interconnect.展开更多
The network switches in the data plane of Software Defined Networking (SDN) are empowered by an elementary process, in which enormous number of packets which resemble big volumes of data are classified into specific f...The network switches in the data plane of Software Defined Networking (SDN) are empowered by an elementary process, in which enormous number of packets which resemble big volumes of data are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates the processing of data, so that instead of processing singular packets repeatedly, corresponding actions are performed on corresponding flows of packets. In this paper, first, we address limitations on a typical packet classification algorithm like Tuple Space Search (TSS). Then, we present a set of different scenarios to parallelize it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, which promises the average throughput rate of 4.2 Million packets per second (Mpps). That is, the hybrid cluster produced by the integration of Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the required processing speed in the programmable network systems that would be used to communicate big medical data.展开更多
A moisture advection scheme is an essential module of a numerical weather/climate model representing the horizontal transport of water vapor.The Piecewise Rational Method(PRM) scalar advection scheme in the Global/Reg...A moisture advection scheme is an essential module of a numerical weather/climate model representing the horizontal transport of water vapor.The Piecewise Rational Method(PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System(GRAPES) solves the moisture flux advection equation based on PRM.Computation of the scalar advection involves boundary exchange,and computation of higher bandwidth requirements is complicated and time-consuming in GRAPES.Recently,Graphics Processing Units(GPUs) have been widely used to solve scientific and engineering computing problems owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator(OpenACC).Herein,we present an accelerated PRM scalar advection scheme with Message Passing Interface(MPI) and OpenACC to fully exploit GPUs’ power over a cluster with multiple Central Processing Units(CPUs) and GPUs,together with optimization of various parameters such as minimizing data transfer,memory coalescing,exposing more parallelism,and overlapping computation with data transfers.Results show that about 3.5 times speedup is obtained for the entire model running at medium resolution with double precision when comparing the scheme’s elapsed time on a node with two GPUs(NVIDIA P100) and two 16-core CPUs(Intel Gold 6142).Further,results obtained from experiments of a higher resolution model with multiple GPUs show excellent scalability.展开更多
The parallel computing algorithm for a nonhydrostatic model on one or multiple Graphic Processing Units (GPUs) for the simulation of internal solitary waves is presented and discussed. The computational efficiency o...The parallel computing algorithm for a nonhydrostatic model on one or multiple Graphic Processing Units (GPUs) for the simulation of internal solitary waves is presented and discussed. The computational efficiency of the GPU scheme is analyzed by a series of numerical experiments, including an ideal case and the field scale simulations, performed on the workstation and the super- computer system. The calculated results show that the speedup of the developed GPU-based parallel computing scheme, compared to the implementation on a single CPU core, increases with the number of computational grid cells, and the speedup can increase quasi- linearly with respect to the number of involved GPUs for the problem with relatively large number of grid cells within 32 GPUs.展开更多
This paper proposes a new approach for implementing fast multicast on multistage interconnection networks (MINs) with multi-head worms. For an MIN with n stages of k×k switches, a single multi-head worm can cover...This paper proposes a new approach for implementing fast multicast on multistage interconnection networks (MINs) with multi-head worms. For an MIN with n stages of k×k switches, a single multi-head worm can cover an arbitrary set of destinations with a single communication start-up. Compared with schemes using unicast messages, this approach reduces multicast latency significantly and performs better than multi-destination worms.展开更多
A wide variety of algorithms have been developed to monitor aerosol burden from satellite images. Still, few solutions currently allow for real-time and efficient retrieval of aerosol optical thickness (AOT), mainly...A wide variety of algorithms have been developed to monitor aerosol burden from satellite images. Still, few solutions currently allow for real-time and efficient retrieval of aerosol optical thickness (AOT), mainly due to the extremely large volume of computation necessary for the numeric solution of atmospheric radiative transfer equations. Taking into account the efforts to exploit the SYNergy of Terra and Aqua Modis (SYNTAM, an AOT retrieval algorithm), we present in this paper a novel method to retrieve AOT from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite images, in which the strategy of block partition and collective communication was taken, thereby maximizing load balance and reducing the overhead time during inter-processor communication. Experiments were carried out to retrieve AOT at 0.44, 0.55, and 0.67μm of MODIS/Terra and MODIS/Aqua data, using the parallel SYNTAM algorithm in the IBM System Cluster 1600 deployed at China Meteorological Administration (CMA). Results showed that parallel implementation can greatly reduce computation time, and thus ensure high parallel efficiency. AOT derived by parallel algorithm was validated against measurements from ground-based sun-photometers; in all cases, the relative error range was within 20%, which demonstrated that the parallel algorithm was suitable for applications such as air quality monitoring and climate modeling.展开更多
基金financially supported by the National Natural Science Foundation of China(Grant Nos.12072217 and 42077254)the Natural Science Foundation of Hunan Province,China(Grant No.2022JJ30567).
文摘The high-resolution DEM-IMB-LBM model can accurately describe pore-scale fluid-solid interactions,but its potential for use in geotechnical engineering analysis has not been fully unleashed due to its prohibitive computational costs.To overcome this limitation,a message passing interface(MPI)parallel DEM-IMB-LBM framework is proposed aimed at enhancing computation efficiency.This framework utilises a static domain decomposition scheme,with the entire computation domain being decomposed into multiple subdomains according to predefined processors.A detailed parallel strategy is employed for both contact detection and hydrodynamic force calculation.In particular,a particle ID re-numbering scheme is proposed to handle particle transitions across sub-domain interfaces.Two benchmarks are conducted to validate the accuracy and overall performance of the proposed framework.Subsequently,the framework is applied to simulate scenarios involving multi-particle sedimentation and submarine landslides.The numerical examples effectively demonstrate the robustness and applicability of the MPI parallel DEM-IMB-LBM framework.
文摘The simulation field became essential in designing or developing new casting products and in improving manufacturing processes within limited time, because it can help us to simulate the nature of processing, so that developers can make ideal casting designs. To take the prior occupation at commercial simulation market, so many development groups in the world are doing their every effort. They already reported successful stories in manufacturing fields by developing and providing the high performance simulation technologies for multipurpose. But they all run at powerful desk-side computers by well-trained experts mainly, so that it is hard to diffuse the scientific designing concept to newcomers in casting field. To overcome upcoming problems in scientific casting designs, we utilized information technologies and full-matured hardware backbones to spread out the effective and scientific casting design mind, and they all were integrated into Simulation Portal on the web. It professes scientific casting design on the NET including ubiquitous access way represented by "Anyone, Anytime, Anywhere" concept for casting designs.
基金Project supported by the National Defense Laboratory Foundation (Grant No.51444020103QT0601)the Shanghai Leading Academic Discipline Project (Grant No.T0102)
文摘In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finitedifference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a message passing interface (MPI) programming environment. The algorithm is implemented on a cluster-based high performance computer system. Parallel computation is performed with different division methods in 2D and 3D situations. Based on analysis of main factors affecting the speedup rate and parallel efficiency, data communication is reduced by selecting a suitable scheme of task division. A desirable scheme is recommended, giving a higher speedup rate and better efficiency. The results indicate that the unified parallel FDTD algorithm provides a solution to the numerical computation of acoustic scattering.
基金Project supported by the National Natural Science Foundation of China(Grant No.61701057)the Chongqing Research Program of Basic Research and Frontier Technology,China(Grant No.cstc2017jcyjAX0345).
文摘We present a time domain hybrid method to realize the fast coupling analysis of transmission lines excited by space electromagnetic fields, in which parallel finite-difference time-domain (FDTD) method, interpolation scheme, and Agrawal model-based transmission line (TL) equations are organically integrated together. Specifically, the Agrawal model is employed to establish the TL equations to describe the coupling effects of space electromagnetic fields on transmission lines. Then, the excitation fields functioning as distribution sources in TL equations are calculated by the parallel FDTD method through using the message passing interface (MPI) library scheme and interpolation scheme. Finally, the TL equations are discretized by the central difference scheme of FDTD and assigned to multiple processors to obtain the transient responses on the terminal loads of these lines. The significant feature of the presented method is embodied in its parallel and synchronous calculations of the space electromagnetic fields and transient responses on the lines. Numerical simulations of ambient wave acting on multi-conductor transmission lines (MTLs), which are located on the PEC ground and in the shielded cavity respectively, are implemented to verify the accuracy and efficiency of the presented method.
基金Supported by National Defence Scientific Research Foundation
文摘For data association in multisensor and multitarget tracking, a novel parallel algorithm is developed to improve the efficiency and real-time performance of FGAs-based algorithm. One Cluster of Workstation (COW) with Message Passing Interface (MPI) is built. The proposed Multi-Deme Parallel FGA (MDPFGA) is run on the platform. A serial of special MDPFGAs are used to determine the static and the dynamic solutions of generalized m-best S-D assignment problem respectively, as well as target states estimation in track management. Such an assignment-based parallel algorithm is demonstrated on simulated passive sensor track formation and maintenance problem. While illustrating the feasibility of the proposed algorithm in multisensor multitarget tracking, simulation results indicate that the MDPFGAs-based algorithm has greater efficiency and speed than the FGAs-based algorithm.
基金The National High Technology Research and Development Program(863 Program)of China under contract No.2013AA09A505
文摘A parallel algorithm of circulation numerical model based on message passing interface(MPI) is developed using serialization and an irregular rectangle decomposition scheme. Neighboring point exchange strategy(NPES) is adopted to further enhance the computational efficiency. Two experiments are conducted on HP C7000 Blade System, the numerical results show that the parallel version with NPES(PVN) produces higher efficiency than the original parallel version(PV). The PVN achieves parallel efficiency in excess of 0.9 in the second experiment when the number of processors increases to 100, while the efficiency of PV decreases to 0.39 rapidly. The PVN of ocean circulation model is used in a fine-resolution regional simulation, which produces better results. The capability of universal implementation of this algorithm makes it applicable in many other ocean models potentially.
基金supported in part by the U.S.National Science Foundation under Grant Nos.1818253,1854828,1931537,and 2007991XRAC under Grant No.NCR-130002supported by the Office of Science of the U.S.Department of Energy under Contract No.DE-AC05-00OR22725.
文摘The Slingshot interconnect designed by HPE/Cray is becoming more relevant in high-performance computing with its deployment on the upcoming exascale systems.In particular,it is the interconnect empowering the first exascale and highest-ranked supercomputer in the world,Frontier.It offers various features such as adaptive routing,congestion control,and isolated workloads.The deployment of newer interconnects sparks interest related to performance,scalability,and any potential bottlenecks as they are critical elements contributing to the scalability across nodes on these systems.In this paper,we delve into the challenges the Slingshot interconnect poses with current state-of-the-art MPI(message passing interface)libraries.In particular,we look at the scalability performance when using Slingshot across nodes.We present a comprehensive evaluation using various MPI and communication libraries including Cray MPICH,Open-MPI+UCX,RCCL,and MVAPICH2 on CPUs and GPUs on the Spock system,an early access cluster deployed with Slingshot-10,AMD MI100 GPUs and AMD Epyc Rome CPUs to emulate the Frontier system.We also evaluate preliminary CPU-based support of MPI libraries on the Slingshot-11 interconnect.
文摘The network switches in the data plane of Software Defined Networking (SDN) are empowered by an elementary process, in which enormous number of packets which resemble big volumes of data are classified into specific flows by matching them against a set of dynamic rules. This basic process accelerates the processing of data, so that instead of processing singular packets repeatedly, corresponding actions are performed on corresponding flows of packets. In this paper, first, we address limitations on a typical packet classification algorithm like Tuple Space Search (TSS). Then, we present a set of different scenarios to parallelize it on different parallel processing platforms, including Graphics Processing Units (GPUs), clusters of Central Processing Units (CPUs), and hybrid clusters. Experimental results show that the hybrid cluster provides the best platform for parallelizing packet classification algorithms, which promises the average throughput rate of 4.2 Million packets per second (Mpps). That is, the hybrid cluster produced by the integration of Compute Unified Device Architecture (CUDA), Message Passing Interface (MPI), and OpenMP programming model could classify 0.24 million packets per second more than the GPU cluster scheme. Such a packet classifier satisfies the required processing speed in the programmable network systems that would be used to communicate big medical data.
基金supported by the decision support project of response to climate change of China,the National Natural Science Foundation of China (Nos.41674085, 41604009, and 41621091)the Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034)the Open Project of State Key Laboratory of Plateau Ecology and Agriculture,Qinghai University (No. 2020-zz-03)。
文摘A moisture advection scheme is an essential module of a numerical weather/climate model representing the horizontal transport of water vapor.The Piecewise Rational Method(PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System(GRAPES) solves the moisture flux advection equation based on PRM.Computation of the scalar advection involves boundary exchange,and computation of higher bandwidth requirements is complicated and time-consuming in GRAPES.Recently,Graphics Processing Units(GPUs) have been widely used to solve scientific and engineering computing problems owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator(OpenACC).Herein,we present an accelerated PRM scalar advection scheme with Message Passing Interface(MPI) and OpenACC to fully exploit GPUs’ power over a cluster with multiple Central Processing Units(CPUs) and GPUs,together with optimization of various parameters such as minimizing data transfer,memory coalescing,exposing more parallelism,and overlapping computation with data transfers.Results show that about 3.5 times speedup is obtained for the entire model running at medium resolution with double precision when comparing the scheme’s elapsed time on a node with two GPUs(NVIDIA P100) and two 16-core CPUs(Intel Gold 6142).Further,results obtained from experiments of a higher resolution model with multiple GPUs show excellent scalability.
基金supported by the Natural Science Foundation of Tianjin, China (Grant No. 12JCZDJC30200)the National Natural Science Foundation of China (Grant No. 51021004)the Fundamental Research Fund for the Central Nonprofit Research Institutes of China (Grant No. TKS100206)
文摘The parallel computing algorithm for a nonhydrostatic model on one or multiple Graphic Processing Units (GPUs) for the simulation of internal solitary waves is presented and discussed. The computational efficiency of the GPU scheme is analyzed by a series of numerical experiments, including an ideal case and the field scale simulations, performed on the workstation and the super- computer system. The calculated results show that the speedup of the developed GPU-based parallel computing scheme, compared to the implementation on a single CPU core, increases with the number of computational grid cells, and the speedup can increase quasi- linearly with respect to the number of involved GPUs for the problem with relatively large number of grid cells within 32 GPUs.
文摘This paper proposes a new approach for implementing fast multicast on multistage interconnection networks (MINs) with multi-head worms. For an MIN with n stages of k×k switches, a single multi-head worm can cover an arbitrary set of destinations with a single communication start-up. Compared with schemes using unicast messages, this approach reduces multicast latency significantly and performs better than multi-destination worms.
基金supported partly by the Ministry of Science and Technology of the People’s Republic of China (Grant Nos.2007CB714407, and 2008ZX10004012)the Special Funds for Basic Research in CAMS of CMA (Grant No. 2007Y001)State Key Laboratory of Remote Sensing Sciences (Grant No.07S00502CX)
文摘A wide variety of algorithms have been developed to monitor aerosol burden from satellite images. Still, few solutions currently allow for real-time and efficient retrieval of aerosol optical thickness (AOT), mainly due to the extremely large volume of computation necessary for the numeric solution of atmospheric radiative transfer equations. Taking into account the efforts to exploit the SYNergy of Terra and Aqua Modis (SYNTAM, an AOT retrieval algorithm), we present in this paper a novel method to retrieve AOT from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite images, in which the strategy of block partition and collective communication was taken, thereby maximizing load balance and reducing the overhead time during inter-processor communication. Experiments were carried out to retrieve AOT at 0.44, 0.55, and 0.67μm of MODIS/Terra and MODIS/Aqua data, using the parallel SYNTAM algorithm in the IBM System Cluster 1600 deployed at China Meteorological Administration (CMA). Results showed that parallel implementation can greatly reduce computation time, and thus ensure high parallel efficiency. AOT derived by parallel algorithm was validated against measurements from ground-based sun-photometers; in all cases, the relative error range was within 20%, which demonstrated that the parallel algorithm was suitable for applications such as air quality monitoring and climate modeling.