In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finitedifference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a mess...In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finitedifference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a message passing interface (MPI) programming environment. The algorithm is implemented on a cluster-based high performance computer system. Parallel computation is performed with different division methods in 2D and 3D situations. Based on analysis of main factors affecting the speedup rate and parallel efficiency, data communication is reduced by selecting a suitable scheme of task division. A desirable scheme is recommended, giving a higher speedup rate and better efficiency. The results indicate that the unified parallel FDTD algorithm provides a solution to the numerical computation of acoustic scattering.展开更多
Up to now,so much casting analysis software has been continuing to develop the new access way to real casting processes. Those include the melt flow analysis,heat transfer analysis for solidification calculation,mecha...Up to now,so much casting analysis software has been continuing to develop the new access way to real casting processes. Those include the melt flow analysis,heat transfer analysis for solidification calculation,mechanical property predictions and microstructure predictions. These trials were successful to obtain the ideal results comparing with real situations,so that CAE technologies became inevitable to design or develop new casting processes. But for manufacturing fields,CAE technologies are not so frequently being used because of their difficulties in using the software or insufficient computing performances. To introduce CAE technologies to manufacturing field,the high performance analysis is essential to shorten the gap between product designing time and prototyping time. The software code optimization can be helpful,but it is not enough,because the codes developed by software experts are already optimized enough. As an alternative proposal for high performance computations,the parallel computation technologies are eagerly being applied to CAE technologies to make the analysis time shorter. In this research,SMP (Shared Memory Processing) and MPI (Message Passing Interface) (1) methods for parallelization were applied to commercial software "Z-Cast" to calculate the casting processes. In the code parallelizing processes,the network stabilization,core optimization were also carried out under Microsoft Windows platform and their performances and results were compared with those of normal linear analysis codes.展开更多
The grid equations in decomposed domain by parallel computation are soled, and a method of local orthogonalization to solve the large-scaled numerical computation is presented. It constructs preconditioned iteration m...The grid equations in decomposed domain by parallel computation are soled, and a method of local orthogonalization to solve the large-scaled numerical computation is presented. It constructs preconditioned iteration matrix by the combination of predigesting LU decomposition and local orthogonalization, and the convergence of solution is proved. Indicated from the example, this algorithm can increase the rate of computation efficiently and it is quite stable.展开更多
To efficiently complete a complex computation task,the complex task should be decomposed into subcomputation tasks that run parallel in edge computing.Wireless Sensor Network(WSN)is a typical application of parallel c...To efficiently complete a complex computation task,the complex task should be decomposed into subcomputation tasks that run parallel in edge computing.Wireless Sensor Network(WSN)is a typical application of parallel computation.To achieve highly reliable parallel computation for wireless sensor network,the network's lifetime needs to be extended.Therefore,a proper task allocation strategy is needed to reduce the energy consumption and balance the load of the network.This paper proposes a task model and a cluster-based WSN model in edge computing.In our model,different tasks require different types of resources and different sensors provide different types of resources,so our model is heterogeneous,which makes the model more practical.Then we propose a task allocation algorithm that combines the Genetic Algorithm(GA)and the Ant Colony Optimization(ACO)algorithm.The algorithm concentrates on energy conservation and load balancing so that the lifetime of the network can be extended.The experimental result shows the algorithm's effectiveness and advantages in energy conservation and load balancing.展开更多
In this paper, a 3rd order combination method with three processes and a 4th order combination method with five processes for solving ODEs are discussed. These methods are the Runge-Kutta method combined with a linear...In this paper, a 3rd order combination method with three processes and a 4th order combination method with five processes for solving ODEs are discussed. These methods are the Runge-Kutta method combined with a linear multistep method, which overcomes the defect of the 3rd order parallel Runge-Kutta method discussed in [1].展开更多
Three-dimensional(3D)image reconstruction involves the computations of an extensive amount of data that leads to tremendous processing time.Therefore,optimization is crucially needed to improve the performance and eff...Three-dimensional(3D)image reconstruction involves the computations of an extensive amount of data that leads to tremendous processing time.Therefore,optimization is crucially needed to improve the performance and efficiency.With the widespread use of graphics processing units(GPU),parallel computing is transforming this arduous reconstruction process for numerous imaging modalities,and photoacoustic computed tomography(PACT)is not an exception.Existing works have investigated GPU-based optimization on photoacoustic microscopy(PAM)and PACT reconstruction using compute unified device architecture(CUDA)on either C++or MATLAB only.However,our study is the first that uses cross-platform GPU computation.It maintains the simplicity of MATLAB,while improves the speed through CUDA/C++−based MATLAB converted functions called MEXCUDA.Compared to a purely MATLAB with GPU approach,our cross-platform method improves the speed five times.Because MATLAB is widely used in PAM and PACT,this study will open up new avenues for photoacoustic image reconstruction and relevant real-time imaging applications.展开更多
Based on the efficient hybrid methods for solving initial value problems of stiff ODEs, this paper derives a parallel scheme that can be used to solve the problems on parallel computers with N processors, and discusse...Based on the efficient hybrid methods for solving initial value problems of stiff ODEs, this paper derives a parallel scheme that can be used to solve the problems on parallel computers with N processors, and discusses the iteratively B-convergence of the Newton iterative process, finally, the paper provides some numberical results which show that the parallel scheme is highly efficient as N is not too large.展开更多
The discrete fracture network model is a powerful tool for fractured rock mass fluid flow simulations and supports safety assessments of coal mine hazards such as water inrush.Intersection analysis,which identifies al...The discrete fracture network model is a powerful tool for fractured rock mass fluid flow simulations and supports safety assessments of coal mine hazards such as water inrush.Intersection analysis,which identifies all pairs of intersected fractures(the basic components composing the connectivity of a network),is one of its crucial procedures.This paper attempts to improve intersection analysis through parallel computing.Considering a seamless interfacing with other procedures in modeling,two algorithms are designed and presented,of which one is a completely independent parallel procedure with some redundant computations and the other is an optimized version with reduced redundancy.A numerical study indicates that both of the algorithms are practical and can significantly improve the computational performance of intersection analysis for large-scale simulations.Moreover,the preferred application conditions for the two algorithms are also discussed.展开更多
Derived from a proposed universal mathematical expression, this paper investigates a novel algo-rithm for parallel Cyclic Redundancy Check (CRC) computation, which is an iterative algorithm to update the check-bit seq...Derived from a proposed universal mathematical expression, this paper investigates a novel algo-rithm for parallel Cyclic Redundancy Check (CRC) computation, which is an iterative algorithm to update the check-bit sequence step by step and suits to various argument selections of CRC computation. The algorithm proposed is quite suitable for hardware implementation. The simulation implementation and performance analysis suggest that it could efficiently speed up the computation compared with the conventional ones. The algorithm is implemented in hardware at as high as 21Gbps, and its usefulness in high-speed CRC computa-tions is implied, such as Asynchronous Transfer Mode (ATM) networks and 10G Ethernet.展开更多
This paper improves and generalizes the two difference schemes presented in paper [1] and gives a new difference scheme for second order linear elliptic partial differential equations, its difference matrix is a matri...This paper improves and generalizes the two difference schemes presented in paper [1] and gives a new difference scheme for second order linear elliptic partial differential equations, its difference matrix is a matrix and because of the stability of the M-matrix, it is convergent by the asynchronous iterative method on multiprocessors. Then this paper gives a class of differeifce schemes for linear elliptic PDEs so that their difference matrixes are all M-matrixes and their asynchronous parallel computation are convergent.展开更多
A parallel numerical method is employed to solve the two dimensional Navier Stokes equations in primitive variables for incompressible flow.The computing process contains two sections.The first section uses the GE (...A parallel numerical method is employed to solve the two dimensional Navier Stokes equations in primitive variables for incompressible flow.The computing process contains two sections.The first section uses the GE (group explicit) method with high parallelism to solve the velocity equations.The second section solves the pressure equation by a successive underrelaxtion iteration method in red black order.Results are given using these methods on a parallel computer.Until recently,GE method has rarely been used to solve the Navier Stokes equations.展开更多
In recent years, high performance scientific computing under workstation cluster connected by local area network is becoming a hot point. Owing to both the longer latency and the higher overhead for protocol processin...In recent years, high performance scientific computing under workstation cluster connected by local area network is becoming a hot point. Owing to both the longer latency and the higher overhead for protocol processing compared with the powerful single workstation capacity, it is becoming severe important to keep balance not only for numerical load but also for communication load, and to overlap communications with computations while parallel computing. Hence,our efficiency evaluation rules must discover these capacities of a given parallel algorithm in order to optimize the existed algorithm to attain its highest parallel efficiency. The traditional efficiency evaluation rules can not succeed in this work any more. Fortunately, thanks to Culler's detail discuss in LogP model about interconnection networks for MPP systems, we present a system of efficiency evaluation rules for parallel computations under workstation cluster with PVM3.0 parallel software framework in this paper. These rules can satisfy above acquirements successfully. At last, two typical synchronous,and asynchronous applications are designed to verify the validity of these rules under 4 SGIs workstations cluster connected by Ethernet.展开更多
In this paper,a 4th order parallel computation method with four processes for solving ODEs is discussed.This method is the Runge-Kutta method combined with a linear multistep method,which overcomes the difficulties of...In this paper,a 4th order parallel computation method with four processes for solving ODEs is discussed.This method is the Runge-Kutta method combined with a linear multistep method,which overcomes the difficulties of the 4th order parallel Runge-Kutta method discussed in [1].The concept of critical speedup for parallel methods is also defined,and speedups of some methods are analyzed by using this concept.展开更多
Multicomputer systems(distributed memory computer systems) are becoming more and more popular and will be wildly used in scientific researches. In this paper, we present a parallel algorithm of Fourier Transform of a ...Multicomputer systems(distributed memory computer systems) are becoming more and more popular and will be wildly used in scientific researches. In this paper, we present a parallel algorithm of Fourier Transform of a vector of complex numbers on multicomputer system and give its computing times and its speedup in parallel environment supported by EXPRESS system on the multicomputer system which consists of four SGI workstations. Our analysis shows that the results is ideal and this scheme is suitable to multicomputer systems.展开更多
The rapid growth of interconnected high performance workstations has produced a new computing paradigm called clustered of workstations computing. In these systems load balance problem is a serious impediment to achie...The rapid growth of interconnected high performance workstations has produced a new computing paradigm called clustered of workstations computing. In these systems load balance problem is a serious impediment to achieve good performance. The main concern of this paper is the implementation of dynamic load balancing algorithm, asynchronous Round Robin (ARR), for balancing workload of parallel tree computation depth-first-search algorithm on Cluster of Heterogeneous Workstations (COW) Many algorithms in artificial intelligence and other areas of computer science are based on depth first search in implicitty defined trees. For these algorithms a load-balancing scheme is required, which is able to evenly distribute parts of an irregularly shaped tree over the workstations with minimal interprocessor communication and without prior knowledge of the tree’s shape. For the (ARR) algorithm only minimal interprocessor communication is needed when necessary and it runs under the MPI (Message passing interface) that allows parallel execution on heterogeneous SUN cluster of workstation platform. The program code is written in C language and executed under UNIX operating system (Solaris version).展开更多
The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memorysystems.However, MPI implementations can contain defects that impact the reliability and performance of par...The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memorysystems.However, MPI implementations can contain defects that impact the reliability and performance of parallelapplications. Detecting and correcting these defects is crucial, yet there is a lack of published models specificallydesigned for correctingMPI defects. To address this, we propose a model for detecting and correcting MPI defects(DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blockingpoint-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defectsaddressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and messagemismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a datasetconsisting of 40 MPI codes. The results indicate that the model achieved a detection rate of 37 out of 40 codes,resulting in an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI modelranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correctingdefects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. TheDC_MPImodel fills an important research gap and provides a valuable tool for improving the quality ofMPI-basedparallel computing systems.展开更多
In this research,we present the pure open multi-processing(OpenMP),pure message passing interface(MPI),and hybrid MPI/OpenMP parallel solvers within the dynamic explicit central difference algorithm for the coining pr...In this research,we present the pure open multi-processing(OpenMP),pure message passing interface(MPI),and hybrid MPI/OpenMP parallel solvers within the dynamic explicit central difference algorithm for the coining process to address the challenge of capturing fine relief features of approximately 50 microns.Achieving such precision demands the utilization of at least 7 million tetrahedron elements,surpassing the capabilities of traditional serial programs previously developed.To mitigate data races when calculating internal forces,intermediate arrays are introduced within the OpenMP directive.This helps ensure proper synchronization and avoid conflicts during parallel execution.Additionally,in the MPI implementation,the coins are partitioned into the desired number of regions.This division allows for efficient distribution of computational tasks across multiple processes.Numerical simulation examples are conducted to compare the three solvers with serial programs,evaluating correctness,acceleration ratio,and parallel efficiency.The results reveal a relative error of approximately 0.3%in forming force among the parallel and serial solvers,while the predicted insufficient material zones align with experimental observations.Additionally,speedup ratio and parallel efficiency are assessed for the coining process simulation.The pureMPI parallel solver achieves a maximum acceleration of 9.5 on a single computer(utilizing 12 cores)and the hybrid solver exhibits a speedup ratio of 136 in a cluster(using 6 compute nodes and 12 cores per compute node),showing the strong scalability of the hybrid MPI/OpenMP programming model.This approach effectively meets the simulation requirements for commemorative coins with intricate relief patterns.展开更多
This paper aims to solve large-scale and complex isogeometric topology optimization problems that consumesignificant computational resources. A novel isogeometric topology optimization method with a hybrid parallelstr...This paper aims to solve large-scale and complex isogeometric topology optimization problems that consumesignificant computational resources. A novel isogeometric topology optimization method with a hybrid parallelstrategy of CPU/GPU is proposed, while the hybrid parallel strategies for stiffness matrix assembly, equationsolving, sensitivity analysis, and design variable update are discussed in detail. To ensure the high efficiency ofCPU/GPU computing, a workload balancing strategy is presented for optimally distributing the workload betweenCPU and GPU. To illustrate the advantages of the proposedmethod, three benchmark examples are tested to verifythe hybrid parallel strategy in this paper. The results show that the efficiency of the hybrid method is faster thanserial CPU and parallel GPU, while the speedups can be up to two orders of magnitude.展开更多
With the development of parallel computing technology,non-linear inversion calculation efficiency has been improving.However,for single-point search-based non-linear inversion methods,the implementation of parallel al...With the development of parallel computing technology,non-linear inversion calculation efficiency has been improving.However,for single-point search-based non-linear inversion methods,the implementation of parallel algorithms is a difficult issue.We introduce the idea of group search to the single-point search-based non-linear inversion algorithm, taking the quantum Monte Carlo method as an example for two-dimensional seismic wave velocity inversion and practical impedance inversion and test the calculation efficiency of using different node numbers.The results show the parallel algorithm in theoretical and practical data inversion is feasible and effective.The parallel algorithm has good versatility. The algorithm efficiency increases with increasing node numbers but the algorithm efficiency rate of increase gradually decreases as the node numbers increase.展开更多
基金Project supported by the National Defense Laboratory Foundation (Grant No.51444020103QT0601)the Shanghai Leading Academic Discipline Project (Grant No.T0102)
文摘In this work, we treat scattering objects, water, surface and bottom in a truly unified manner in a parallel finitedifference time-domain (FDTD) scheme, which is suitable for distributed parallel computing in a message passing interface (MPI) programming environment. The algorithm is implemented on a cluster-based high performance computer system. Parallel computation is performed with different division methods in 2D and 3D situations. Based on analysis of main factors affecting the speedup rate and parallel efficiency, data communication is reduced by selecting a suitable scheme of task division. A desirable scheme is recommended, giving a higher speedup rate and better efficiency. The results indicate that the unified parallel FDTD algorithm provides a solution to the numerical computation of acoustic scattering.
文摘Up to now,so much casting analysis software has been continuing to develop the new access way to real casting processes. Those include the melt flow analysis,heat transfer analysis for solidification calculation,mechanical property predictions and microstructure predictions. These trials were successful to obtain the ideal results comparing with real situations,so that CAE technologies became inevitable to design or develop new casting processes. But for manufacturing fields,CAE technologies are not so frequently being used because of their difficulties in using the software or insufficient computing performances. To introduce CAE technologies to manufacturing field,the high performance analysis is essential to shorten the gap between product designing time and prototyping time. The software code optimization can be helpful,but it is not enough,because the codes developed by software experts are already optimized enough. As an alternative proposal for high performance computations,the parallel computation technologies are eagerly being applied to CAE technologies to make the analysis time shorter. In this research,SMP (Shared Memory Processing) and MPI (Message Passing Interface) (1) methods for parallelization were applied to commercial software "Z-Cast" to calculate the casting processes. In the code parallelizing processes,the network stabilization,core optimization were also carried out under Microsoft Windows platform and their performances and results were compared with those of normal linear analysis codes.
文摘The grid equations in decomposed domain by parallel computation are soled, and a method of local orthogonalization to solve the large-scaled numerical computation is presented. It constructs preconditioned iteration matrix by the combination of predigesting LU decomposition and local orthogonalization, and the convergence of solution is proved. Indicated from the example, this algorithm can increase the rate of computation efficiently and it is quite stable.
基金supported by Postdoctoral Science Foundation of China(No.2021M702441)National Natural Science Foundation of China(No.61871283)。
文摘To efficiently complete a complex computation task,the complex task should be decomposed into subcomputation tasks that run parallel in edge computing.Wireless Sensor Network(WSN)is a typical application of parallel computation.To achieve highly reliable parallel computation for wireless sensor network,the network's lifetime needs to be extended.Therefore,a proper task allocation strategy is needed to reduce the energy consumption and balance the load of the network.This paper proposes a task model and a cluster-based WSN model in edge computing.In our model,different tasks require different types of resources and different sensors provide different types of resources,so our model is heterogeneous,which makes the model more practical.Then we propose a task allocation algorithm that combines the Genetic Algorithm(GA)and the Ant Colony Optimization(ACO)algorithm.The algorithm concentrates on energy conservation and load balancing so that the lifetime of the network can be extended.The experimental result shows the algorithm's effectiveness and advantages in energy conservation and load balancing.
文摘In this paper, a 3rd order combination method with three processes and a 4th order combination method with five processes for solving ODEs are discussed. These methods are the Runge-Kutta method combined with a linear multistep method, which overcomes the defect of the 3rd order parallel Runge-Kutta method discussed in [1].
基金supported in part by the Career Catalyst Research Grant from the Susan G.Komen Foundationthe Clinical and Translational Science Pilot Study Award from the National Institutes of Health.
文摘Three-dimensional(3D)image reconstruction involves the computations of an extensive amount of data that leads to tremendous processing time.Therefore,optimization is crucially needed to improve the performance and efficiency.With the widespread use of graphics processing units(GPU),parallel computing is transforming this arduous reconstruction process for numerous imaging modalities,and photoacoustic computed tomography(PACT)is not an exception.Existing works have investigated GPU-based optimization on photoacoustic microscopy(PAM)and PACT reconstruction using compute unified device architecture(CUDA)on either C++or MATLAB only.However,our study is the first that uses cross-platform GPU computation.It maintains the simplicity of MATLAB,while improves the speed through CUDA/C++−based MATLAB converted functions called MEXCUDA.Compared to a purely MATLAB with GPU approach,our cross-platform method improves the speed five times.Because MATLAB is widely used in PAM and PACT,this study will open up new avenues for photoacoustic image reconstruction and relevant real-time imaging applications.
文摘Based on the efficient hybrid methods for solving initial value problems of stiff ODEs, this paper derives a parallel scheme that can be used to solve the problems on parallel computers with N processors, and discusses the iteratively B-convergence of the Newton iterative process, finally, the paper provides some numberical results which show that the parallel scheme is highly efficient as N is not too large.
基金supported by the National Basic Research Program of China(973 Program)(2010CB428801,2010CB428804)National High-tech R&D Program of China(863 Program)(2011AA050105)+1 种基金National Science Foundation of China(40972166)National Science and Technology Major Project of China(2011ZX 05060-005).
文摘The discrete fracture network model is a powerful tool for fractured rock mass fluid flow simulations and supports safety assessments of coal mine hazards such as water inrush.Intersection analysis,which identifies all pairs of intersected fractures(the basic components composing the connectivity of a network),is one of its crucial procedures.This paper attempts to improve intersection analysis through parallel computing.Considering a seamless interfacing with other procedures in modeling,two algorithms are designed and presented,of which one is a completely independent parallel procedure with some redundant computations and the other is an optimized version with reduced redundancy.A numerical study indicates that both of the algorithms are practical and can significantly improve the computational performance of intersection analysis for large-scale simulations.Moreover,the preferred application conditions for the two algorithms are also discussed.
基金Supported by the National Natural Science Foundation of China (No.60172029) and the Natural Science Foun-dation of Shaanxi Province (No.2004F04).
文摘Derived from a proposed universal mathematical expression, this paper investigates a novel algo-rithm for parallel Cyclic Redundancy Check (CRC) computation, which is an iterative algorithm to update the check-bit sequence step by step and suits to various argument selections of CRC computation. The algorithm proposed is quite suitable for hardware implementation. The simulation implementation and performance analysis suggest that it could efficiently speed up the computation compared with the conventional ones. The algorithm is implemented in hardware at as high as 21Gbps, and its usefulness in high-speed CRC computa-tions is implied, such as Asynchronous Transfer Mode (ATM) networks and 10G Ethernet.
文摘This paper improves and generalizes the two difference schemes presented in paper [1] and gives a new difference scheme for second order linear elliptic partial differential equations, its difference matrix is a matrix and because of the stability of the M-matrix, it is convergent by the asynchronous iterative method on multiprocessors. Then this paper gives a class of differeifce schemes for linear elliptic PDEs so that their difference matrixes are all M-matrixes and their asynchronous parallel computation are convergent.
基金Supported by the Science and TechnologyFoundation of the Chinese Institute of E-ngineering Physics
文摘A parallel numerical method is employed to solve the two dimensional Navier Stokes equations in primitive variables for incompressible flow.The computing process contains two sections.The first section uses the GE (group explicit) method with high parallelism to solve the velocity equations.The second section solves the pressure equation by a successive underrelaxtion iteration method in red black order.Results are given using these methods on a parallel computer.Until recently,GE method has rarely been used to solve the Navier Stokes equations.
文摘In recent years, high performance scientific computing under workstation cluster connected by local area network is becoming a hot point. Owing to both the longer latency and the higher overhead for protocol processing compared with the powerful single workstation capacity, it is becoming severe important to keep balance not only for numerical load but also for communication load, and to overlap communications with computations while parallel computing. Hence,our efficiency evaluation rules must discover these capacities of a given parallel algorithm in order to optimize the existed algorithm to attain its highest parallel efficiency. The traditional efficiency evaluation rules can not succeed in this work any more. Fortunately, thanks to Culler's detail discuss in LogP model about interconnection networks for MPP systems, we present a system of efficiency evaluation rules for parallel computations under workstation cluster with PVM3.0 parallel software framework in this paper. These rules can satisfy above acquirements successfully. At last, two typical synchronous,and asynchronous applications are designed to verify the validity of these rules under 4 SGIs workstations cluster connected by Ethernet.
文摘In this paper,a 4th order parallel computation method with four processes for solving ODEs is discussed.This method is the Runge-Kutta method combined with a linear multistep method,which overcomes the difficulties of the 4th order parallel Runge-Kutta method discussed in [1].The concept of critical speedup for parallel methods is also defined,and speedups of some methods are analyzed by using this concept.
文摘Multicomputer systems(distributed memory computer systems) are becoming more and more popular and will be wildly used in scientific researches. In this paper, we present a parallel algorithm of Fourier Transform of a vector of complex numbers on multicomputer system and give its computing times and its speedup in parallel environment supported by EXPRESS system on the multicomputer system which consists of four SGI workstations. Our analysis shows that the results is ideal and this scheme is suitable to multicomputer systems.
文摘The rapid growth of interconnected high performance workstations has produced a new computing paradigm called clustered of workstations computing. In these systems load balance problem is a serious impediment to achieve good performance. The main concern of this paper is the implementation of dynamic load balancing algorithm, asynchronous Round Robin (ARR), for balancing workload of parallel tree computation depth-first-search algorithm on Cluster of Heterogeneous Workstations (COW) Many algorithms in artificial intelligence and other areas of computer science are based on depth first search in implicitty defined trees. For these algorithms a load-balancing scheme is required, which is able to evenly distribute parts of an irregularly shaped tree over the workstations with minimal interprocessor communication and without prior knowledge of the tree’s shape. For the (ARR) algorithm only minimal interprocessor communication is needed when necessary and it runs under the MPI (Message passing interface) that allows parallel execution on heterogeneous SUN cluster of workstation platform. The program code is written in C language and executed under UNIX operating system (Solaris version).
基金the Deanship of Scientific Research at King Abdulaziz University,Jeddah,Saudi Arabia under the Grant No.RG-12-611-43.
文摘The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memorysystems.However, MPI implementations can contain defects that impact the reliability and performance of parallelapplications. Detecting and correcting these defects is crucial, yet there is a lack of published models specificallydesigned for correctingMPI defects. To address this, we propose a model for detecting and correcting MPI defects(DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blockingpoint-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defectsaddressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and messagemismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a datasetconsisting of 40 MPI codes. The results indicate that the model achieved a detection rate of 37 out of 40 codes,resulting in an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI modelranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correctingdefects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. TheDC_MPImodel fills an important research gap and provides a valuable tool for improving the quality ofMPI-basedparallel computing systems.
基金supported by the fund from ShenyangMint Company Limited(No.20220056)Senior Talent Foundation of Jiangsu University(No.19JDG022)Taizhou City Double Innovation and Entrepreneurship Talent Program(No.Taizhou Human Resources Office[2022]No.22).
文摘In this research,we present the pure open multi-processing(OpenMP),pure message passing interface(MPI),and hybrid MPI/OpenMP parallel solvers within the dynamic explicit central difference algorithm for the coining process to address the challenge of capturing fine relief features of approximately 50 microns.Achieving such precision demands the utilization of at least 7 million tetrahedron elements,surpassing the capabilities of traditional serial programs previously developed.To mitigate data races when calculating internal forces,intermediate arrays are introduced within the OpenMP directive.This helps ensure proper synchronization and avoid conflicts during parallel execution.Additionally,in the MPI implementation,the coins are partitioned into the desired number of regions.This division allows for efficient distribution of computational tasks across multiple processes.Numerical simulation examples are conducted to compare the three solvers with serial programs,evaluating correctness,acceleration ratio,and parallel efficiency.The results reveal a relative error of approximately 0.3%in forming force among the parallel and serial solvers,while the predicted insufficient material zones align with experimental observations.Additionally,speedup ratio and parallel efficiency are assessed for the coining process simulation.The pureMPI parallel solver achieves a maximum acceleration of 9.5 on a single computer(utilizing 12 cores)and the hybrid solver exhibits a speedup ratio of 136 in a cluster(using 6 compute nodes and 12 cores per compute node),showing the strong scalability of the hybrid MPI/OpenMP programming model.This approach effectively meets the simulation requirements for commemorative coins with intricate relief patterns.
基金the National Key R&D Program of China(2020YFB1708300)the National Natural Science Foundation of China(52005192)the Project of Ministry of Industry and Information Technology(TC210804R-3).
文摘This paper aims to solve large-scale and complex isogeometric topology optimization problems that consumesignificant computational resources. A novel isogeometric topology optimization method with a hybrid parallelstrategy of CPU/GPU is proposed, while the hybrid parallel strategies for stiffness matrix assembly, equationsolving, sensitivity analysis, and design variable update are discussed in detail. To ensure the high efficiency ofCPU/GPU computing, a workload balancing strategy is presented for optimally distributing the workload betweenCPU and GPU. To illustrate the advantages of the proposedmethod, three benchmark examples are tested to verifythe hybrid parallel strategy in this paper. The results show that the efficiency of the hybrid method is faster thanserial CPU and parallel GPU, while the speedups can be up to two orders of magnitude.
基金supported by National Key S&T Special Projects of Marine Carbonate(No.2008ZX05000-004)CNPC Projects(No.2008E-0610-10)
文摘With the development of parallel computing technology,non-linear inversion calculation efficiency has been improving.However,for single-point search-based non-linear inversion methods,the implementation of parallel algorithms is a difficult issue.We introduce the idea of group search to the single-point search-based non-linear inversion algorithm, taking the quantum Monte Carlo method as an example for two-dimensional seismic wave velocity inversion and practical impedance inversion and test the calculation efficiency of using different node numbers.The results show the parallel algorithm in theoretical and practical data inversion is feasible and effective.The parallel algorithm has good versatility. The algorithm efficiency increases with increasing node numbers but the algorithm efficiency rate of increase gradually decreases as the node numbers increase.