This paper aims to solve large-scale, complex isogeometric topology optimization problems that consume significant computational resources. A novel isogeometric topology optimization method with a hybrid CPU/GPU parallel strategy is proposed, and the hybrid parallel strategies for stiffness matrix assembly, equation solving, sensitivity analysis, and design variable update are discussed in detail. To ensure the high efficiency of CPU/GPU computing, a workload balancing strategy is presented for optimally distributing the workload between the CPU and the GPU. To illustrate the advantages of the proposed method, three benchmark examples are tested to verify the hybrid parallel strategy. The results show that the hybrid method is faster than both the serial CPU and the parallel GPU implementations, with speedups of up to two orders of magnitude.
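The abstract does not say how the workload balancing strategy works. As a minimal sketch of the general idea only, the following assumes that each device's throughput has been measured on a trial batch and that work items (for example, elements in stiffness matrix assembly) are split in proportion to those rates. The function names and the numbers are hypothetical, not taken from the paper.

```python
def split_workload(n_items, cpu_rate, gpu_rate):
    """Split n_items between CPU and GPU in proportion to measured throughput.

    cpu_rate and gpu_rate are items processed per second (assumed to be
    measured beforehand). Returns (n_cpu, n_gpu) with n_cpu + n_gpu == n_items.
    """
    if n_items < 0 or cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("n_items must be >= 0 and rates must be positive")
    n_gpu = round(n_items * gpu_rate / (cpu_rate + gpu_rate))
    return n_items - n_gpu, n_gpu


def makespan(n_cpu, n_gpu, cpu_rate, gpu_rate):
    """Wall time when both devices run concurrently: the slower one decides."""
    return max(n_cpu / cpu_rate, n_gpu / gpu_rate)


# Hypothetical numbers: 1,000,000 elements, GPU nine times faster than the CPU.
n_cpu, n_gpu = split_workload(1_000_000, cpu_rate=1.0e4, gpu_rate=9.0e4)
balanced_time = makespan(n_cpu, n_gpu, 1.0e4, 9.0e4)
gpu_only_time = makespan(0, 1_000_000, 1.0e4, 9.0e4)
```

With a proportional split both devices finish at the same time, so neither sits idle; in this toy case the balanced run is shorter than sending everything to the GPU.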
Accurate three-dimensional (3-D) reconstruction technology for nondestructive testing based on digital radiography (DR) is of great importance for alleviating the drawbacks of the existing computed tomography (CT)-based method. The commonly used Monte Carlo simulation method produces high-quality imaging results for DR. However, for 3-D reconstruction, it is limited by its high time consumption. To solve this problem, this study proposes a parallel computing method to accelerate Monte Carlo simulation of projection images with a parallel interface and a specific DR application. The images are then used for 3-D reconstruction of the test model. We verify the accuracy of parallel computing for DR and evaluate the performance of two parallel computing modes, multithreaded applications (G4-MT) and message-passing interfaces (G4-MPI), by assessing parallel speedup and efficiency. This study also explores the scalability of the hybrid G4-MPI and G4-MT mode. The results show that both parallel computing modes significantly reduce the Monte Carlo simulation time: the parallel speedup grows approximately linearly, and the parallel efficiency is maintained at a high level. The hybrid mode has strong scalability: the overall run time of the 180 simulations using 320 threads is 15.35 h with 10 billion particles emitted, and the parallel speedup reaches up to 151.36. The 3-D reconstruction of the model is achieved with the filtered back projection (FBP) algorithm using 180 projection images obtained with the hybrid G4-MPI and G4-MT mode. The quality of the reconstructed slice images is satisfactory because the images reflect the internal structure of the test model. The method is also applied to a complex model, and the quality of the reconstructed images is evaluated.
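For readers unfamiliar with the two metrics, here is a small sketch using the textbook definitions, speedup S = T_serial / T_parallel and efficiency E = S / p. The abstract does not state which baseline or worker count its figures refer to, so applying these definitions to the reported 151.36 and 320 threads is an assumption made purely for illustration.

```python
def speedup(t_serial, t_parallel):
    """Classic parallel speedup: serial run time over parallel run time."""
    return t_serial / t_parallel


def efficiency(s, n_workers):
    """Parallel efficiency: speedup per worker (1.0 means ideal scaling)."""
    return s / n_workers


# Figures quoted in the abstract; how they relate to each other is assumed.
reported_speedup = 151.36
threads = 320
run_time_h = 15.35

eff = efficiency(reported_speedup, threads)          # speedup per thread
implied_serial_h = reported_speedup * run_time_h     # serial time implied by S and T_parallel
```

Under these assumptions the reported speedup corresponds to an efficiency of about 0.47 per thread and an implied serial run time of roughly 2,323 hours.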
A computational fluid dynamics (CFD) solver for a GPU/CPU heterogeneous parallel computing platform is developed to simulate incompressible flows on billion-level grid points. To solve the Poisson equation, the conjugate gradient method is used as the basic solver, and a Chebyshev method combined with a Jacobi sub-preconditioner is used as the preconditioner. The developed CFD solver shows good parallel efficiency, which exceeds 90% in the weak-scalability test when the number of grid points allocated to each GPU card is greater than 208³. In the acceleration test, running a simulation with 1040³ grid points on 125 GPU cards is 203.6 times faster than running it on the same number of CPU cores. The developed solver is then tested on a two-dimensional lid-driven cavity flow and a three-dimensional Taylor-Green vortex flow. The results are consistent with previous results in the literature.
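To show where a preconditioner enters the conjugate gradient iteration, the sketch below solves a small 1D Poisson system with a plain Jacobi (diagonal) preconditioner in pure Python. It is a simplification: the solver described above uses a Chebyshev method with a Jacobi sub-preconditioner on GPUs, which is not reproduced here, and the matrix, sizes, and function names are illustrative assumptions.

```python
def poisson_1d_matvec(x):
    """y = A x for the 1D Poisson matrix tridiag(-1, 2, -1)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def pcg(matvec, b, inv_diag, tol=1e-10, max_iter=1000):
    """Conjugate gradient with a Jacobi preconditioner M^-1 = diag(inv_diag)."""
    x = [0.0] * len(b)
    r = list(b)                                   # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # apply the preconditioner
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for it in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


n = 50
b = [1.0] * n
x, iterations = pcg(poisson_1d_matvec, b, inv_diag=[0.5] * n)
residual = max(abs(bi - yi) for bi, yi in zip(b, poisson_1d_matvec(x)))
```

For this constant-diagonal matrix the Jacobi preconditioner is only a uniform scaling, so it does not reduce the iteration count; it is kept to mark the step where a stronger preconditioner such as a Chebyshev polynomial would be applied.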
Due to the inherently insecure nature of the Internet, it is crucial to ensure the secure transmission of image data over this network. Additionally, given the limitations of computers, it becomes even more important to employ efficient and fast image encryption techniques. While 1D chaotic maps offer a practical approach to real-time image encryption, their limited flexibility and increased vulnerability restrict their practical application. In this research, we use a 3D Hindmarsh-Rose model to construct a secure cryptosystem. The randomness of the chaotic map is assessed through standard analysis. The proposed system enhances security by incorporating an increased number of system parameters and a wide range of chaotic parameters, as well as ensuring a uniform distribution of chaotic signals across the entire value space. Additionally, a fast image encryption technique using the new chaotic system is proposed. The novelty of the approach is confirmed through time complexity analysis. To further strengthen the resistance against cryptanalysis and differential attacks, the SHA-256 algorithm is employed for secure key generation. Experimental results over a number of parameters demonstrate the strong cryptographic performance of the proposed image encryption approach, highlighting its suitability for secure communication. Moreover, the security of the proposed scheme has been compared with state-of-the-art image encryption schemes, and all comparison metrics indicate the superior performance of the proposed scheme.
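The abstract does not describe how the SHA-256 output is turned into keys. One common pattern in chaos-based image ciphers, shown here only as an assumed illustration, is to hash the plain image (optionally with a secret) and map parts of the digest to the initial state of the chaotic system. The function name, the three-variable state, and the mapping to [0, 1) are hypothetical.

```python
import hashlib


def derive_initial_state(image_bytes, secret=b""):
    """Map a SHA-256 digest to three floats in [0, 1).

    Assumption: the three values would seed the three state variables of a
    3D chaotic system. The real scheme's key schedule is not given in the
    abstract.
    """
    digest = hashlib.sha256(secret + image_bytes).digest()   # 32 bytes
    state = []
    for k in range(3):
        word = int.from_bytes(digest[8 * k: 8 * k + 8], "big")
        # Keep the top 53 bits so the quotient is exact and strictly below 1.
        state.append((word >> 11) / 2**53)
    return tuple(state)


# Two hypothetical 16-byte "images" that differ in a single bit.
state_a = derive_initial_state(b"\x00" * 16)
state_b = derive_initial_state(b"\x00" * 15 + b"\x01")
```

Because the digest depends on every bit of the input, a one-bit change in the image yields an unrelated initial state, which is the property that makes such schemes sensitive to the plaintext.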
To complete a complex computation task efficiently in edge computing, the task should be decomposed into sub-computation tasks that run in parallel. The Wireless Sensor Network (WSN) is a typical application of parallel computation. To achieve highly reliable parallel computation in a wireless sensor network, the network's lifetime needs to be extended. Therefore, a proper task allocation strategy is needed to reduce energy consumption and balance the load of the network. This paper proposes a task model and a cluster-based WSN model in edge computing. In our model, different tasks require different types of resources and different sensors provide different types of resources, so the model is heterogeneous, which makes it more practical. We then propose a task allocation algorithm that combines the Genetic Algorithm (GA) and the Ant Colony Optimization (ACO) algorithm. The algorithm concentrates on energy conservation and load balancing so that the lifetime of the network can be extended. The experimental results show the algorithm's effectiveness and advantages in energy conservation and load balancing.
Conventional gradient-based full waveform inversion (FWI) is a local optimization that is highly dependent on the initial model and prone to being trapped in local minima. Globally optimal FWI, which can overcome this limitation, is particularly attractive but is currently limited by its huge computational cost. In this paper, we propose a globally optimal FWI framework based on GPU parallel computing, which greatly improves the efficiency and is expected to make globally optimal FWI more widely used. In this framework, we simplify and recombine the model parameters and optimize the model iteratively. Each iteration contains hundreds of individuals; each individual is independent of the others and involves forward modeling and a cost function calculation. The framework is suitable for a variety of global optimization algorithms, and we test it with the particle swarm optimization algorithm as an example. Both the synthetic and field examples achieve good results, indicating the effectiveness of the framework.
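The structure described above (many independent individuals per iteration, each requiring a cost evaluation) can be sketched with a basic particle swarm optimizer. This is a generic, pure-Python illustration: the toy cost function stands in for forward modeling plus misfit calculation, and the parameter values (inertia 0.7, acceleration coefficients 1.5) are common defaults rather than values from the paper.

```python
import random


def pso(cost, dim, bounds, n_particles=30, n_iter=100, seed=0,
        w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimization over a box-bounded search space."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]
    for _ in range(n_iter):
        # Each particle's cost is independent of the others, so this list
        # comprehension is the step that would be distributed over GPU threads.
        costs = [cost(p) for p in pos]
        for i, c in enumerate(costs):
            if c < best_cost[i]:
                best_cost[i], best_pos[i] = c, pos[i][:]
            if c < g_cost:
                g_cost, g_pos = c, pos[i][:]
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
    return g_pos, g_cost


def toy_misfit(m):
    """Stand-in for forward modeling + cost; its minimum is at (1, -2)."""
    return (m[0] - 1.0) ** 2 + (m[1] + 2.0) ** 2


model, misfit = pso(toy_misfit, dim=2, bounds=(-5.0, 5.0))
```

In a real FWI setting each call to the cost function is a full wave simulation, which is why evaluating the individuals concurrently dominates the achievable speedup.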
In this paper, we investigate a vehicular fog computing system and develop an effective parallel offloading scheme. The service time, which accounts for task offloading delay, task decomposition, and handover cost, is adopted as the metric of offloading performance. We propose an available-resource-aware parallel offloading scheme in which the roadside unit (RSU) selects the target fog nodes for computation offloading by jointly considering the effect of vehicle mobility and time-varying computation capability. Based on hidden Markov model and Markov chain theories, the proposed scheme effectively handles imperfect system state information in fog node selection by jointly achieving mobility awareness and computation perception. Simulation results are presented to corroborate the theoretical analysis and validate the effectiveness of the proposed algorithm.
The Spectral Statistical Interpolation (SSI) analysis system of NCEP is used to assimilate meteorological data from the Global Positioning Satellite System (GPS/MET) refraction angles with the variational technique. Verification against radiosonde data shows that including GPS/MET observations in the analysis yields an overall improvement in the analysis variables of temperature, winds, and water vapor. However, the variational model with the ray-tracing method is quite expensive for numerical weather prediction and climate research. For example, about 4000 GPS/MET refraction angles need to be assimilated to produce an ideal global analysis, and just one iteration of minimization takes more than 24 hours of CPU time on NCEP's Cray C90 computer. Although efforts have been made to reduce the computational cost, it is still prohibitive for operational data assimilation. In this paper, a parallel version of the three-dimensional variational data assimilation model of GPS/MET occultation measurements, suitable for massively parallel processor architectures, is developed. The divide-and-conquer strategy is used to achieve parallelism and is implemented by message passing. The authors present the principles of the code's design and examine its performance on state-of-the-art parallel computers in China. The results show that this parallel model scales favorably as the number of processors is increased. With the Memory-IO technique implemented by the authors, the wall clock time per iteration for assimilating 1420 refraction angles is reduced from 45 s to 12 s using 1420 processors. This suggests that the new parallelized code has the potential to be useful in numerical weather prediction (NWP) and climate studies.
The flexibility of traditional image processing systems is limited because such systems are designed for specific applications. In this paper, a new TMS320C64x-based multi-DSP parallel computing architecture is presented. It has many promising characteristics, such as powerful computing capability, broad I/O bandwidth, topology flexibility, and expansibility. The performance of the parallel system is evaluated by practical experiment.
The vertex solution for estimating the static displacement bounds of structures with uncertain-but-bounded parameters is studied in this paper. For the linear static problem, when there are uncertain interval parameters in the stiffness matrix and the vector of applied forces, the static response may be an interval. Based on interval operations, the interval solution obtained by the vertex solution is more accurate and more credible than that of other methods (such as the perturbation method). However, the vertex solution method with traditional serial computing usually requires a large computational effort, especially for large structures. To avoid the disadvantages of heavy calculation and long run time, a parallel implementation that can be used in large-scale computing is presented in this paper. Two kinds of parallel computing algorithms are proposed based on the vertex solution. The parallel computing approach can solve many interval problems that cannot be resolved by traditional interval analysis methods.
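The vertex method itself is simple to state: evaluate the response at every combination of interval endpoints and take the smallest and largest results. The sketch below applies it to a hypothetical two-spring system; the response function and all numbers are illustrative assumptions, and the two parallel algorithms proposed in the paper are not reproduced. The bounds are exact only when the response is monotonic in each parameter, which holds for this toy case.

```python
from itertools import product


def vertex_bounds(response, intervals):
    """Bound a response by evaluating it at all 2**n interval endpoints."""
    lo = hi = None
    for vertex in product(*intervals):      # every combination of endpoints
        u = response(*vertex)
        lo = u if lo is None else min(lo, u)
        hi = u if hi is None else max(hi, u)
    return lo, hi


def series_springs(k1, k2, force):
    """End displacement of two springs in series under an axial force."""
    return force / k1 + force / k2


# Hypothetical interval parameters: two stiffnesses and the applied force.
u_min, u_max = vertex_bounds(
    series_springs,
    [(90.0, 110.0), (180.0, 220.0), (9.0, 11.0)],
)
```

Each vertex evaluation is an independent deterministic solve, so the 2**n evaluations can be distributed across processors with no communication until the final min/max reduction, which is what makes the method a natural candidate for parallel computing.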
In this paper, we introduce several ongoing research projects to support parallel and distributed computing on heterogeneous networks of workstations (NOW) in the High Performance Computing and Software Laboratory at the University of Texas at San Antonio. The projects aim to address three technical issues. First, heterogeneity and time-sharing effects make traditional performance models and metrics for homogeneous computing unsuitable for measuring and evaluating heterogeneous computing. We develop practical models and metrics that quantify the heterogeneity of networks and characterize the performance effects. Second, special system support is necessary to perform parallel computation effectively. We are developing system schemes for heterogeneity management, process scheduling, and efficient communication. Finally, to provide insight into system performance, we are developing two types of supporting tools: a graphical instrumentation monitor to aid users in investigating performance problems and in determining the most effective way of exploiting NOW systems, and a trace-driven simulator to test and compare different system management and scheduling schemes.
In this paper, we propose a parallel computing technique for content-based image retrieval (CBIR) systems. The technique is mainly intended for a single node with a multi-core processor, which distinguishes it from techniques based on cluster or network computing architectures. Because of its specialized applications (such as medical image processing) and its demanding hardware resource requirements, the CBIR system has been prevented from being widely used. With the increasing volume of image databases, the widespread use of multi-core processors, and the requirements on retrieval accuracy and speed, a retrieval strategy based on multi-core processors is needed to make retrieval faster and more convenient than before. Experimental results demonstrate that this parallel architecture can significantly improve the performance of the retrieval system. In addition, we propose an efficient parallel technique that combines the cluster and multi-core techniques, which is intended to fit the new trend of cloud computing.
The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memory systems. However, MPI implementations can contain defects that impact the reliability and performance of parallel applications. Detecting and correcting these defects is crucial, yet there is a lack of published models specifically designed for correcting MPI defects. To address this, we propose a model for detecting and correcting MPI defects (DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blocking point-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defects addressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and message mismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a dataset consisting of 40 MPI codes. The results indicate that the model detected defects in 37 out of 40 codes, for an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI model ranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correcting defects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. The DC_MPI model fills an important research gap and provides a valuable tool for improving the quality of MPI-based parallel computing systems.
In this research, we present pure open multi-processing (OpenMP), pure message passing interface (MPI), and hybrid MPI/OpenMP parallel solvers within the dynamic explicit central difference algorithm for the coining process, to address the challenge of capturing fine relief features of approximately 50 microns. Achieving such precision demands at least 7 million tetrahedron elements, which surpasses the capabilities of the traditional serial programs developed previously. To mitigate data races when calculating internal forces, intermediate arrays are introduced within the OpenMP directive. This ensures proper synchronization and avoids conflicts during parallel execution. In the MPI implementation, the coins are partitioned into the desired number of regions, which allows computational tasks to be distributed efficiently across multiple processes. Numerical simulation examples are conducted to compare the three solvers with serial programs in terms of correctness, acceleration ratio, and parallel efficiency. The results reveal a relative error of approximately 0.3% in forming force between the parallel and serial solvers, while the predicted insufficient material zones align with experimental observations. The pure MPI parallel solver achieves a maximum acceleration of 9.5 on a single computer (using 12 cores), and the hybrid solver exhibits a speedup ratio of 136 on a cluster (using 6 compute nodes with 12 cores per node), showing the strong scalability of the hybrid MPI/OpenMP programming model. This approach effectively meets the simulation requirements for commemorative coins with intricate relief patterns.
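The use of intermediate arrays to avoid data races can be illustrated with the usual privatize-then-reduce pattern: each worker accumulates element forces into its own array, and the arrays are summed afterwards. The sketch below uses Python threads only to show the structure; the paper's solver uses OpenMP in compiled code, and the element kernel, data layout, and names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def element_forces(element):
    """Hypothetical element kernel: split a value equally over the element's nodes."""
    nodes, value = element
    return [(node, value / len(nodes)) for node in nodes]


def partial_assembly(elements, n_nodes):
    """Accumulate one chunk of elements into a private array (nothing shared)."""
    local = [0.0] * n_nodes
    for element in elements:
        for node, force in element_forces(element):
            local[node] += force
    return local


def assemble(elements, n_nodes, n_workers=4):
    """Assemble nodal forces without concurrent writes to a shared array."""
    chunks = [elements[w::n_workers] for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: partial_assembly(c, n_nodes), chunks))
    # Reduction step: sum the private arrays node by node.
    return [sum(column) for column in zip(*partials)]


# Three hypothetical four-node elements that share nodes with their neighbors.
elements = [((0, 1, 2, 3), 4.0), ((1, 2, 3, 4), 8.0), ((2, 3, 4, 5), 4.0)]
forces = assemble(elements, n_nodes=6)
```

Neighboring elements share nodes, so writing directly into one global force array from several threads would race; private arrays trade extra memory for race-free accumulation. Python's global interpreter lock means this version shows the pattern rather than a speedup.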
Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. The users, who are domain experts rather than computer experts, prefer programming models closer to their domains (e.g., physics and biology) rather than MPI and OpenMP. This has led to the development of domain-specific programming that provides domain-specific programming interfaces but abstracts away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the 5th fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.
In this paper, the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massively parallel computer made by the National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. Potential ways to increase the speedup ratio and exploit more resources of future massively parallel supercomputers are also discussed.
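A two-dimensional domain decomposition assigns each process a rectangular block of the horizontal grid. The sketch below computes such blocks for a process grid; the grid size, process layout, and rank numbering are hypothetical and are not taken from the model described above.

```python
def split_range(n, parts):
    """Split range(n) into contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def decompose_2d(n_lon, n_lat, p_lon, p_lat):
    """Map each rank of a p_lon x p_lat process grid to its block of grid points.

    Returns {rank: ((lon_start, lon_stop), (lat_start, lat_stop))} with
    half-open index ranges and row-major rank numbering (an assumption).
    """
    lon_chunks = split_range(n_lon, p_lon)
    lat_chunks = split_range(n_lat, p_lat)
    blocks = {}
    for j, lat in enumerate(lat_chunks):
        for i, lon in enumerate(lon_chunks):
            blocks[j * p_lon + i] = (lon, lat)
    return blocks


# Hypothetical sizes: a 72 x 46 horizontal grid on a 4 x 2 process grid.
blocks = decompose_2d(72, 46, 4, 2)
covered = sum((lon[1] - lon[0]) * (lat[1] - lat[0]) for lon, lat in blocks.values())
```

Compared with splitting along one direction only, a two-dimensional split keeps blocks closer to square, which shortens the boundaries that neighboring processes must exchange at each time step.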
To realize remote mining of nonferrous metal deposits with a mobile robot in an unknown environment, parallel evolutionary computing and three-tier load balancing are proposed to overcome the efficiency problem of online evolutionary computing. A polar coordinate system can be established on the remote mining robot, with the pole at the current position and the polar axis pointing from the current position to the goal point. With this polar coordinate system, path planning for the remote mining robot can be computed in parallel. Simulations and analysis based on agent techniques show that good computing quality, in terms of efficiency, optimization, and robustness, can be guaranteed for the remote mining robot.
A parallel finite element method using the domain decomposition technique is adapted to the distributed parallel environment of a workstation cluster. An algorithm is presented for parallelizing the preconditioned conjugate gradient method based on domain decomposition. Using the developed code, a dam structural analysis problem is solved on the workstation cluster and the results are given. The parallel performance is analyzed.
Funding (isogeometric topology optimization study): the National Key R&D Program of China (2020YFB1708300), the National Natural Science Foundation of China (52005192), and the Project of Ministry of Industry and Information Technology (TC210804R-3).
Funding (DR Monte Carlo reconstruction study): the China Natural Science Fund (No. 52171253) and the Natural Science Foundation of Sichuan (No. 2022NSFSCO949).
Funding (GPU/CPU CFD solver study): supported by the National Natural Science Foundation of China (NSFC) Basic Science Center Program for "Multiscale Problems in Nonlinear Mechanics" (Grant No. 11988102) and an NSFC project (Grant No. 11972038).
Funding (chaotic image encryption study): the Deanship of Scientific Research at Najran University funded this work under the Research Groups Funding Program, Grant Code NU/RG/SERC/12/3.
Funding (WSN task allocation study): supported by the Postdoctoral Science Foundation of China (No. 2021M702441) and the National Natural Science Foundation of China (No. 61871283).
Funding: Supported in part by the National Natural Science Foundation of China under Grant 61971077 and Grant 61901066, in part by the Chongqing Science and Technology Commission under Grant cstc2019jcyj-msxmX0575, and in part by the Program for Innovation Team Building at colleges and universities in Chongqing, China under Grant CXTDX201601006.
Abstract: In this paper, we investigate a vehicular fog computing system and develop an effective parallel offloading scheme. The service time, which accounts for task offloading delay, task decomposition, and handover cost, is adopted as the metric of offloading performance. We propose an available-resource-aware parallel offloading scheme in which the RSU selects the target fog nodes for computation offloading by jointly considering the effect of vehicle mobility and time-varying computation capability. Based on hidden Markov model and Markov chain theory, the proposed scheme effectively handles imperfect system state information in fog node selection by jointly achieving mobility awareness and computation perception. Simulation results are presented to corroborate the theoretical analysis and validate the effectiveness of the proposed algorithm.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 40221503 and the China National Key Programme for Development Basic Sciences (Abbreviation: 973 Project, Grant No. G1999032801).
Abstract: The Spectral Statistical Interpolation (SSI) analysis system of NCEP is used to assimilate meteorological data from the Global Positioning Satellite System (GPS/MET) refraction angles with the variational technique. As verified by radiosonde data, including GPS/MET observations in the analysis yields an overall improvement in the analysis variables of temperature, winds, and water vapor. However, the variational model with the ray-tracing method is quite expensive for numerical weather prediction and climate research. For example, about 4000 GPS/MET refraction angles need to be assimilated to produce an ideal global analysis, and just one iteration of minimization takes more than 24 hours of CPU time on NCEP's Cray C90 computer. Although efforts have been made to reduce the computational cost, it is still prohibitive for operational data assimilation. In this paper, a parallel version of the three-dimensional variational data assimilation model for GPS/MET occultation measurements, suitable for massively parallel processor architectures, is developed. The divide-and-conquer strategy is used to achieve parallelism and is implemented by message passing. The authors present the principles of the code's design and examine its performance on state-of-the-art parallel computers in China. The results show that this parallel model scales favorably as the number of processors is increased. With the Memory-IO technique implemented by the author, the wall clock time per iteration for assimilating 1420 refraction angles is reduced from 45 s to 12 s using 1420 processors. This suggests that the new parallelized code has the potential to be useful in numerical weather prediction (NWP) and climate studies.
Funding: This project was supported by the National Natural Science Foundation of China (60135020).
Abstract: The flexibility of traditional image processing systems is limited because such systems are designed for specific applications. In this paper, a new TMS320C64x-based multi-DSP parallel computing architecture is presented. It has many promising characteristics, such as powerful computing capability, broad I/O bandwidth, topology flexibility, and expansibility. The performance of the parallel system is evaluated by practical experiment.
Funding: Supported by the National Outstanding Youth Science Foundation of China (No. 10425208), the 111 Project (No. B07009), and the FanZhou Science and Research Foundation for Young Scholars (No. 20080503).
Abstract: The vertex solution for estimating the static displacement bounds of structures with uncertain-but-bounded parameters is studied in this paper. For the linear static problem, when there are uncertain interval parameters in the stiffness matrix and the vector of applied forces, the static response may be an interval. Based on interval operations, the interval solution obtained by the vertex solution is more accurate and more credible than that of other methods (such as the perturbation method). However, the vertex solution method with traditional serial computing usually requires a large computational effort, especially for large structures. To avoid this heavy computation and long runtime, a parallel computing approach that can be used in large-scale computing is presented in this paper. Two kinds of parallel computing algorithms are proposed based on the vertex solution. The parallel computing approach can solve many interval problems that cannot be resolved by traditional interval analysis methods.
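The idea behind the vertex solution in the abstract above can be sketched on a toy problem: evaluate the static response at every combination of interval endpoints and take the componentwise minimum and maximum. The two-spring system, the function names, and the thread pool below are assumptions chosen for illustration; they are not the two parallel algorithms proposed in the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def solve_two_spring(params):
    """Solve K u = f for two springs in series (hypothetical toy structure).

    K = [[k1 + k2, -k2], [-k2, k2]] and f = [0, force]; a 2x2 system is
    small enough to solve with Cramer's rule.
    """
    k1, k2, force = params
    a, b, c, d = k1 + k2, -k2, -k2, k2
    det = a * d - b * c
    u1 = (0.0 * d - b * force) / det
    u2 = (a * force - c * 0.0) / det
    return u1, u2


def vertex_bounds(intervals, workers=4):
    """Bound the displacements by solving at all 2**n interval vertices."""
    vertices = list(product(*intervals))  # every endpoint combination
    # The solves are independent of one another, so they can be farmed out.
    # Threads only illustrate the structure; a real code would more likely
    # use processes or MPI ranks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(solve_two_spring, vertices))
    lower = tuple(min(r[i] for r in results) for i in range(2))
    upper = tuple(max(r[i] for r in results) for i in range(2))
    return lower, upper
```

With `intervals = [(90.0, 110.0), (180.0, 220.0), (9.0, 11.0)]` for `k1`, `k2`, and the force, the sketch solves 8 systems. The number of vertices doubles with each additional interval parameter, which is why the serial cost grows quickly for large structures.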
Abstract: In this paper, we introduce several ongoing research projects to support parallel and distributed computing on heterogeneous networks of workstations (NOW) in the High Performance Computing and Software Laboratory at the University of Texas at San Antonio. The projects aim at addressing three technical issues. First, the factors of heterogeneity and time-sharing effects make traditional performance models and metrics for homogeneous computing performance measurement and evaluation unsuitable for heterogeneous computing. We develop practical models and metrics that quantify the heterogeneity of networks and characterize the performance effects. Second, in order to perform parallel computation effectively, special system support is necessary. We are developing system schemes for heterogeneity management, process scheduling, and efficient communications. Finally, to provide insight into system performance, we are developing two types of supporting tools: a graphical instrumentation monitor to aid users in investigating performance problems and in determining the most effective way of exploiting NOW systems, and a trace-driven simulator to test and compare different system management and scheduling schemes.
Funding: Supported by the Natural Science Foundation of Shanghai (Grant No. 08ZR1408200), the Shanghai Leading Academic Discipline Project (Grant No. J50103), and the Open Project Program of the National Laboratory of Pattern Recognition.
Abstract: In this paper, we propose a parallel computing technique for content-based image retrieval (CBIR) systems. The technique mainly targets a single node with a multi-core processor, which distinguishes it from techniques based on cluster or network computing architectures. Because of its specific applications (such as medical image processing) and its demanding hardware resource requirements, the CBIR system has been prevented from being widely used. With the increasing volume of image databases, the widespread use of multi-core processors, and the requirements on retrieval accuracy and speed, a retrieval strategy based on multi-core processors is needed to make retrieval faster and more convenient than before. Experimental results demonstrate that this parallel architecture can significantly improve the performance of the retrieval system. In addition, we propose an efficient parallel technique that combines cluster and multi-core techniques, which is intended to fit the emerging trend of cloud computing.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 50921001), National Key Basic Research Special Foundation of China (Grant No. 2010CB832704), Scientific Project for High-tech Ships: Key Technical Research on the Semi-planning Hybrid Fore-body Trimaran, Doctoral Research Foundation of Liaoning Province (Grant No. 20091012).
Funding: Supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2013ZX06002001-007), the National Key Scientific Instrument and Equipment Development Projects, China (No. 2012YQ180118), and the National Natural Science Foundation of China (Nos. 11275110, 11075091 and 11105081).
Funding: The Deanship of Scientific Research at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. RG-12-611-43.
Abstract: The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memory systems. However, MPI implementations can contain defects that impact the reliability and performance of parallel applications. Detecting and correcting these defects is crucial, yet there is a lack of published models specifically designed for correcting MPI defects. To address this, we propose a model for detecting and correcting MPI defects (DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blocking point-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defects addressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and message mismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a dataset consisting of 40 MPI codes. The results indicate that the model achieved a detection rate of 37 out of 40 codes, resulting in an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI model ranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correcting defects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. The DC_MPI model fills an important research gap and provides a valuable tool for improving the quality of MPI-based parallel computing systems.
Funding: Supported by the fund from Shenyang Mint Company Limited (No. 20220056), the Senior Talent Foundation of Jiangsu University (No. 19JDG022), and the Taizhou City Double Innovation and Entrepreneurship Talent Program (No. Taizhou Human Resources Office [2022] No. 22).
Abstract: In this research, we present the pure open multi-processing (OpenMP), pure message passing interface (MPI), and hybrid MPI/OpenMP parallel solvers within the dynamic explicit central difference algorithm for the coining process to address the challenge of capturing fine relief features of approximately 50 microns. Achieving such precision demands the utilization of at least 7 million tetrahedron elements, surpassing the capabilities of traditional serial programs previously developed. To mitigate data races when calculating internal forces, intermediate arrays are introduced within the OpenMP directive. This helps ensure proper synchronization and avoid conflicts during parallel execution. Additionally, in the MPI implementation, the coins are partitioned into the desired number of regions. This division allows for efficient distribution of computational tasks across multiple processes. Numerical simulation examples are conducted to compare the three solvers with serial programs, evaluating correctness, acceleration ratio, and parallel efficiency. The results reveal a relative error of approximately 0.3% in forming force among the parallel and serial solvers, while the predicted insufficient material zones align with experimental observations. Additionally, speedup ratio and parallel efficiency are assessed for the coining process simulation. The pure MPI parallel solver achieves a maximum acceleration of 9.5 on a single computer (utilizing 12 cores) and the hybrid solver exhibits a speedup ratio of 136 in a cluster (using 6 compute nodes and 12 cores per compute node), showing the strong scalability of the hybrid MPI/OpenMP programming model. This approach effectively meets the simulation requirements for commemorative coins with intricate relief patterns.
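The speedup ratio and parallel efficiency mentioned in the abstract above are conventionally defined as serial time over parallel time, and speedup over the number of cores. The abstract does not state the exact definitions it uses, so the helper names and formulas in this short sketch are assumptions based on the standard convention.

```python
def speedup(serial_time, parallel_time):
    """Speedup ratio: how many times faster the parallel run is."""
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return serial_time / parallel_time


def parallel_efficiency(speedup_ratio, n_cores):
    """Parallel efficiency: speedup divided by the number of cores used."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return speedup_ratio / n_cores
```

Under these definitions, the reported speedup of 9.5 on 12 cores for the pure MPI solver corresponds to `parallel_efficiency(9.5, 12)`, or roughly 0.79.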
Funding: Project (61170049) supported by the National Natural Science Foundation of China; Project (2012AA010903) supported by the National High Technology Research and Development Program of China.
Abstract: Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. The users, who are domain experts rather than computer experts, prefer to use programming models closer to their domains (e.g., physics and biology) rather than MPI and OpenMP. This has led to the development of domain-specific programming that provides domain-specific programming interfaces but abstracts away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the 5th fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.
Abstract: In this paper, the parallel computing of a grid-point nine-level atmospheric general circulation model on the Dawn 1000 is introduced. The model was developed by the Institute of Atmospheric Physics (IAP), Chinese Academy of Sciences (CAS). The Dawn 1000 is a MIMD massively parallel computer made by the National Research Center for Intelligent Computer (NCIC), CAS. A two-dimensional domain decomposition method is adopted to perform the parallel computing. Potential ways to increase the speed-up ratio and exploit more resources of future massively parallel supercomputers are also discussed.
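A two-dimensional domain decomposition of the kind named in the abstract above assigns each process a rectangular block of the horizontal grid. The sketch below shows one common way to compute the block index ranges; the function names, the row-major rank layout, and the rule for spreading leftover grid points are assumptions, not details taken from the model.

```python
def split_1d(n, parts, idx):
    """Return the half-open index range [start, stop) of block idx.

    The first (n % parts) blocks receive one extra point so that block
    sizes differ by at most one.
    """
    base, extra = divmod(n, parts)
    start = idx * base + min(idx, extra)
    stop = start + base + (1 if idx < extra else 0)
    return start, stop


def decompose_2d(nx, ny, px, py):
    """Map each rank of a px-by-py process grid to its (x, y) index ranges."""
    blocks = {}
    for rank in range(px * py):
        i, j = divmod(rank, py)  # row-major placement of ranks (assumed)
        blocks[rank] = (split_1d(nx, px, i), split_1d(ny, py, j))
    return blocks
```

For a 10 by 7 grid on a 2 by 3 process grid, `decompose_2d(10, 7, 2, 3)` gives rank 0 the ranges `((0, 5), (0, 3))` and rank 5 the ranges `((5, 10), (5, 7))`. Each process would then exchange boundary values with its neighbours in both directions at every time step.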
Abstract: To realize remote mining of nonferrous metal deposits with a mobile robot in an unknown environment, parallel evolutionary computing and a three-tier load balance are proposed to overcome the efficiency problem of online evolutionary computing. A polar coordinate system can be established on the remote mining robot, with the pole at the current position and the polar axis pointing from the current position to the goal point. With this polar coordinate system, path planning for the remote mining robot can be computed in parallel. Simulation results and analysis based on agent techniques show that good computing quality, in terms of efficiency, optimization, and robustness, can be guaranteed for the remote mining robot.
Funding: Project supported by the Key Project Science Foundation of Shanghai Municipal Commission of Education (Grant No. 03AZ03).
Abstract: A parallel finite element method using the domain decomposition technique is adapted to the distributed parallel environment of a workstation cluster. An algorithm is presented for parallelizing the preconditioned conjugate gradient method based on domain decomposition. Using the developed code, a dam structural analysis problem is solved on a workstation cluster and the results are given. The parallel performance is analyzed.
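For reference, a serial preconditioned conjugate gradient iteration of the kind the abstract above parallelizes can be sketched as follows. The Jacobi (diagonal) preconditioner, the dense list-of-lists matrix, and the function names are assumptions for illustration; the abstract does not say which preconditioner the authors use.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned CG for a symmetric positive definite A."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner
    x = [0.0] * n
    r = b[:]                                      # residual for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = z[:]
    rz = dot(r, z)
    for it in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, it
        # In a domain-decomposed code, this product is computed per
        # subdomain and the dot products become global reductions.
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Calling `pcg([[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]], [3.0, 2.0, 3.0])` solves a small system whose exact solution is all ones. The matrix-vector product and the two inner products per iteration are the operations that need communication between subdomains.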