This study embarks on a comprehensive examination of optimization techniques within GPU-based parallel programming models,pivotal for advancing high-performance computing(HPC).Emphasizing the transition of GPUs from g...This study embarks on a comprehensive examination of optimization techniques within GPU-based parallel programming models,pivotal for advancing high-performance computing(HPC).Emphasizing the transition of GPUs from graphic-centric processors to versatile computing units,it delves into the nuanced optimization of memory access,thread management,algorithmic design,and data structures.These optimizations are critical for exploiting the parallel processing capabilities of GPUs,addressingboth the theoretical frameworks and practical implementations.By integrating advanced strategies such as memory coalescing,dynamic scheduling,and parallel algorithmic transformations,this research aims to significantly elevate computational efficiency and throughput.The findings underscore the potential of optimized GPU programming to revolutionize computational tasks across various domains,highlighting a pathway towards achieving unparalleled processing power and efficiency in HPC environments.The paper not only contributes to the academic discourse on GPU optimization but also provides actionable insights for developers,fostering advancements in computational sciences and technology.展开更多
In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularl...In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP’s effectiveness in accelerating image manipulation tasks.展开更多
Web service is a grid computing technology that promises greater ease-of-use and interoperability than previous distributed computing technologies. This paper proposed Group Service Framework, a grid computing platfor...Web service is a grid computing technology that promises greater ease-of-use and interoperability than previous distributed computing technologies. This paper proposed Group Service Framework, a grid computing platform based on Microsoft. NET that use web service to: (1) locate and harness volunteer computing resources for different applications, and (2) support multi-models such as Master/Slave, Divide and Conquer, Phase Parallel and so forth parallel programming paradigms in Grid environment, (3) allocate data and balance load dynamically and transparently for grid computing application. The Grid Service Framework based on Microsoft. NET was used to implement several simple parallel computing applications. The results show that the proposed Group Service Framework is suitable for generic parallel numerical computing.展开更多
The workload of the 3D magnetotelluric forward modeling algorithm is so large that the traditional serial algorithm costs an extremely large compute time. However, the 3D forward modeling algorithm can process the dat...The workload of the 3D magnetotelluric forward modeling algorithm is so large that the traditional serial algorithm costs an extremely large compute time. However, the 3D forward modeling algorithm can process the data in the frequency domain, which is very suitable for parallel computation. With the advantage of MPI and based on an analysis of the flow of the 3D magnetotelluric serial forward algorithm, we suggest the idea of parallel computation and apply it. Three theoretical models are tested and the execution efficiency is compared in different situations. The results indicate that the parallel 3D forward modeling computation is correct and the efficiency is greatly improved. This method is suitable for large size geophysical computations.展开更多
Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible fo...Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for single threaded im-plementation to extract local feature descriptors for high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA’s Graphics Processing Unit (GPU). The parallel-ization design for SIFT on GPUs is divided into two stages, a) Algorithm de-sign-generic design strategies which focuses on data and b) Implementation de-sign-architecture specific design strategies which focuses on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches and data blocking achieve a significant decrease in aver-age computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization while optimizing for maximum occupancy allows GPU to execute memory bound SIFT algorithm at optimal levels.展开更多
Large scale optimization problems can only be solved in an efficient way, if their special structure is taken as the basis of algorithm design. In this paper we consider a very broad class of large-scale problems ...Large scale optimization problems can only be solved in an efficient way, if their special structure is taken as the basis of algorithm design. In this paper we consider a very broad class of large-scale problems with special structure, namely tree structured problems. We show how the exploitation of the structure leads to efficient decomposition algorithms and how it may be implemented in a parallel environment.展开更多
The increasing amount of sequences stored in genomic databases has become unfeasible to the sequential analysis. Then, the parallel computing brought its power to the Bioinformatics through parallel algorithms to alig...The increasing amount of sequences stored in genomic databases has become unfeasible to the sequential analysis. Then, the parallel computing brought its power to the Bioinformatics through parallel algorithms to align and analyze the sequences, providing improvements mainly in the running time of these algorithms. In many situations, the parallel strategy contributes to reducing the computational complexity of the big problems. This work shows some results obtained by an implementation of a parallel score estimating technique for the score matrix calculation stage, which is the first stage of a progressive multiple sequence alignment. The performance and quality of the parallel score estimating are compared with the results of a dynamic programming approach also implemented in parallel. This comparison shows a significant reduction of running time. Moreover, the quality of the final alignment, using the new strategy, is analyzed and compared with the quality of the approach with dynamic programming.展开更多
Modern power systems are evolving into sociotechnical systems with massive complexity, whose real-time operation and dispatch go beyond human capability. Thus,the need for developing and applying new intelligent power...Modern power systems are evolving into sociotechnical systems with massive complexity, whose real-time operation and dispatch go beyond human capability. Thus,the need for developing and applying new intelligent power system dispatch tools are of great practical significance. In this paper, we introduce the overall business model of power system dispatch, the top level design approach of an intelligent dispatch system, and the parallel intelligent technology with its dispatch applications. We expect that a new dispatch paradigm,namely the parallel dispatch, can be established by incorporating various intelligent technologies, especially the parallel intelligent technology, to enable secure operation of complex power grids,extend system operators' capabilities, suggest optimal dispatch strategies, and to provide decision-making recommendations according to power system operational goals.展开更多
The solution of tension distributions is infinite for cable-driven parallel manipulators(CDPMs) with redundant cables. A rapid optimization method for determining the optimal tension distribution is presented. The n...The solution of tension distributions is infinite for cable-driven parallel manipulators(CDPMs) with redundant cables. A rapid optimization method for determining the optimal tension distribution is presented. The new optimization method is primarily based on the geometry properties of a polyhedron and convex analysis. The computational efficiency of the optimization method is improved by the designed projection algorithm, and a fast algorithm is proposed to determine which two of the lines are intersected at the optimal point. Moreover, a method for avoiding the operating point on the lower tension limit is developed. Simulation experiments are implemented on a six degree-of-freedom(6-DOF) CDPM with eight cables, and the results indicate that the new method is one order of magnitude faster than the standard simplex method. The optimal distribution of tension distribution is thus rapidly established on real-time by the proposed method.展开更多
First, an asynchronous distributed parallel evolutionary modeling algorithm (PEMA) for building the model of system of ordinary differential equations for dynamical systems is proposed in this paper. Then a series of ...First, an asynchronous distributed parallel evolutionary modeling algorithm (PEMA) for building the model of system of ordinary differential equations for dynamical systems is proposed in this paper. Then a series of parallel experiments have been conducted to systematically test the influence of some important parallel control parameters on the performance of the algorithm. A lot of experimental results are obtained and we make some analysis and explanations to them.展开更多
In this paper they deal with the issue of specification and design of parallel communicatingprocesses. A trace-state based model is introduced to describe the behaviour of concurrent programs. They presenta formal sys...In this paper they deal with the issue of specification and design of parallel communicatingprocesses. A trace-state based model is introduced to describe the behaviour of concurrent programs. They presenta formal system based on that model to achieve hierarchical and modular development and verification methods. Anumber of refinement rules are used to decompose the specification into smaller ones and calculate program fromthe展开更多
Most modern technologies,such as social media,smart cities,and the internet of things(IoT),rely on big data.When big data is used in the real-world applications,two data challenges such as class overlap and class imba...Most modern technologies,such as social media,smart cities,and the internet of things(IoT),rely on big data.When big data is used in the real-world applications,two data challenges such as class overlap and class imbalance arises.When dealing with large datasets,most traditional classifiers are stuck in the local optimum problem.As a result,it’s necessary to look into new methods for dealing with large data collections.Several solutions have been proposed for overcoming this issue.The rapid growth of the available data threatens to limit the usefulness of many traditional methods.Methods such as oversampling and undersampling have shown great promises in addressing the issues of class imbalance.Among all of these techniques,Synthetic Minority Oversampling TechniquE(SMOTE)has produced the best results by generating synthetic samples for the minority class in creating a balanced dataset.The issue is that their practical applicability is restricted to problems involving tens of thousands or lower instances of each.In this paper,we have proposed a parallel mode method using SMOTE and MapReduce strategy,this distributes the operation of the algorithm among a group of computational nodes for addressing the aforementioned problem.Our proposed solution has been divided into three stages.Thefirst stage involves the process of splitting the data into different blocks using a mapping function,followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algo-rithm for solving the class imbalanced problem.On each map block,a decision tree model would be constructed.Finally,the decision tree blocks would be com-bined for creating a classification model.We have used numerous datasets with up to 4 million instances in our experiments for testing the proposed scheme’s cap-abilities.As a result,the Hybrid SMOTE appears to have good scalability within the framework proposed,and it also cuts down the processing time.展开更多
Cyber-physical systems (CPS) represent a class of complex engineered systems where functionality and behavior emerge through the interaction between the computational and physical domains. Simulation provides design e...Cyber-physical systems (CPS) represent a class of complex engineered systems where functionality and behavior emerge through the interaction between the computational and physical domains. Simulation provides design engineers with quick and accurate feedback on the behaviors generated by their designs. However, as systems become more complex, simulating their behaviors becomes computation all complex. But, most modern simulation environments still execute on a single thread, which does not take advantage of the processing power available on modern multi-core CPUs. This paper investigates methods to partition and simulate differential equation-based models of cyber-physical systems using multiple threads on multi-core CPUs that can share data across threads. We describe model partitioning methods using fixed step and variable step numerical in-tegration methods that consider the multi-layer cache structure of these CPUs to avoid simulation performance degradation due to cache conflicts. We study the effectiveness of each parallel simu-lation algorithm by calculating the relative speedup compared to a serial simulation applied to a series of large electric circuit models. We also develop a series of guidelines for maximizing performance when developing parallel simulation software intended for use on multi-core CPUs.展开更多
As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall,software developers are finding it hard to deal with the complexity of these systems.In...As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall,software developers are finding it hard to deal with the complexity of these systems.In this paper,we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000,which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization.To assist its software development,we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler.Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000,while the high-level model allows programmers to use the OpenCL programming standard.We detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal accelerators.Our programming models have been deployed in the production environment of an exascale prototype system.展开更多
The seismic reflection method is one of the most important methods in geophysical exploration.There are three stages in a seismic exploration survey:acquisition,processing,and interpretation.This paper focuses on a pr...The seismic reflection method is one of the most important methods in geophysical exploration.There are three stages in a seismic exploration survey:acquisition,processing,and interpretation.This paper focuses on a pre-processing tool,the Non-Local Means(NLM)filter algorithm,which is a powerful technique that can significantly suppress noise in seismic data.However,the domain of the NLM algorithm is the whole dataset and 3D seismic data being very large,often exceeding one terabyte(TB),it is impossible to store all the data in Random Access Memory(RAM).Furthermore,the NLM filter would require a considerably long runtime.These factors make a straightforward implementation of the NLM algorithm on real geophysical exploration data infeasible.This paper redesigned and implemented the NLM filter algorithm to fit the challenges of seismic exploration.The optimized implementation of the NLM filter is capable of processing production-size seismic data on modern clusters and is 87 times faster than the straightforward implementation of NLM.展开更多
本文用CORE-IAF(Coordinated Ocean-ice Reference Experiments–Interannual Forcing)外强迫场分别强迫LICOM3(LASG/IAP Climate System Ocean Model Version 3)和POP2(Parallel Ocean Program version 2)两个海洋模式,并分析了这两个...本文用CORE-IAF(Coordinated Ocean-ice Reference Experiments–Interannual Forcing)外强迫场分别强迫LICOM3(LASG/IAP Climate System Ocean Model Version 3)和POP2(Parallel Ocean Program version 2)两个海洋模式,并分析了这两个模式中太平洋北赤道逆流(NECC)的模拟结果。我们发现LICOM3和POP2模拟的NECC强度均弱于实测,这和Sun et al.(2019)的研究结果一致,也进一步证明了海洋模式中NECC偏弱是CORE-IAF外强迫场造成的,海表风应力及对应的风应力旋度是海洋模式准确模拟NECC的最主要因子。同时,我们也分析了NECC的模拟在动力机制上的差别,这里的动力强迫项包括风应力项、平流项和余项。我们发现模式的外强迫场虽然相同,但是两个模式中各动力强迫项(风应力项、平流项和余项)对NECC模拟的影响并不完全相同。展开更多
Numerical simulation of careful parallel arithmetic of oil resources migration-accumulation of Tanhai Region ( three-layer) was done. Careful parallel operator splitting-up implicit iterative scheme, parallel arithmet...Numerical simulation of careful parallel arithmetic of oil resources migration-accumulation of Tanhai Region ( three-layer) was done. Careful parallel operator splitting-up implicit iterative scheme, parallel arithmetic program, parallel arithmetic information and alternating-direction mesh subdivision were put forward. Parallel arithmetic and analysis of different CPU combinations were done. This numerical simulation test and the actual conditions are basically coincident. The convergence estimation of the model problem has successfully solved the difficult problem in the fields of permeation fluid mechanics, computational mathematics and petroleum geology.展开更多
This paper studies a high-speed text-independent Automatic Speaker Recognition(ASR)algorithm based on a multicore system's Gaussian Mixture Model(GMM).The high speech is achieved using parallel implementation of t...This paper studies a high-speed text-independent Automatic Speaker Recognition(ASR)algorithm based on a multicore system's Gaussian Mixture Model(GMM).The high speech is achieved using parallel implementation of the feature's extraction and aggregation methods during training and testing procedures.Shared memory parallel programming techniques using both OpenMP and PThreads libraries are developed to accelerate the code and improve the performance of the ASR algorithm.The experimental results show speed-up improvements of around 3.2 on a personal laptop with Intel i5-6300HQ(2.3 GHz,four cores without hyper-threading,and 8 GB of RAM).In addition,a remarkable 100%speaker recognition accuracy is achieved.展开更多
In this paper, we present a general survey on parallel computing. The main contents include parallel computer system which is the hardware platform of parallel computing, parallel algorithm which is the theoretical ba...In this paper, we present a general survey on parallel computing. The main contents include parallel computer system which is the hardware platform of parallel computing, parallel algorithm which is the theoretical base of parallel computing, parallel programming which is the software support of parallel computing. After that, we also introduce some parallel applications and enabling technologies. We argue that parallel computing research should form an integrated methodology of "architecture algorithm programming application". Only in this way, parallel computing research becomes continuous development and more realistic.展开更多
Today, parallel programming is dominated by message passing libraries, such as message passing interface (MPI). This article intends to simplify parallel programming by generating parallel programs from parallelized...Today, parallel programming is dominated by message passing libraries, such as message passing interface (MPI). This article intends to simplify parallel programming by generating parallel programs from parallelized algorithm design strategies. It uses skeletons to abstract parallelized algorithm design strategies, as well as parallel architectures. Starting from problem specification, an abstract parallel abstract programming language+ (Apla+) program is generated from parallelized algorithm design strategies and problem-specific function definitions. By combining with parallel architectures, implicity of parallelism inside the parallelized algorithm design strategies is exploited. With implementation and transformation, C++ and parallel virtual machine (CPPVM) parallel program is finally generated. Parallelized branch and bound (B&B) algorithm design strategy and paraUelized divide and conquer (D & C) algorithm design strategy are studied in this article as examples. And it also illustrates the approach with a case study.展开更多
文摘This study embarks on a comprehensive examination of optimization techniques within GPU-based parallel programming models,pivotal for advancing high-performance computing(HPC).Emphasizing the transition of GPUs from graphic-centric processors to versatile computing units,it delves into the nuanced optimization of memory access,thread management,algorithmic design,and data structures.These optimizations are critical for exploiting the parallel processing capabilities of GPUs,addressingboth the theoretical frameworks and practical implementations.By integrating advanced strategies such as memory coalescing,dynamic scheduling,and parallel algorithmic transformations,this research aims to significantly elevate computational efficiency and throughput.The findings underscore the potential of optimized GPU programming to revolutionize computational tasks across various domains,highlighting a pathway towards achieving unparalleled processing power and efficiency in HPC environments.The paper not only contributes to the academic discourse on GPU optimization but also provides actionable insights for developers,fostering advancements in computational sciences and technology.
文摘In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP’s effectiveness in accelerating image manipulation tasks.
基金National Natural F oundation of China(No.60 173 0 13 )
文摘Web service is a grid computing technology that promises greater ease-of-use and interoperability than previous distributed computing technologies. This paper proposed Group Service Framework, a grid computing platform based on Microsoft. NET that use web service to: (1) locate and harness volunteer computing resources for different applications, and (2) support multi-models such as Master/Slave, Divide and Conquer, Phase Parallel and so forth parallel programming paradigms in Grid environment, (3) allocate data and balance load dynamically and transparently for grid computing application. The Grid Service Framework based on Microsoft. NET was used to implement several simple parallel computing applications. The results show that the proposed Group Service Framework is suitable for generic parallel numerical computing.
基金This research is sponsored by the National Natural Science Foundation of China (No. 40374024).
文摘The workload of the 3D magnetotelluric forward modeling algorithm is so large that the traditional serial algorithm costs an extremely large compute time. However, the 3D forward modeling algorithm can process the data in the frequency domain, which is very suitable for parallel computation. With the advantage of MPI and based on an analysis of the flow of the 3D magnetotelluric serial forward algorithm, we suggest the idea of parallel computation and apply it. Three theoretical models are tested and the execution efficiency is compared in different situations. The results indicate that the parallel 3D forward modeling computation is correct and the efficiency is greatly improved. This method is suitable for large size geophysical computations.
文摘Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for single threaded im-plementation to extract local feature descriptors for high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA’s Graphics Processing Unit (GPU). The parallel-ization design for SIFT on GPUs is divided into two stages, a) Algorithm de-sign-generic design strategies which focuses on data and b) Implementation de-sign-architecture specific design strategies which focuses on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches and data blocking achieve a significant decrease in aver-age computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization while optimizing for maximum occupancy allows GPU to execute memory bound SIFT algorithm at optimal levels.
基金Supported by the Austrin Science Fund as part of the Special Research Program AURORA(f0 11)
文摘Large scale optimization problems can only be solved in an efficient way, if their special structure is taken as the basis of algorithm design. In this paper we consider a very broad class of large-scale problems with special structure, namely tree structured problems. We show how the exploitation of the structure leads to efficient decomposition algorithms and how it may be implemented in a parallel environment.
文摘The increasing amount of sequences stored in genomic databases has become unfeasible to the sequential analysis. Then, the parallel computing brought its power to the Bioinformatics through parallel algorithms to align and analyze the sequences, providing improvements mainly in the running time of these algorithms. In many situations, the parallel strategy contributes to reducing the computational complexity of the big problems. This work shows some results obtained by an implementation of a parallel score estimating technique for the score matrix calculation stage, which is the first stage of a progressive multiple sequence alignment. The performance and quality of the parallel score estimating are compared with the results of a dynamic programming approach also implemented in parallel. This comparison shows a significant reduction of running time. Moreover, the quality of the final alignment, using the new strategy, is analyzed and compared with the quality of the approach with dynamic programming.
基金supported by State Grid Corporation of China(SGCC)Science and Technology Project SGTJDK00DWJS1700060
文摘Modern power systems are evolving into sociotechnical systems with massive complexity, whose real-time operation and dispatch go beyond human capability. Thus,the need for developing and applying new intelligent power system dispatch tools are of great practical significance. In this paper, we introduce the overall business model of power system dispatch, the top level design approach of an intelligent dispatch system, and the parallel intelligent technology with its dispatch applications. We expect that a new dispatch paradigm,namely the parallel dispatch, can be established by incorporating various intelligent technologies, especially the parallel intelligent technology, to enable secure operation of complex power grids,extend system operators' capabilities, suggest optimal dispatch strategies, and to provide decision-making recommendations according to power system operational goals.
基金Supported by National Natural Science Foundation of China(Grant No.51275500)Research Project of State Key Laboratory of Mechanical System and Vibration(Grant No.MSV201502)+1 种基金USTC-COOGOO Robotics Research Center(Grant No.2015)Youth Innovation Promotion Association of Chinese Academy of Sciences(Grant No.2012321)
文摘The solution of tension distributions is infinite for cable-driven parallel manipulators(CDPMs) with redundant cables. A rapid optimization method for determining the optimal tension distribution is presented. The new optimization method is primarily based on the geometry properties of a polyhedron and convex analysis. The computational efficiency of the optimization method is improved by the designed projection algorithm, and a fast algorithm is proposed to determine which two of the lines are intersected at the optimal point. Moreover, a method for avoiding the operating point on the lower tension limit is developed. Simulation experiments are implemented on a six degree-of-freedom(6-DOF) CDPM with eight cables, and the results indicate that the new method is one order of magnitude faster than the standard simplex method. The optimal distribution of tension distribution is thus rapidly established on real-time by the proposed method.
基金Supported by the National Natural Science Foundation of China(60133010,70071042,60073043)
文摘First, an asynchronous distributed parallel evolutionary modeling algorithm (PEMA) for building the model of system of ordinary differential equations for dynamical systems is proposed in this paper. Then a series of parallel experiments have been conducted to systematically test the influence of some important parallel control parameters on the performance of the algorithm. A lot of experimental results are obtained and we make some analysis and explanations to them.
基金ESPRIT Basic Research ProCoS project 3104 and 7071
文摘In this paper they deal with the issue of specification and design of parallel communicatingprocesses. A trace-state based model is introduced to describe the behaviour of concurrent programs. They presenta formal system based on that model to achieve hierarchical and modular development and verification methods. Anumber of refinement rules are used to decompose the specification into smaller ones and calculate program fromthe
文摘Most modern technologies,such as social media,smart cities,and the internet of things(IoT),rely on big data.When big data is used in the real-world applications,two data challenges such as class overlap and class imbalance arises.When dealing with large datasets,most traditional classifiers are stuck in the local optimum problem.As a result,it’s necessary to look into new methods for dealing with large data collections.Several solutions have been proposed for overcoming this issue.The rapid growth of the available data threatens to limit the usefulness of many traditional methods.Methods such as oversampling and undersampling have shown great promises in addressing the issues of class imbalance.Among all of these techniques,Synthetic Minority Oversampling TechniquE(SMOTE)has produced the best results by generating synthetic samples for the minority class in creating a balanced dataset.The issue is that their practical applicability is restricted to problems involving tens of thousands or lower instances of each.In this paper,we have proposed a parallel mode method using SMOTE and MapReduce strategy,this distributes the operation of the algorithm among a group of computational nodes for addressing the aforementioned problem.Our proposed solution has been divided into three stages.Thefirst stage involves the process of splitting the data into different blocks using a mapping function,followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algo-rithm for solving the class imbalanced problem.On each map block,a decision tree model would be constructed.Finally,the decision tree blocks would be com-bined for creating a classification model.We have used numerous datasets with up to 4 million instances in our experiments for testing the proposed scheme’s cap-abilities.As a result,the Hybrid SMOTE appears to have good scalability within the framework proposed,and it also cuts down the processing time.
文摘Cyber-physical systems (CPS) represent a class of complex engineered systems where functionality and behavior emerge through the interaction between the computational and physical domains. Simulation provides design engineers with quick and accurate feedback on the behaviors generated by their designs. However, as systems become more complex, simulating their behaviors becomes computation all complex. But, most modern simulation environments still execute on a single thread, which does not take advantage of the processing power available on modern multi-core CPUs. This paper investigates methods to partition and simulate differential equation-based models of cyber-physical systems using multiple threads on multi-core CPUs that can share data across threads. We describe model partitioning methods using fixed step and variable step numerical in-tegration methods that consider the multi-layer cache structure of these CPUs to avoid simulation performance degradation due to cache conflicts. We study the effectiveness of each parallel simu-lation algorithm by calculating the relative speedup compared to a serial simulation applied to a series of large electric circuit models. We also develop a series of guidelines for maximizing performance when developing parallel simulation software intended for use on multi-core CPUs.
基金Project supported by the National Key Research and Development Program of China(No.2021YFB0300101)the National Natural Science Foundation of China(No.61972408)the UK Royal Society International Collaboration Grant。
文摘As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall,software developers are finding it hard to deal with the complexity of these systems.In this paper,we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000,which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization.To assist its software development,we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler.Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000,while the high-level model allows programmers to use the OpenCL programming standard.We detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal accelerators.Our programming models have been deployed in the production environment of an exascale prototype system.
文摘The seismic reflection method is one of the most important methods in geophysical exploration.There are three stages in a seismic exploration survey:acquisition,processing,and interpretation.This paper focuses on a pre-processing tool,the Non-Local Means(NLM)filter algorithm,which is a powerful technique that can significantly suppress noise in seismic data.However,the domain of the NLM algorithm is the whole dataset and 3D seismic data being very large,often exceeding one terabyte(TB),it is impossible to store all the data in Random Access Memory(RAM).Furthermore,the NLM filter would require a considerably long runtime.These factors make a straightforward implementation of the NLM algorithm on real geophysical exploration data infeasible.This paper redesigned and implemented the NLM filter algorithm to fit the challenges of seismic exploration.The optimized implementation of the NLM filter is capable of processing production-size seismic data on modern clusters and is 87 times faster than the straightforward implementation of NLM.
文摘本文用CORE-IAF(Coordinated Ocean-ice Reference Experiments–Interannual Forcing)外强迫场分别强迫LICOM3(LASG/IAP Climate System Ocean Model Version 3)和POP2(Parallel Ocean Program version 2)两个海洋模式,并分析了这两个模式中太平洋北赤道逆流(NECC)的模拟结果。我们发现LICOM3和POP2模拟的NECC强度均弱于实测,这和Sun et al.(2019)的研究结果一致,也进一步证明了海洋模式中NECC偏弱是CORE-IAF外强迫场造成的,海表风应力及对应的风应力旋度是海洋模式准确模拟NECC的最主要因子。同时,我们也分析了NECC的模拟在动力机制上的差别,这里的动力强迫项包括风应力项、平流项和余项。我们发现模式的外强迫场虽然相同,但是两个模式中各动力强迫项(风应力项、平流项和余项)对NECC模拟的影响并不完全相同。
基金Project supported by the National Natural Science Foundation of China (Nos. 10372052 and 10271066)the Major State Basic Research Program of China (No. 1999032803)the Special Fund of PhD Program of Education Ministry of China (No. 20030422047)
文摘Numerical simulation of careful parallel arithmetic of oil resources migration-accumulation of Tanhai Region ( three-layer) was done. Careful parallel operator splitting-up implicit iterative scheme, parallel arithmetic program, parallel arithmetic information and alternating-direction mesh subdivision were put forward. Parallel arithmetic and analysis of different CPU combinations were done. This numerical simulation test and the actual conditions are basically coincident. The convergence estimation of the model problem has successfully solved the difficult problem in the fields of permeation fluid mechanics, computational mathematics and petroleum geology.
文摘This paper studies a high-speed text-independent Automatic Speaker Recognition(ASR)algorithm based on a multicore system's Gaussian Mixture Model(GMM).The high speech is achieved using parallel implementation of the feature's extraction and aggregation methods during training and testing procedures.Shared memory parallel programming techniques using both OpenMP and PThreads libraries are developed to accelerate the code and improve the performance of the ASR algorithm.The experimental results show speed-up improvements of around 3.2 on a personal laptop with Intel i5-6300HQ(2.3 GHz,four cores without hyper-threading,and 8 GB of RAM).In addition,a remarkable 100%speaker recognition accuracy is achieved.
基金Supported by the National Natural Science Foundation of China under Grant No.60533020. Acknowledgments We would like to thank the anonymous referees for their valuable suggestions and comments to improve the presentation of this paper.
文摘In this paper, we present a general survey on parallel computing. The main contents include parallel computer system which is the hardware platform of parallel computing, parallel algorithm which is the theoretical base of parallel computing, parallel programming which is the software support of parallel computing. After that, we also introduce some parallel applications and enabling technologies. We argue that parallel computing research should form an integrated methodology of "architecture algorithm programming application". Only in this way, parallel computing research becomes continuous development and more realistic.
基金National Natural Science Foundation of China (60773054)National Basic Research Program of China (2003CCA02800)
文摘Today, parallel programming is dominated by message passing libraries, such as message passing interface (MPI). This article intends to simplify parallel programming by generating parallel programs from parallelized algorithm design strategies. It uses skeletons to abstract parallelized algorithm design strategies, as well as parallel architectures. Starting from problem specification, an abstract parallel abstract programming language+ (Apla+) program is generated from parallelized algorithm design strategies and problem-specific function definitions. By combining with parallel architectures, implicity of parallelism inside the parallelized algorithm design strategies is exploited. With implementation and transformation, C++ and parallel virtual machine (CPPVM) parallel program is finally generated. Parallelized branch and bound (B&B) algorithm design strategy and paraUelized divide and conquer (D & C) algorithm design strategy are studied in this article as examples. And it also illustrates the approach with a case study.