Ray tracing is a computer graphics method that renders images realistically. As the name suggests, this technique primarily traces the path of light rays interacting with objects in a scene [1], permitting the calculation of lighting and reflection effects [2]. Because ray tracing is a time-consuming process, parallelization is needed. One downside of parallelization is the existence of race conditions. In this work, we explore and experiment with different well-known solutions for this race condition. The introduction and background sections give a brief overview of the topic, followed by a detailed account of how race conditions may occur in the ray tracing algorithm. In the methods and results sections, we use OpenMP to parallelize the ray tracing algorithm with critical, atomic, and firstprivate. We conclude that neither critical nor atomic is an efficient solution for producing a good-quality picture, whereas firstprivate succeeded in producing a high-quality picture.
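The privatization idea behind firstprivate can be sketched outside OpenMP. The following is only an illustrative Python analogue, not the paper's code: `shade_row`, `render`, and the `template` dictionary are hypothetical names, and the "shading" is a toy formula. Each worker takes its own copy of a scratch state, initialised from the shared value, so no worker can overwrite another's intermediate data.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def shade_row(y, width, template):
    """Compute one row of toy pixel values using a private scratch state."""
    # Private copy initialised from the shared template: the analogue of
    # a firstprivate variable. Mutating it cannot affect other workers.
    state = copy.deepcopy(template)
    row = []
    for x in range(width):
        state["origin"] = (x, y)
        state["bounces"] = (x + y) % 3  # toy stand-in for per-ray work
        row.append(state["base_color"] + state["bounces"])
    return row


def render(width, height, template, workers=4):
    """Render all rows concurrently; each row uses its own state copy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: shade_row(y, width, template),
                             range(height)))
```

Because the shared template is only read and never written, the result does not depend on how the rows are scheduled across workers.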
Based on the characteristics of intra prediction algorithms, the data dependence and parallelism between intra prediction models are first analyzed. This paper proposes a parallelization method based on dynamic reconfigurable array processors provided by the project team, and uses data-level parallel (DLP) algorithms in multi-core units. The experimental results show that the Y-component peak signal-to-noise ratio (Y-PSNR) is improved by about 10 dB and the time is reduced by 63% compared with the high-efficiency video coding (HEVC) test model HM10.0. This method can effectively reduce the codec time of the video and reduce computational complexity.
The combined finite-discrete element method (FDEM) belongs to a family of methods of computational mechanics of discontinua. The method is suitable for problems of discontinua, where particles are deformable and can fracture or fragment. The applications of FDEM have spread over a number of disciplines including rock mechanics, where problems like mining, mineral processing or rock blasting can be solved by employing FDEM. In this work, a novel approach for the parallelization of two-dimensional (2D) FDEM aimed at clusters and desktop computers is developed. Parallel solvers based on dynamic domain decomposition and covering all aspects of FDEM have been developed. These have been implemented in the open-source Y2D software package and have been tested on a PC cluster. The overall performance and scalability of the parallel code have been studied using numerical examples. The results obtained confirm the suitability of the parallel implementation for solving large-scale problems.
The Global-Regional Integrated forecast System (GRIST) is the next-generation integrated weather and climate model dynamic framework developed by the Chinese Academy of Meteorological Sciences. In this paper, we present several changes made to the global nonhydrostatic dynamical (GND) core, which is part of the ongoing prototype of GRIST. The changes, which leverage MPI and PnetCDF, target the parallelization and performance optimization of the original serial GND core. Meanwhile, data structures and interfaces were designed to flexibly adjust the size of boundary and halo domains according to the variable accuracy in a parallel context. In addition, the I/O performance of PnetCDF decreased as the number of MPI processes increased in our experimental environment. In particular, when the number exceeded 6000, it caused system-wide outages (SWOs). Thus, a grouping solution was proposed to overcome that issue. Several experiments were carried out on the supercomputing platform based on Intel x86 CPUs at the National Supercomputing Center in Wuxi. The results demonstrate that the parallel GND core based on the grouping solution achieves good strong scalability and improves performance significantly, while avoiding the SWOs.
To reduce the computational complexity and storage cost caused by the wedge segmentation algorithm, a scheme for simplifying wedge matching is proposed. It takes advantage of the correlation between the wedge separation line of the depth map and the direction of intra prediction for 3D high-efficiency video coding (3D-HEVC). According to the difference in wedge segmentation between adjacent edges and opposite edges, a set including only 10 4×4 wedgelet templates is given. By expanding the wedgelet of a certain minimum unit, a simple separation line acquisition method for depth blocks of different sizes is put forward. Furthermore, based on the array processor (DPR-CODEC) developed by the project team, an efficient parallel scheme for the improved wedge segmentation mode prediction is introduced. With this scheme, the prediction unit (PU) size can be changed freely from 4×4 to 8×8, 16×16, and 32×32, which is more in line with the needs of the HEVC standard. Verified with test sequences in HTM16.1 and on the Xilinx Virtex-6 field programmable gate array (FPGA) respectively, the experimental results show that the proposed methods save 99.2% of the storage space and 63.94% of the encoding time, and the serial/parallel acceleration ratio of each template reaches 1.84 on average. The coding performance, storage and resource consumption are considered for both.
After the extension of depth modeling mode 4 (DMM-4) in 3D high-efficiency video coding (3D-HEVC), the computational complexity increases sharply, which degrades the real-time performance of video coding. To reduce the computational complexity of DMM-4, a simplified hardware-friendly contour prediction algorithm is proposed in this paper. Based on the similarity between the texture and the depth map, the proposed algorithm directly codes depth blocks to calculate edge regions, reducing the number of reference blocks. Verified with the test sequences on HTM16.1, the coding time of the proposed algorithm is reduced by 9.42% compared with the original algorithm. To avoid the time cost of serial coding on HTM, a parallelization design of the proposed algorithm based on the reconfigurable array processor (DPR-CODEC) is proposed. The parallelization design reduces the storage access time and configuration time and saves storage cost. Verified on the Xilinx Virtex-6 FPGA, experimental results show that the parallelization design is capable of processing HD 1080p video at more than 30 frames per second. Compared with related work, the scheme reduces the LUTs by 42.3%, the REG by 85.5% and the hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme can reach 3.4539. On average, the serial/parallel speedup ratio of encoding time across templates of different sizes can reach 2.446.
With the development of satellite remote sensing technology, increasing demands are placed on the timeliness and stability of the satellite weather service system. The FY satellite rainfall estimate day knock off product algorithm has a long running time, about 20 minutes, which affects the timeliness of the generated rainfall estimate product. Researching and developing parallel optimization algorithms based on the needs of satellite meteorological services, and evaluating their effectiveness in practical applications, is a necessary way to enhance the high-performance and high-availability capabilities of these services. To address this problem, we started the parallel algorithm research from an analysis of the precipitation estimation algorithm. First, we explained the steps of the precipitation estimate day knock off product algorithm; second, we analyzed the computational load of the four main calculation modules; third, a multithreaded parallel algorithm and an MPI parallelization were designed. Finally, the multithreaded and MPI parallelizations were implemented. Experimental results show that both the multithreaded and the MPI parallel algorithms greatly improve overall computational efficiency, and that the MPI parallelization mode has the higher operating efficiency. The performance of parallel processing is closely related to the architecture of the computer. From the perspective of service scheduling and product algorithms, the MPI parallelization approach is adopted to improve service quality.
The Scale Invariant Feature Transform (SIFT) is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors from high-resolution images in real time. In this paper, an approach to parallelizing the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: (a) algorithm design, comprising generic design strategies that focus on data, and (b) implementation design, comprising architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
The general m-machine permutation flowshop problem with the total flow-time objective is known to be NP-hard for m ≥ 2. The only practical method for finding optimal solutions has been branch-and-bound algorithms. In this paper, we present an improved sequential algorithm which is based on a strict alternation of Generation and Exploration execution modes as well as Depth-First/Best-First hybrid strategies. The experimental results show that the proposed scheme exhibits improved performance compared with the algorithm in [1]. More importantly, our method can be easily extended and implemented with lightweight threads to reduce execution times. Good speedups can be obtained on shared-memory multicore systems.
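For orientation, a minimal branch-and-bound sketch for the total flow-time objective is given below. It is a plain depth-first search with a simple lower bound, not the Generation/Exploration scheme or the Depth-First/Best-First hybrid of the paper; the function names `advance`, `total_flow_time`, and `branch_and_bound` and the bound itself are assumptions made for illustration.

```python
def advance(front, job):
    """Completion times on every machine after appending `job`.

    `front[k]` is the time machine k becomes free; `job[k]` is the
    processing time of the job on machine k.
    """
    times = []
    ready = 0  # completion time of this job on the previous machine
    for machine, duration in enumerate(job):
        ready = max(ready, front[machine]) + duration
        times.append(ready)
    return times


def total_flow_time(order, jobs):
    """Sum of last-machine completion times for a full permutation."""
    front = [0] * len(jobs[0])
    total = 0
    for j in order:
        front = advance(front, jobs[j])
        total += front[-1]
    return total


def branch_and_bound(jobs):
    """Return (optimal total flow time, one optimal job order)."""
    if not jobs:
        return 0, ()
    best = {"cost": float("inf"), "order": ()}

    def lower_bound(front, cost, remaining):
        # Any unscheduled job must start after machine 0 is free and
        # finish after the last machine is free.
        return cost + sum(
            max(front[0] + sum(jobs[j]), front[-1] + jobs[j][-1])
            for j in remaining
        )

    def explore(order, remaining, front, cost):
        if not remaining:
            if cost < best["cost"]:
                best["cost"], best["order"] = cost, order
            return
        if lower_bound(front, cost, remaining) >= best["cost"]:
            return  # prune: this subtree cannot beat the incumbent
        for j in sorted(remaining):
            nxt = advance(front, jobs[j])
            explore(order + (j,), remaining - {j}, nxt, cost + nxt[-1])

    explore((), frozenset(range(len(jobs))), [0] * len(jobs[0]), 0)
    return best["cost"], best["order"]
```

The subtrees under different first jobs are independent apart from the shared incumbent, which is what makes this kind of search a candidate for thread-level parallelism.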
A rate-dependent peridynamic ceramic model, considering the brittle tensile response, compressive plastic softening and strain-rate dependence, can accurately represent the dynamic response and crack propagation of ceramic materials. However, it also considers the strain-rate dependence and damage accumulation caused by compressive plastic softening during the compression stage, requiring more computational resources for the bond force evaluation and damage evolution. Herein, the OpenMP parallel optimization of the rate-dependent peridynamic ceramic model is investigated. The modules that compute the interactions between material points and update the damage index are vectorized and parallelized. Numerical examples are carried out to simulate the dynamic response and fracture of a ceramic plate under normal impact. Furthermore, the speed-up ratio and computational efficiency with multiple threads are evaluated and discussed to demonstrate the reliability of the parallelized programs. The results reveal that the total wall-clock time is significantly reduced after optimization, showing the promise of the parallelization process in terms of accuracy and stability.
In this paper, we present parallel programming approaches to calculate the values of the cells in the scoring matrix used in the Smith-Waterman algorithm for sequence alignment. This algorithm, well known in bioinformatics for its applications, is unfortunately time-consuming on a serial computer. We use a formulation based on the anti-diagonal structure of the data. This representation focuses on the parallelizable parts of the algorithm without changing its initial formulation, and it gives a more flexible formulation. To examine this approach, we implement it in OpenMP and CUDA C. The performance obtained supports the value of the approach.
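The anti-diagonal formulation can be illustrated with a short serial sketch. Cells with the same index sum i + j depend only on the two previous anti-diagonals, so the inner loop below is the part a parallel implementation could distribute across threads. The scoring values and the linear gap penalty are assumptions chosen for illustration, and the function name is hypothetical; this is not the paper's code.

```python
def smith_waterman_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman scoring matrix one anti-diagonal at a time.

    Returns (best local alignment score, scoring matrix H).
    Assumes a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    # d = i + j identifies an anti-diagonal; row 0 and column 0 stay zero.
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        # Cells on the same anti-diagonal are mutually independent:
        # each reads only from anti-diagonals d - 1 and d - 2.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score,  # match or mismatch
                H[i - 1][j] + gap,        # gap in b
                H[i][j - 1] + gap,        # gap in a
            )
            best = max(best, H[i][j])
    return best, H
```

The traversal order changes, but every cell is computed with the usual recurrence, so the matrix is identical to the one produced by a row-by-row fill.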
The parallelization of the diagnostics for climate research has been an important goal in the performance testing and improvement of the diagnostics for the Department of Energy's (DOE's) Accelerated Climate Modeling for Energy (ACME) project [1]. The primary mission of the ACME project is to build and test the next-generation Earth system model for current and future generations of computing systems operated by the DOE Office of Science computing facilities, including the envisioned exascale systems foreseen in the early part of the next decade. As part of the underpinning workflow environment, a diagnostics, model metrics, and intercomparison Python framework called UVC Metrics was created to aid in testing and production execution of the model. This framework builds on common methods and similar metrics to accommodate and diagnose individual component models, such as atmosphere, land, ocean, sea ice, and land ice. This paper reports on the initial parallelization of UVC Metrics for the atmosphere model component using two popular frameworks: MPI and Spark. A timing study is presented to assess the performance of each method; significant improvement was achieved for both frameworks despite I/O contention with NFS. The advantages and disadvantages of each framework are also presented.
In this paper, stochastic global optimization algorithms, specifically the genetic algorithm and simulated annealing, are used to calibrate the dynamic option pricing model under stochastic volatility to market prices by adopting a hybrid programming approach. The performance of this dynamic option pricing model under the obtained optimal parameters is also discussed. To enhance the model throughput and reduce latency, a heterogeneous hybrid programming approach was adopted which emphasizes a data-parallel implementation of the dynamic option pricing model on a GPU-based system. The compute-intensive segments of the pricing algorithms were offloaded to the GPU as kernels written in OpenCL. The GPU approach was found to reduce latency significantly, running up to 541 times faster than a parallel implementation on the CPU and reducing the computation time from 46.24 minutes to 5.12 seconds.
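The reported factor can be checked directly from the two timings. The short sketch below only restates that arithmetic; the helper name `speedup` is hypothetical.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline runtime to optimized runtime."""
    return baseline_seconds / optimized_seconds


cpu_seconds = 46.24 * 60  # 46.24 minutes on the CPU, converted to seconds
gpu_seconds = 5.12        # reported GPU runtime in seconds
ratio = speedup(cpu_seconds, gpu_seconds)
# ratio is about 541.9, consistent with the reported factor of 541
```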
Parallel image processing with OpenMP provides real-time capabilities and computational efficiency. However, race conditions can affect the accuracy and reliability of the outcomes. This paper highlights the importance of addressing race conditions in parallel image processing, specifically focusing on color inverse filtering using OpenMP. We considered three solutions to race conditions, each with distinct characteristics: #pragma omp atomic, which protects individual memory operations for fine-grained control; #pragma omp critical, which protects entire code blocks for exclusive access; and #pragma omp parallel sections reduction, which employs a reduction clause for safe aggregation of values across threads. Our findings show that the produced images were unaffected by race conditions. However, resolving the race conditions in the code makes it significantly faster, especially when it is executed on multiple cores.
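A rough Python analogue of the reduction approach is sketched below; it is not OpenMP code and not the paper's implementation. Each worker inverts a disjoint band of rows, so no two workers write the same pixel, and each keeps a private partial sum that is merged only at the end. The names `invert_band` and `invert_parallel` and the channel-sum statistic are assumptions for illustration; in CPython the threads show the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_band(image, start, stop):
    """Invert rows [start, stop) in place; return the band's channel sum."""
    band_sum = 0  # private to this call, like a thread-local reduction copy
    for y in range(start, stop):
        row = image[y]
        for x, (r, g, b) in enumerate(row):
            row[x] = (255 - r, 255 - g, 255 - b)
            band_sum += sum(row[x])
    return band_sum


def invert_parallel(image, workers=4):
    """Invert an 8-bit RGB image (list of rows of tuples) band by band."""
    height = len(image)
    step = max(1, -(-height // workers))  # ceiling division
    bands = [(s, min(s + step, height)) for s in range(0, height, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(lambda band: invert_band(image, *band), bands)
        # Merge step: combine the private partial sums into one total.
        return sum(partial_sums)
```

Because the bands do not overlap, the pixel writes need no lock; only the aggregate value requires a merge, which is the role a reduction clause plays.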
Analyzing power flow transfer (PFT) in advance under various anticipated faults is important for securing power system operations. In China, PSD-BPA software is the most widely used tool for power system analysis, but its input/output interface is not easily adapted for PFT analysis, which is also difficult because of its computational intensity. To solve this issue and achieve fast and accurate PFT analysis, a modular parallelization framework is developed in this paper. Two major contributions are included. One is a set of integrated PFT analysis modules, including parameter initialization, fault setting, network integrity detection, reasonableness identification and result analysis. The other is a parallelization technique for enhancing computation efficiency using a Fork/Join framework. The proposed framework has been tested and validated on the IEEE 39-bus reference power system. Furthermore, it has been applied to a practical power network with 11052 buses and 12487 branches in the Yunnan Power Grid of China, providing decision support for large-scale power system analysis.
Current parallel ankle rehabilitation robots (ARRs) suffer from difficult real-time alignment of the human-robot joint center of rotation, which may lead to secondary injuries to the patient. This study investigates the type synthesis of a parallel self-alignment ankle rehabilitation robot (PSAARR) based on the kinematic characteristics of ankle joint rotation center drift, from the perspective of introducing "suitable passive degrees of freedom (DOF)" with a suitable number and form. First, the self-alignment principle of the parallel ARR was proposed by deriving conditions for transforming a human-robot closed chain (HRCC), formed by an ARR and the human body, into a kinematically suitable constrained system and by introducing the conditions of "decoupled" and "less limb". Second, the relationship between the self-alignment principle and the actuation wrenches (twists) of the PSAARR was analyzed with the velocity Jacobian matrix as a "bridge". Subsequently, the type synthesis conditions of the PSAARR were proposed. Third, a PSAARR synthesis method was proposed based on screw theory, and the type synthesis of the PSAARR was conducted. Finally, an HRCC kinematic model was established to verify the self-alignment capability of the PSAARR. In this study, 93 types of PSAARR limb structures were synthesized and the self-alignment capability of a human-robot joint axis was verified through kinematic analysis, which provides a theoretical basis for the design of such an ARR.
The kinematic equivalent model of an existing ankle-rehabilitation robot is inconsistent with the anatomical structure of the human ankle, which influences the rehabilitation effect. Therefore, this study equates the human ankle to the UR model and proposes a novel three degrees of freedom (3-DOF) generalized spherical parallel mechanism for ankle rehabilitation. The parallel mechanism has two spherical centers corresponding to the rotation centers of the tibiotalar and subtalar joints. Using screw theory, the mobility of the parallel mechanism, which meets the requirements of the human ankle, is analyzed. The inverse kinematics are presented, and singularities are identified based on the Jacobian matrix. The workspaces of the parallel mechanism are obtained through the search method and compared with the motion range of the human ankle, which shows that the parallel mechanism can meet the motion demand of ankle rehabilitation. Additionally, based on the motion-force transmissibility, the performance atlases are plotted in the parameter optimal design space, and the optimum parameter is obtained according to the demands of practical applications. The results show that the parallel mechanism can meet the motion requirements of ankle rehabilitation and has excellent kinematic performance in its rehabilitation range, which provides a theoretical basis for the prototype design and experimental verification.
The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memory systems. However, MPI implementations can contain defects that impact the reliability and performance of parallel applications. Detecting and correcting these defects is crucial, yet there is a lack of published models specifically designed for correcting MPI defects. To address this, we propose a model for detecting and correcting MPI defects (DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blocking point-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defects addressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and message mismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a dataset consisting of 40 MPI codes. The results indicate that the model detected defects in 37 out of 40 codes, an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI model ranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correcting defects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. The DC_MPI model fills an important research gap and provides a valuable tool for improving the quality of MPI-based parallel computing systems.
Hydraulic-electric rock fragmentation (HERF) plays a significant role in improving the efficiency of high-voltage pulse rock breaking. However, the underlying mechanism of HERF remains unclear. In this study, considering the heterogeneity of the rock, microscopic thermodynamic properties, and shockwave time-domain waveforms, and based on the shockwave model, digital imaging technology and the discrete element method, cyclic loading numerical simulations of HERF are achieved by coupling electrical, thermal, and solid mechanics under different formation temperatures, confining pressures, initial peak voltages, electrode bit diameters, and loading times. Meanwhile, the HERF discharge system is used for laboratory experiments with various electrical parameters, and the resulting broken pits are numerically reconstructed to obtain their geometric parameters. The results show that the completely broken area consists of powdery rock debris. In the pre-broken zone, the mineral cementation of the rock determines the transition of type CⅠ cracks to type CⅡ and type CⅢ cracks. Furthermore, the peak pressure of the shockwave increased with initial peak voltage but decreased with electrode bit diameter, while the wave front time reduced. Moreover, increasing well depth, formation temperature and confining pressure augment and inhibit HERF, but once confining pressure surpasses the threshold of 60 MPa for 152.40, 215.90, and 228.60 mm electrode bits, and 40 MPa for 309.88 mm electrode bits, HERF is promoted. Additionally, for the same kind of rock, the volume and width of the broken pit increase with higher initial peak voltage, and rock fissures promote HERF. Finally, the electrode drill bit with a 215.90 mm diameter is found to be more suitable for drilling pink granite. This research contributes to a better microscopic understanding of HERF and provides valuable insights for electrode bit selection, as well as the optimization of circuit parameters for HERF technology.
Funding: Supported by the National Natural Science Foundation of China (No. 61772417, 61634004, 61602377, 61272120), the Shaanxi Provincial Co-ordination Innovation Project of Science and Technology (No. 2016KTZDGY02-04-02), and the Shaanxi Provincial Key R&D Plan (No. 2017GY-060).
Funding: This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFC1502203.
Funding: Supported by the National Natural Science Foundation of China (No. 61834005, 61772417, 61802304, 61602377, 61874087, 61634004) and the Shaanxi International Science and Technology Cooperation Program (No. 2018KW-006).
Funding: Supported by the National Natural Science Foundation of China (Nos. 61834005, 61772417, 61802304, 61602377, 61874087, 61634004) and the Shaanxi Province Key R&D Plan (Nos. 2020JM-525, 2021GY-029, 2021KW-16).
Abstract: After the extension of depth modeling mode 4 (DMM-4) in 3D high efficiency video coding (3D-HEVC), the computational complexity increases sharply, which impacts the real-time performance of video coding. To reduce the computational complexity of DMM-4, a simplified hardware-friendly contour prediction algorithm is proposed in this paper. Based on the similarity between the texture and depth maps, the proposed algorithm directly codes depth blocks to calculate edge regions, reducing the number of reference blocks. Verification with the test sequences on HTM16.1 shows that the coding time of the proposed algorithm is reduced by 9.42% compared with the original algorithm. To avoid the time cost of serial coding on HTM, a parallelization design of the proposed algorithm based on a reconfigurable array processor (DPR-CODEC) is proposed. The parallelization design reduces the storage access time and configuration time and saves storage cost. Verified on the Xilinx Virtex-6 FPGA, experimental results show that the parallelization design is capable of processing HD 1080p at a speed above 30 frames per second. Compared with related work, the scheme reduces the LUTs by 42.3%, the REG by 85.5%, and the hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme can reach 3.4539. On average, the serial/parallel speedup ratio of encoding time across the different template sizes can reach 2.446.
Abstract: With the development of satellite remote sensing technology, more and more requirements are being placed on the timeliness and stability of the satellite weather service system. The FY satellite rainfall estimate day knock off product algorithm has a long running time, about 20 minutes, which affects the timeliness of the generated rainfall estimate product. Researching and developing parallel optimization algorithms based on the needs of satellite meteorological services, and evaluating their effectiveness in practical applications, is a necessary way to enhance the high-performance and high-availability capabilities of those services. To address this problem, we carried out parallel algorithm research based on an analysis of the precipitation estimation algorithm. First, we explain the steps of the precipitation estimated date knock off product algorithm; second, we analyze the computational load of its four main calculation modules; third, a multithreaded parallel algorithm and an MPI parallelization are designed. Finally, both the multithreaded and the MPI parallelization are implemented. Experimental results show that both approaches greatly improve overall computational efficiency, and that the MPI parallelization mode has the higher operating efficiency. The performance of parallel processing is closely related to the architecture of the computer. From the perspective of service scheduling and product algorithms, the MPI parallelization approach is adopted to improve service quality.
Abstract: The Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors for high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: a) algorithm design, consisting of generic design strategies that focus on data, and b) implementation design, consisting of architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches, and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
Abstract: The general m-machine permutation flowshop problem with the total flow-time objective is known to be NP-hard for m ≥ 2. The only practical method for finding optimal solutions has been branch-and-bound algorithms. In this paper, we present an improved sequential algorithm which is based on a strict alternation of Generation and Exploration execution modes as well as Depth-First/Best-First hybrid strategies. The experimental results show that the proposed scheme exhibits improved performance compared with the algorithm in [1]. More importantly, our method can be easily extended and implemented with lightweight threads to reduce execution times. Good speedups can be obtained on shared-memory multicore systems.
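To make the total flow-time objective concrete, the following is a minimal Python sketch, not the algorithm of the paper. It assumes a hypothetical data layout `p[j][k]` (processing time of job `j` on machine `k`) and a toy 3-job, 2-machine instance invented for illustration. The search is a plain depth-first branch-and-bound that prunes with the flow-time of the partial sequence, which is a valid lower bound because appending jobs never changes the completion times of jobs already scheduled.

```python
from itertools import permutations


def total_flowtime(perm, p):
    """Sum of the completion times (on the last machine) of the jobs in perm."""
    m = len(p[0])
    done = [0] * m  # done[k]: finish time of the most recent job on machine k
    total = 0
    for j in perm:
        for k in range(m):
            # A job starts on machine k once the machine is free and the job
            # has finished on machine k-1 (done[k-1] was just updated for j).
            ready = done[k - 1] if k > 0 else 0
            done[k] = max(done[k], ready) + p[j][k]
        total += done[-1]
    return total


def branch_and_bound(p):
    """Depth-first search over job sequences with partial-cost pruning."""
    best_cost, best_perm = float("inf"), None

    def dfs(prefix, remaining):
        nonlocal best_cost, best_perm
        cost = total_flowtime(prefix, p)
        if cost >= best_cost:
            return  # the partial flow-time already matches or exceeds the incumbent
        if not remaining:
            best_cost, best_perm = cost, prefix
            return
        for j in sorted(remaining):
            dfs(prefix + (j,), remaining - {j})

    dfs((), frozenset(range(len(p))))
    return best_cost, best_perm


# Hypothetical toy instance: 3 jobs, 2 machines.
P = [[3, 2], [1, 4], [2, 1]]
BEST_COST, BEST_PERM = branch_and_bound(P)
```

On instances this small the result can be cross-checked against exhaustive enumeration with `itertools.permutations`; stronger lower bounds and the Generation/Exploration alternation described in the abstract are outside the scope of this sketch.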
Funding: Supported by the National Natural Science Foundation of China (Nos. 11972267, 11802214 and 51932006) and the Fundamental Research Funds for the Central Universities (WUT: 2020lll031GX).
Abstract: A rate-dependent peridynamic ceramic model, considering the brittle tensile response, compressive plastic softening, and strain-rate dependence, can accurately represent the dynamic response and crack propagation of ceramic materials. However, it also considers the strain-rate dependence and damage accumulation caused by compressive plastic softening during the compression stage, requiring more computational resources for the bond force evaluation and damage evolution. Herein, the OpenMP parallel optimization of the rate-dependent peridynamic ceramic model is investigated. The modules that compute the interactions between material points and update the damage index are vectorized and parallelized. Numerical examples are carried out to simulate the dynamic response and fracture of a ceramic plate under normal impact. Furthermore, the speed-up ratio and computational efficiency with multiple threads are evaluated and discussed to demonstrate the reliability of the parallelized programs. The results reveal that the total wall clock time is significantly reduced after optimization, showing the promise of the parallelization process in terms of accuracy and stability.
Abstract: In this paper, we present parallel programming approaches to calculate the values of the cells in the scoring matrix used in the Smith-Waterman algorithm for sequence alignment. This algorithm, well known in bioinformatics for its applications, is unfortunately time-consuming on a serial computer. We use a formulation based on the anti-diagonal structure of the data. This representation focuses on the parallelizable parts of the algorithm without changing its initial formulation. Approaching the data in this way gives us a more flexible formulation. To examine this approach, we implement it in OpenMP and CUDA C. The performance obtained demonstrates the merit of the approach.
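The anti-diagonal idea can be illustrated with a short sequential Python sketch. It is not the authors' OpenMP or CUDA C code; the scoring values (`match=2`, `mismatch=-1`, linear `gap=-1`) and function names are assumptions chosen for illustration. Every cell on the anti-diagonal `i + j = d` depends only on cells from diagonals `d - 1` and `d - 2`, so the inner loop is the part that a parallel implementation could distribute across threads.

```python
def sw_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman scoring matrix one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):          # d = i + j
        lo, hi = max(1, d - m), min(n, d - 1)
        for i in range(lo, hi + 1):        # cells here are mutually independent
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # diagonal: match or mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
    return H


def sw_rowmajor(a, b, match=2, mismatch=-1, gap=-1):
    """Reference fill in the usual row-by-row order, for comparison."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H


def best_score(H):
    """Highest local-alignment score in the matrix."""
    return max(max(row) for row in H)
```

Both traversal orders produce the same matrix, which is why the anti-diagonal formulation leaves the original algorithm unchanged while exposing the independent work.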
Abstract: The parallelization of the diagnostics for climate research has been an important goal in the performance testing and improvement of the diagnostics for the Department of Energy's (DOE's) Accelerated Climate Modeling for Energy (ACME) project [1]. The primary mission of the ACME project is to build and test the next-generation Earth system model for current and future generations of computing systems operated by the DOE Office of Science computing facilities, including the exascale systems envisioned for the early part of the next decade. As part of the underpinning workflow environment, a diagnostics, model metrics, and intercomparison Python framework, called UVC Metrics, was created to aid in testing and production execution of the model. This framework builds on common methods and similar metrics to accommodate and diagnose individual component models, such as atmosphere, land, ocean, sea ice, and land ice. This paper reports on the initial parallelization of UVC Metrics for the atmosphere model component using two popular frameworks: MPI and SPARK. A timing study is presented to assess the performance of each method; significant improvement was achieved for both frameworks despite I/O contention with NFS. The advantages and disadvantages of each framework are also presented.
Abstract: In this paper, stochastic global optimization algorithms, specifically the genetic algorithm and simulated annealing, are used for the problem of calibrating the dynamic option pricing model under stochastic volatility to market prices by adopting a hybrid programming approach. The performance of this dynamic option pricing model under the obtained optimal parameters is also discussed. To enhance the model throughput and reduce latency, a heterogeneous hybrid programming approach was adopted, which emphasizes a data-parallel implementation of the dynamic option pricing model on a GPU-based system. The compute-intensive segments of the pricing algorithms were offloaded to the GPU as kernels written in OpenCL. The GPU approach was found to reduce latency by up to a factor of 541 relative to a parallel implementation on the CPU, reducing the computation time from 46.24 minutes to 5.12 seconds.
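The reported factor can be checked directly from the two timings in the abstract. The snippet below is only a unit-conversion check in Python; the variable names are illustrative.

```python
# Timings quoted in the abstract.
cpu_seconds = 46.24 * 60   # parallel CPU implementation: 46.24 minutes
gpu_seconds = 5.12         # GPU implementation: 5.12 seconds

# Speedup is the ratio of the two wall-clock times in the same unit.
speedup = cpu_seconds / gpu_seconds
print(f"speedup ~ {speedup:.1f}x")
```

The ratio is about 541.9, consistent with the figure of 541 quoted in the abstract once it is truncated to an integer.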
Abstract: Parallel image processing with OpenMP provides real-time capabilities and computational efficiency. However, race conditions can affect the accuracy and reliability of the outcomes. This paper highlights the importance of addressing race conditions in parallel image processing, specifically focusing on color inverse filtering using OpenMP. We considered three solutions to race conditions, each with distinct characteristics: #pragma omp atomic, which protects individual memory operations for fine-grained control; #pragma omp critical, which protects entire code blocks for exclusive access; and #pragma omp parallel sections reduction, which employs a reduction clause for safe aggregation of values across threads. Our findings show that the produced images were unaffected by race conditions. However, it becomes evident that solving the race conditions in the code makes it significantly faster, especially when it is executed on multiple cores.
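The reduction idea described above can be sketched in Python as an analogue; this is not OpenMP code and not the authors' implementation. It assumes 8-bit channel values and uses hypothetical function names. Each worker inverts its own chunk and keeps a private partial sum, and the partial results are combined only after all workers finish, so no shared accumulator is ever updated concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_chunk(pixels):
    """Invert one chunk of 8-bit values and return it with its private sum."""
    inverted = [255 - v for v in pixels]
    return inverted, sum(inverted)


def invert_image(pixels, workers=4):
    """Invert all values; aggregate per-chunk sums in a final reduction step."""
    size = max(1, -(-len(pixels) // workers))  # ceiling division, at least 1
    chunks = [pixels[i:i + size] for i in range(0, len(pixels), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(invert_chunk, chunks))  # order is preserved
    inverted = [v for part, _ in results for v in part]
    total = sum(partial for _, partial in results)      # the reduction
    return inverted, total
```

For example, `invert_image([0, 255, 100, 7, 8])` returns `([255, 0, 155, 248, 247], 905)`. The atomic and critical approaches would instead protect a single shared total, which serializes the updates.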
Funding: Supported by the Major International Joint Research Project of the National Natural Science Foundation of China (No. 51210014) and the Major Program of the National Natural Science Foundation of China (No. 91547201).
Abstract: Power flow transfer (PFT) analysis under various anticipated faults in advance is important for securing power system operations. In China, the PSD-BPA software is the most widely used tool for power system analysis, but its input/output interface is not easily adapted for PFT analysis, which is also difficult because of its computational intensity. To solve this issue and achieve fast and accurate PFT analysis, a modular parallelization framework is developed in this paper. It makes two major contributions. One is a set of integrated PFT analysis modules, including parameter initialization, fault setting, network integrity detection, reasonableness identification, and result analysis. The other is a parallelization technique that enhances computational efficiency using a Fork/Join framework. The proposed framework has been tested and validated on the IEEE 39-bus reference power system. Furthermore, it has been applied to a practical power network with 11052 buses and 12487 branches in the Yunnan Power Grid of China, providing decision support for large-scale power system analysis.
Funding: Supported by Key Scientific Research Platforms and Projects of Guangdong Regular Institutions of Higher Education of China (Grant No. 2022KCXTD033); Guangdong Provincial Natural Science Foundation of China (Grant No. 2023A1515012103); Guangdong Provincial Scientific Research Capacity Improvement Project of Key Developing Disciplines of China (Grant No. 2021ZDJS084); National Natural Science Foundation of China (Grant No. 52105009).
Abstract: Current parallel ankle rehabilitation robots (ARRs) suffer from difficult real-time alignment of the human-robot joint center of rotation, which may lead to secondary injuries to the patient. This study investigates the type synthesis of a parallel self-alignment ankle rehabilitation robot (PSAARR) based on the kinematic characteristics of ankle joint rotation center drift, from the perspective of introducing "suitable passive degrees of freedom (DOF)" with a suitable number and form. First, the self-alignment principle of a parallel ARR was proposed by deriving conditions for transforming a human-robot closed chain (HRCC), formed by an ARR and the human body, into a kinematically suitably constrained system and by introducing the conditions of "decoupled" and "less limb". Second, the relationship between the self-alignment principle and the actuation wrenches (twists) of the PSAARR was analyzed with the velocity Jacobian matrix as a "bridge". Subsequently, the type synthesis conditions of the PSAARR were proposed. Third, a PSAARR synthesis method was proposed based on screw theory, and the type synthesis of the PSAARR was conducted. Finally, an HRCC kinematic model was established to verify the self-alignment capability of the PSAARR. In this study, 93 types of PSAARR limb structures were synthesized and the self-alignment capability of a human-robot joint axis was verified through kinematic analysis, which provides a theoretical basis for the design of such an ARR.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52075145); S&T Program of Hebei Province of China (Grant Nos. 20281805Z, E2020103001); Central Government Guides Basic Research Projects of Local Science and Technology Development Funds of China (Grant No. 206Z1801G).
Abstract: The kinematic equivalent model of an existing ankle-rehabilitation robot is inconsistent with the anatomical structure of the human ankle, which influences the rehabilitation effect. Therefore, this study equates the human ankle to the UR model and proposes a novel three degrees of freedom (3-DOF) generalized spherical parallel mechanism for ankle rehabilitation. The parallel mechanism has two spherical centers corresponding to the rotation centers of the tibiotalar and subtalar joints. Using screw theory, the mobility of the parallel mechanism, which meets the requirements of the human ankle, is analyzed. The inverse kinematics are presented, and singularities are identified based on the Jacobian matrix. The workspaces of the parallel mechanism are obtained through the search method and compared with the motion range of the human ankle, which shows that the parallel mechanism can meet the motion demand of ankle rehabilitation. Additionally, based on the motion-force transmissibility, the performance atlases are plotted in the parameter optimal design space, and the optimum parameter is obtained according to the demands of practical applications. The results show that the parallel mechanism can meet the motion requirements of ankle rehabilitation and has excellent kinematic performance in its rehabilitation range, which provides a theoretical basis for the prototype design and experimental verification.
Funding: Deanship of Scientific Research at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. RG-12-611-43.
Abstract: The Message Passing Interface (MPI) is a widely accepted standard for parallel computing on distributed memory systems. However, MPI implementations can contain defects that impact the reliability and performance of parallel applications. Detecting and correcting these defects is crucial, yet there is a lack of published models specifically designed for correcting MPI defects. To address this, we propose a model for detecting and correcting MPI defects (DC_MPI), which aims to detect and correct defects in various types of MPI communication, including blocking point-to-point (BPTP), nonblocking point-to-point (NBPTP), and collective communication (CC). The defects addressed by the DC_MPI model include illegal MPI calls, deadlocks (DL), race conditions (RC), and message mismatches (MM). To assess the effectiveness of the DC_MPI model, we performed experiments on a dataset consisting of 40 MPI codes. The results indicate that the model achieved a detection rate of 37 out of 40 codes, resulting in an overall detection accuracy of 92.5%. Additionally, the execution duration of the DC_MPI model ranged from 0.81 to 1.36 s. These findings show that the DC_MPI model is useful in detecting and correcting defects in MPI implementations, thereby enhancing the reliability and performance of parallel applications. The DC_MPI model fills an important research gap and provides a valuable tool for improving the quality of MPI-based parallel computing systems.
Funding: Supported by the National Natural Science Foundation of China (Nos. 52034006, 52004229, 52225401, and 52274231); the Regional Innovation Cooperation Project of Sichuan Province (No. 2022YFQ0059); Science and Technology Cooperation Project of the CNPC-SWPU Innovation Alliance (No. 2020CX040301); Natural Science Foundation of Sichuan Province (No. 2023NSFSC0431); Science and Technology Strategic Cooperation Project between Nanchong City and Southwest Petroleum University (No. SXHZ004); Research and Innovation Fund for Graduate Students of Southwest Petroleum University (No. 2022KYCX058).
Abstract: Hydraulic-electric rock fragmentation (HERF) plays a significant role in improving the efficiency of high voltage pulse rock breaking. However, the underlying mechanism of HERF remains unclear. In this study, considering the heterogeneity of the rock, microscopic thermodynamic properties, and shockwave time domain waveforms, and based on the shockwave model, digital imaging technology, and the discrete element method, cyclic loading numerical simulations of HERF are achieved by coupling electrical, thermal, and solid mechanics under different formation temperatures, confining pressures, initial peak voltages, electrode bit diameters, and loading times. Meanwhile, the HERF discharge system is conducive to laboratory experiments with various electrical parameters, and the resulting broken pits are numerically reconstructed to obtain the geometric parameters. The results show that the completely broken area consists of powdery rock debris. In the pre-broken zone, the mineral cementation of the rock determines the transition of type CⅠ cracks to type CⅡ and type CⅢ cracks. Furthermore, the peak pressure of the shockwave increased with initial peak voltage but decreased with electrode bit diameter, while the wave front time reduced. Moreover, increasing well depth, formation temperature and confining pressure augment and inhibit HERF, but once confining pressure surpassed the threshold of 60 MPa for 152.40, 215.90, and 228.60 mm electrode bits, and 40 MPa for 309.88 mm electrode bits, HERF is promoted. Additionally, for the same kind of rock, the volume and width of the broken pit increase with higher initial peak voltage, and rock fissures promote HERF. Finally, the electrode drill bit with a 215.90 mm diameter is more suitable for drilling pink granite. This research contributes to a better microscopic understanding of HERF and provides valuable insights for electrode bit selection, as well as for the optimization of circuit parameters for HERF technology.