Ray tracing is a computer graphics method that renders images realistically. As the name suggests, this technique traces the paths of light rays as they interact with objects in a scene [1], which permits the calculation of lighting and reflection effects [2]. Because ray tracing is time-consuming, parallelization is needed. One downside of parallelization is the possibility of race conditions. In this work, we explore and experiment with different well-known solutions to this race condition. After the introduction and background sections, which give a brief overview of the topic, we describe in detail how race conditions can occur in the ray tracing algorithm. In the methods and results sections, we use OpenMP to parallelize the ray tracing algorithm with the critical and atomic directives and the firstprivate clause. We conclude that critical and atomic are not efficient solutions for producing a good-quality picture, whereas firstprivate succeeded in producing a high-quality picture.
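The privatization idea behind firstprivate can be illustrated outside OpenMP. The following is a minimal Python analogue, not the paper's code: the `shade` function, the `ray` dictionary, and the image size are hypothetical stand-ins. Each worker starts from its own copy of the scratch state, initialized from a shared template, so no two workers ever write to the same object.

```python
from concurrent.futures import ThreadPoolExecutor
import copy

WIDTH, HEIGHT = 8, 4  # hypothetical tiny image


def shade(ray, x, y):
    """Hypothetical stand-in for tracing one pixel; mutates the scratch ray."""
    ray["origin"] = (x, y)
    ox, oy = ray["origin"]
    return (ox * 31 + oy * 17 + ray["bias"]) % 256


def render_row(args):
    y, template = args
    # Private copy initialized from the shared template: the analogue of
    # firstprivate. Sharing one ray object across workers would be a race.
    ray = copy.deepcopy(template)
    return y, [shade(ray, x, y) for x in range(WIDTH)]


def render(workers=4):
    template = {"origin": (0, 0), "bias": 5}
    image = [None] * HEIGHT
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = [(y, template) for y in range(HEIGHT)]
        for y, row in pool.map(render_row, jobs):
            image[y] = row  # each row is written exactly once
    return image
```

Because every row is computed from private state, the result does not depend on the number of workers, which is the property a race-free renderer needs.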
Based on the characteristics of intra prediction algorithms, the data dependence and parallelism between intra prediction models are first analyzed. This paper proposes a parallelization method based on the dynamic reconfigurable array processor provided by the project team, and uses data-level parallel (DLP) algorithms in multi-core units. The experimental results show that the Y-component peak signal-to-noise ratio (Y-PSNR) is improved by about 10 dB and the time is reduced by 63% compared with the high-efficiency video coding (HEVC) test model HM10.0. This method can effectively reduce the codec time of the video and reduce computational complexity.
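For readers unfamiliar with the metric, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below is a plain Python illustration of that formula on flat lists of luma samples; the function name and the 8-bit default are assumptions, and it is not taken from the paper or from HM10.0.

```python
import math


def psnr(reference, decoded, max_value=255):
    """PSNR in dB between two equal-length sample sequences (8-bit by default)."""
    if not reference or len(reference) != len(decoded):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)
```

Applied only to the Y plane of a frame, this gives the Y-PSNR figure the abstract refers to.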
The combined finite-discrete element method (FDEM) belongs to a family of methods of computational mechanics of discontinua. The method is suitable for problems of discontinua, where particles are deformable and can fracture or fragment. The applications of FDEM have spread over a number of disciplines including rock mechanics, where problems like mining, mineral processing or rock blasting can be solved by employing FDEM. In this work, a novel approach for the parallelization of two-dimensional (2D) FDEM aiming at clusters and desktop computers is developed. Dynamic domain decomposition based parallelization solvers covering all aspects of FDEM have been developed. These have been implemented into the open source Y2D software package and have been tested on a PC cluster. The overall performance and scalability of the parallel code have been studied using numerical examples. The results obtained confirm the suitability of the parallel implementation for solving large scale problems. © 2014 Institute of Rock and Soil Mechanics, Chinese Academy of Sciences. Production and hosting by Elsevier B.V. All rights reserved.
After the extension of depth modeling mode 4 (DMM-4) in 3D high efficiency video coding (3D-HEVC), the computational complexity increases sharply, which degrades the real-time performance of video coding. To reduce the computational complexity of DMM-4, a simplified hardware-friendly contour prediction algorithm is proposed in this paper. Based on the similarity between texture and depth map, the proposed algorithm directly codes depth blocks to calculate edge regions, reducing the number of reference blocks. Verified with the test sequence on HTM16.1, the coding time of the proposed algorithm is reduced by 9.42% compared with the original algorithm. To avoid the time cost of serial coding on HTM, a parallelization design of the proposed algorithm based on a reconfigurable array processor (DPR-CODEC) is proposed. The parallelization design reduces the storage access time and configuration time and saves storage cost. Verified with the Xilinx Virtex 6 FPGA, experimental results show that the parallelization design is capable of processing HD 1080p at a speed above 30 frames per second. Compared with related work, the scheme reduces the LUTs by 42.3%, the REG by 85.5% and the hardware resources by 66.7%. The data loading speedup ratio of the parallel scheme can reach 3.4539. On average, the serial/parallel speedup ratio of encoding time for templates of different sizes can reach 2.446.
This paper studies the complexity of multigrid parallelization on message passing computers. Parallelization is by domain decomposition. An optimal strip decomposition is proposed. With natural ordering of the grid points, the strip decomposition leads to good processor utilization. The efficiency could be significantly improved. Better performance could be achieved by making use of Van der Vorst ordering.
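A strip decomposition assigns each processor a contiguous band of grid rows. The abstract does not specify the paper's optimal strip decomposition, so the following is only a generic balanced split, written as a sketch with hypothetical names: strip sizes differ by at most one row.

```python
def strip_bounds(n_rows, n_procs):
    """Return half-open (start, stop) row ranges, one per processor rank."""
    if n_procs <= 0 or n_rows < 0:
        raise ValueError("need n_procs > 0 and n_rows >= 0")
    base, extra = divmod(n_rows, n_procs)
    bounds = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks take one additional row each.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds
```

For example, 10 rows over 3 processors gives strips of 4, 3, and 3 rows. In a message passing setting, each rank would additionally exchange its boundary rows with the ranks holding the adjacent strips.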
The Global-Regional Integrated forecast System (GRIST) is the next-generation integrated weather and climate model dynamic framework developed by the Chinese Academy of Meteorological Sciences. In this paper, we present several changes made to the global nonhydrostatic dynamical (GND) core, which is part of the ongoing prototype of GRIST. The changes, which leverage MPI and PnetCDF, target the parallelization and performance optimization of the original serial GND core. In addition, data structures and interfaces were designed to flexibly adjust the size of boundary and halo domains according to the variable accuracy in a parallel context. The I/O performance of PnetCDF decreases as the number of MPI processes increases in our experimental environment; in particular, when the number exceeds 6000, it causes system-wide outages (SWOs). A grouping solution was therefore proposed to overcome that issue. Several experiments were carried out on the supercomputing platform based on Intel x86 CPUs at the National Supercomputing Center in Wuxi. The results demonstrate that the parallel GND core based on the grouping solution achieves good strong scalability and improves performance significantly, while avoiding the SWOs.
To reduce the computational complexity and storage cost caused by the wedge segmentation algorithm, a scheme for simplifying wedge matching is proposed. It takes advantage of the correlation between the wedge separation line of the depth map and the direction of intra-prediction for 3D high-efficiency video coding (3D-HEVC). According to the difference in wedge segmentation between adjacent edges and opposite edges, a set including only 10 4×4 wedgelet templates is given. By expanding the wedge wave of a certain minimum unit, a simple separation line acquisition method for depth blocks of different sizes is put forward. Furthermore, based on the array processor (DPR-CODEC) developed by the project team, an efficient parallel scheme for the improved wedge segmentation mode prediction is introduced. With this scheme, the prediction unit (PU) size can be changed freely from 4×4 to 8×8, 16×16, and 32×32, which is more in line with the needs of the HEVC standard. Verified with the test sequence in HTM16.1 and on the Xilinx Virtex-6 field programmable gate array (FPGA) respectively, the experimental results show that the proposed methods save 99.2% of the storage space and 63.94% of the encoding time, and the serial/parallel acceleration ratio of each template reaches 1.84 on average. The coding performance, storage and resource consumption are considered for both.
With the development of satellite remote sensing technology, increasing demands are placed on the timeliness and stability of the satellite weather service system. The FY satellite rainfall estimate day knock off product algorithm has a long run time, about 20 minutes, which affects the timeliness of the estimated rainfall product. Research and development of parallel optimization algorithms based on the needs of satellite meteorological services, and evaluation of their effectiveness in practical applications, are necessary to enhance the high-performance and high-availability capabilities of these services. To address this problem, we began parallel algorithm research based on an analysis of the precipitation estimation algorithm. First, we explain the steps of the precipitation estimate day knock off product algorithm; second, we analyze the computational load of the four main calculation modules; third, a multithreaded parallel algorithm and an MPI parallelization are designed. Finally, the multithreaded and MPI parallelizations are implemented. Experimental results show that both the multithreaded and the MPI parallel algorithms greatly improve overall computational efficiency, and that the MPI parallelization has the higher operating efficiency. The performance of parallel processing is closely related to the architecture of the computer. From the perspective of service scheduling and product algorithms, the MPI parallelization approach is adopted to improve service quality.
An OpenMP approach was proposed to parallelize sequential molecular dynamics (MD) code on shared memory machines. When a code is converted from sequential to parallel form, data dependence is a main problem. A traditional sequential molecular dynamics code was analyzed to find the segments with data dependence, and two different methods, the recover method and the backward mapping method, were used to eliminate those data dependencies in order to parallelize this sequential MD code. The performance of the parallelized MD code was analyzed using performance analysis tools. The test results show that the problem size this code can handle increases sharply from 1 million atoms before parallelization to 20 million atoms after parallelization, and the wall clock time during computing is greatly reduced. Some hot spots in this code were found and optimized with improved algorithms. The efficiency of parallel computing is 30% higher than before, calculation time is saved, and larger-scale calculation problems can be solved.
The Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors from high-resolution images in real time. In this paper, an approach to parallelizing the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: (a) algorithm design, comprising generic design strategies that focus on data, and (b) implementation design, comprising architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches, and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
The general m-machine permutation flowshop problem with the total flow-time objective is known to be NP-hard for m ≥ 2. The only practical method for finding optimal solutions has been branch-and-bound algorithms. In this paper, we present an improved sequential algorithm which is based on a strict alternation of Generation and Exploration execution modes as well as Depth-First/Best-First hybrid strategies. The experimental results show that the proposed scheme exhibits improved performance compared with the algorithm in [1]. More importantly, our method can be easily extended and implemented with lightweight threads to speed up the execution times. Good speedups can be obtained on shared-memory multicore systems.
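The objective being minimized can be made concrete with a short sketch. The code below evaluates the total flow time of a job permutation using the standard flowshop recurrence C(j, k) = max(C(j-1, k), C(j, k-1)) + p(j, k), and adds an exhaustive search as a baseline. It is an illustration with hypothetical names, not the paper's branch-and-bound algorithm, which would prune most of the permutations the baseline enumerates.

```python
from itertools import permutations


def total_flow_time(perm, proc):
    """Sum of last-machine completion times; proc[job][machine] is a duration."""
    n_machines = len(proc[0])
    prev = [0] * n_machines  # completion times of the previously scheduled job
    total = 0
    for job in perm:
        cur = [0] * n_machines
        for k in range(n_machines):
            # A job starts on machine k once the machine is free and the job
            # has finished on machine k - 1.
            ready = max(prev[k], cur[k - 1] if k > 0 else 0)
            cur[k] = ready + proc[job][k]
        total += cur[-1]
        prev = cur
    return total


def brute_force_best(proc):
    """Exhaustive baseline over all permutations (feasible only for tiny n)."""
    jobs = range(len(proc))
    return min(permutations(jobs), key=lambda p: total_flow_time(p, proc))
```

With two jobs and durations `[[3, 2], [1, 4]]`, scheduling job 1 first yields a total flow time of 12 against 14 for the opposite order.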
A rate-dependent peridynamic ceramic model, considering the brittle tensile response, compressive plastic softening and strain-rate dependence, can accurately represent the dynamic response and crack propagation of ceramic materials. However, it also considers the strain-rate dependence and damage accumulation caused by compressive plastic softening during the compression stage, requiring more computational resources for the bond force evaluation and damage evolution. Herein, the OpenMP parallel optimization of the rate-dependent peridynamic ceramic model is investigated. The modules that compute the interactions between material points and update the damage index are vectorized and parallelized. Numerical examples are carried out to simulate the dynamic response and fracture of a ceramic plate under normal impact. Furthermore, the speed-up ratio and computational efficiency with multiple threads are evaluated and discussed to demonstrate the reliability of the parallelized programs. The results reveal that the total wall clock time is significantly reduced after optimization, showing the promise of the parallelization process in terms of accuracy and stability.
In this paper, we present parallel programming approaches to calculate the values of the cells in the scoring matrix used in the Smith-Waterman algorithm for sequence alignment. This algorithm, well known in bioinformatics for its applications, is unfortunately time-consuming on a serial computer. We use a formulation based on the anti-diagonal structure of the data. This representation focuses on the parallelizable parts of the algorithm without changing its initial formulation. Approaching the data in this way gives us a more flexible formulation. To examine this approach, we implement it in OpenMP and CUDA C. The performance obtained shows the value of our approach.
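The anti-diagonal idea can be sketched as follows. Each cell H[i][j] depends only on H[i-1][j-1], H[i-1][j] and H[i][j-1], all of which lie on earlier anti-diagonals, so every cell with the same i + j can be computed independently. The code below is a serial Python illustration of that traversal order, assuming a linear gap penalty and hypothetical scoring values; it is not the paper's OpenMP or CUDA implementation.

```python
def sw_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):  # d = i + j for interior cells
        i_lo = max(1, d - m)
        i_hi = min(n, d - 1)
        # All cells in this inner loop are mutually independent; this is the
        # loop a parallel version would distribute across threads.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best, H
```

The values produced are identical to those of the usual row-by-row fill; only the visiting order changes, which is why the formulation of the algorithm itself is untouched.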
The parallelization of the diagnostics for climate research has been an important goal in the performance testing and improvement of the diagnostics for the Department of Energy's (DOE's) Accelerated Climate Modeling for Energy (ACME) project [1]. The primary mission of the ACME project is to build and test the next-generation Earth system model for current and future generations of computing systems operated by the DOE Office of Science computing facilities, including the envisioned exascale systems foreseen in the early part of the next decade. As part of the underpinning workflow environment, a diagnostics, model metrics, and intercomparison Python framework, called UVC Metrics, was created to aid in testing and production execution of the model. This framework builds on common methods and similar metrics to accommodate and diagnose individual component models, such as atmosphere, land, ocean, sea ice, and land ice. This paper reports on initial parallelization of UVC Metrics for the atmosphere model component using two popular frameworks: MPI and SPARK. A timing study is presented to assess the performance of each method, in which significant improvement was achieved for both frameworks despite I/O contention with NFS. The advantages and disadvantages of each framework are also presented.
Pseudo-Particle Modeling (PPM) is a particle method proposed by Ge and Li in 1996 [Ge, W., & Li, J. (1996). Pseudo-particle approach to hydrodynamics of particle-fluid systems, in M. Kwauk & J. Li (Eds.), Proceedings of the 5th international conference on circulating fluidized bed (pp. 260-265). Beijing: Science Press] and has been used to explore the microscopic mechanism in complex particle-fluid systems. As with other particle methods, however, high computational cost remains a main obstacle to its large-scale application; therefore, parallel implementation of this method is highly desirable. In this paper, two-dimensional PPM was parallelized by spatial decomposition. The time costs of the major functions in the program were analyzed, and the program was then optimized for higher efficiency by dynamic load balancing and resetting of particle arrays. Finally, simulation of a gas-solid fluidized bed with 102,400 solid particles and 1.8 × 10^7 pseudo-particles was performed successfully with this code, indicating its scalability in future applications.
Analyzing power flow transfer (PFT) in advance under various anticipated faults is important for securing power system operations. In China, PSD-BPA software is the most widely used tool for power system analysis, but its input/output interface is not easily adapted for PFT analysis, which is also difficult because it is computationally intensive. To solve this issue and achieve fast and accurate PFT analysis, a modular parallelization framework is developed in this paper. Two major contributions are included. One is a set of integrated PFT analysis modules, including parameter initialization, fault setting, network integrity detection, reasonableness identification and result analysis. The other is a parallelization technique for enhancing computation efficiency using a Fork/Join framework. The proposed framework has been tested and validated on the IEEE 39 bus reference power system. Furthermore, it has been applied to a practical power network with 11052 buses and 12487 branches in the Yunnan Power Grid of China, providing decision support for large-scale power system analysis.
Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media, such as 3DTV and free viewpoint TV (FTV). However, even with mainstream computer manufacturers migrating to multi-core processors, the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper, we demonstrate the design and implementation of the first parallel MVC system on the Cell Broadband Engine™ processor, a state-of-the-art multi-core processor. We propose an adaptive, data-driven task-dispatching algorithm at the frame level for MVC, and implement a parallel multi-view video decoder with a modified H.264/AVC codec on a real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management, utilization of code locality and SIMD improvement. Decoding speed, speedup and core utilization rate are reported in the experimental results.
Due to the complex high-temperature characteristics of hydrocarbon fuel, the long-term working process of parallel channel structures under variable working conditions, especially under high heat-mass ratio, has not been systematically studied. In this paper, the heat transfer and flow characteristics of the relevant high-temperature fuels are studied using a typical engine parallel channel structure. Through numerical simulation and systematic experimental verification, the flow and heat transfer characteristics of parallel channels under typical working conditions are obtained, and the effectiveness of the high-precision calculation method is preliminarily established. The time required for a regenerative cooling engine to stabilize after a hot start is found to be about 50 s, and the flow resistance of the parallel channel structure first increases and then decreases as the equivalence ratio (denoted Φ hereafter) increases, with a flow resistance peak in the range Φ = 0.5~0.8. This is mainly caused by the coupling of the high-temperature physical properties, flow rate and pressure of the fuel in the parallel channels. In addition, the cooling and heat transfer characteristics of parallel channels under some conditions of high heat-mass ratio are obtained, and the main factors affecting heat transfer in parallel channels, such as improved surface roughness and heat transfer enhancement, are identified. In the experiment, when Φ is less than 0.9, local heat transfer enhancement and deterioration can be clearly observed, and the temperature rise of local structures exceeds 200 ℃, which poses a risk of structural damage. Therefore, the long-term reliability of the parallel channel structure under high heat-mass ratio should be fully considered in structural design.
The new encoding tools of high efficiency video coding (HEVC) make the interpolation operation in motion compensation (MC) more complex in exchange for better video compression, and impose higher requirements on the computational efficiency and control logic of the hardware architecture. A reconfigurable array processor can accommodate both computational efficiency and flexible switching between algorithms. By mining the data dependency and parallelism within the interpolation operation, this paper presents a parallelization method based on the dynamic reconfigurable array processor proposed by the project team. The number of pixels loaded from external memory is reduced significantly by reusing the data common to the previous reference block and the current reference block. Flexible switching between variable block operations is realized by a dynamic reconfiguration mechanism. A 16×16 array of processing elements (PEs) is used to dynamically process block sizes from 4×4 to 64×64. The experimental results show that the reference block update speed is increased by 39.9%. With an array size of 16 PEs, the number of pixels processed in parallel reaches 16.
Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection. Findings – The paper shows that the processing time of the proposed method is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value – The study described in this paper is a new approach to increasing the sustainability of application programs using distributed GA on GPUs and MCPs.
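The header-based check described in the abstract can be sketched in a few lines. The following is a hypothetical illustration only: the `Migration` record, the ring topology, and the policy of discarding a mismatched packet are all assumptions, since the abstract states only that the sending island's ID is placed in the header and used for fault detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    sender_id: int      # header field: ID of the island that sent the data
    individuals: tuple  # payload: migrating individuals


def expected_sender(receiver_id, n_islands):
    """Assumed ring topology: island k receives from island k - 1."""
    return (receiver_id - 1) % n_islands


def accept(packet, receiver_id, n_islands):
    """Return the migrants, or an empty tuple if the header looks faulty."""
    if packet.sender_id != expected_sender(receiver_id, n_islands):
        # Assumed policy: drop the packet and let the island keep evolving
        # with its own population.
        return ()
    return packet.individuals
```

A check of this kind costs one integer comparison per migration, which is consistent with the abstract's claim that the overhead is practically negligible.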
Funding (entry on intra prediction parallelization): Supported by the National Natural Science Foundation of China (No. 61772417, 61634004, 61602377, 61272120), the Shaanxi Provincial Co-ordination Innovation Project of Science and Technology (No. 2016KTZDGY02-04-02), and the Shaanxi Provincial Key R&D Plan (No. 2017GY-060).
Funding (entry on DMM-4 contour prediction in 3D-HEVC): Supported by the National Natural Science Foundation of China (No. 61834005, 61772417, 61802304, 61602377, 61874087, 61634004) and the Shaanxi Province Key R&D Plan (No. 2020JM-525, 2021GY-029, 2021KW-16).
Funding (entry on the GRIST GND core): This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFC1502203.
Funding: Supported by the National Natural Science Foundation of China (Nos. 61834005, 61772417, 61802304, 61602377, 61874087, 61634004) and the Shaanxi International Science and Technology Cooperation Program (No. 2018KW-006).
Abstract: To reduce the computational complexity and storage cost caused by the wedge segmentation algorithm, a scheme for simplifying wedge matching is proposed. It takes advantage of the correlation between the wedge separation line of the depth map and the direction of intra prediction in 3D high-efficiency video coding (3D-HEVC). According to the difference in wedge segmentation between adjacent edges and opposite edges, a set including only 104×4 wedgelet templates is given. By expanding the wedgelet of a certain minimum unit, a simple separation line acquisition method for depth blocks of different sizes is put forward. Furthermore, based on the array processor (DPR-CODEC) developed by the project team, an efficient parallel scheme for the improved wedge segmentation mode prediction is introduced. With this scheme, the prediction unit (PU) size can be changed freely from 4×4 to 8×8, 16×16, and 32×32, which is more in line with the needs of the HEVC standard. Verified with test sequences in HTM16.1 and on the Xilinx Virtex-6 field programmable gate array (FPGA) respectively, the experimental results show that the proposed methods save 99.2% of the storage space and 63.94% of the encoding time, and the serial/parallel acceleration ratio of each template reaches 1.84 on average. Coding performance, storage and resource consumption are all taken into account.
Abstract: With the development of satellite remote sensing technology, more and more requirements are placed on the timeliness and stability of the satellite weather service system. The FY satellite rainfall estimate day knock off product algorithm has a long run time, about 20 minutes, which affects the timeliness of the generated rainfall estimate product. Research and development of parallel optimization algorithms based on the needs of satellite meteorological services, and evaluation of their effectiveness in practical applications, are necessary to enhance the high-performance and high-availability capabilities of these services. To address this problem, we carried out parallel algorithm research based on an analysis of the precipitation estimation algorithm. First, we explain the steps of the precipitation estimate day knock off product algorithm; second, we analyze the computational load of the four main calculation modules; third, a multithreaded parallel algorithm and an MPI parallelization are designed. Finally, the multithreaded and MPI parallelizations are implemented. Experimental results show that both the multithreaded and the MPI parallel algorithms greatly improve overall computational efficiency, and that the MPI mode has the higher operating efficiency. The performance of parallel processing is closely related to the architecture of the computer. From the perspective of service scheduling and product algorithms, the MPI parallelization approach is adopted to improve service quality.
Funding: Project (50371026) supported by the National Natural Science Foundation of China.
Abstract: An OpenMP approach was proposed to parallelize sequential molecular dynamics (MD) code on shared memory machines. When a code is converted from sequential to parallel form, data dependence is a main problem. A traditional sequential molecular dynamics code was analyzed to find the segments with data dependence, and two different methods, i.e., the recover method and the backward mapping method, were used to eliminate those data dependencies in order to parallelize this sequential MD code. The performance of the parallelized MD code was analyzed using performance analysis tools. The test results show that the problem size this code can handle increases sharply from 1 million atoms before parallelization to 20 million atoms after parallelization, and the wall clock time of the computation is greatly reduced. Some hot spots in the code were found and optimized with an improved algorithm. The efficiency of parallel computing is 30% higher than before, calculation time is saved, and larger scale problems can be solved.
Abstract: The Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors for high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: (a) algorithm design, consisting of generic design strategies that focus on data, and (b) implementation design, consisting of architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
Abstract: The general m-machine permutation flowshop problem with the total flow-time objective is known to be NP-hard for m ≥ 2. The only practical method for finding optimal solutions has been branch-and-bound algorithms. In this paper, we present an improved sequential algorithm which is based on a strict alternation of Generation and Exploration execution modes as well as Depth-First/Best-First hybrid strategies. The experimental results show that the proposed scheme exhibits improved performance compared with the algorithm in [1]. More importantly, our method can be easily extended and implemented with lightweight threads to reduce execution time. Good speedups can be obtained on shared-memory multicore systems.
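To make the total flow-time objective concrete, the following minimal sketch evaluates it for one job order using the standard flowshop completion-time recurrence. It is not the branch-and-bound algorithm of the paper; the names `total_flowtime`, `brute_force_best` and the `proc[j][k]` data layout are assumptions introduced here for illustration, and the exhaustive search is only feasible for a handful of jobs.

```python
from itertools import permutations


def total_flowtime(proc, order):
    """Sum of job completion times on the last machine for a given job order.

    proc[j][k] is the processing time of job j on machine k
    (hypothetical layout chosen for this sketch).
    """
    m = len(proc[0])
    finish = [0] * m  # finish[k]: completion time of the latest job on machine k
    total = 0
    for j in order:
        for k in range(m):
            # A job starts on machine k once the machine is free and the
            # job has left machine k - 1.
            prev_machine = finish[k - 1] if k > 0 else 0
            finish[k] = max(finish[k], prev_machine) + proc[j][k]
        total += finish[m - 1]
    return total


def brute_force_best(proc):
    """Exhaustive search over all permutations (illustration only)."""
    jobs = range(len(proc))
    return min(permutations(jobs), key=lambda order: total_flowtime(proc, order))
```

A branch-and-bound method would replace the exhaustive loop with partial sequences and lower bounds; the objective function stays the same.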
Funding: Supported by the National Natural Science Foundation of China (Nos. 11972267, 11802214 and 51932006) and the Fundamental Research Funds for the Central Universities (WUT: 2020lll031GX).
Abstract: A rate-dependent peridynamic ceramic model, considering the brittle tensile response, compressive plastic softening and strain-rate dependence, can accurately represent the dynamic response and crack propagation of ceramic materials. However, it also considers the strain-rate dependence and damage accumulation caused by compressive plastic softening during the compression stage, requiring more computational resources for the bond force evaluation and damage evolution. Herein, the OpenMP parallel optimization of the rate-dependent peridynamic ceramic model is investigated. The modules that compute the interactions between material points and update the damage index are vectorized and parallelized. Numerical examples are carried out to simulate the dynamic response and fracture of a ceramic plate under normal impact. Furthermore, the speed-up ratio and computational efficiency with multiple threads are evaluated and discussed to demonstrate the reliability of the parallelized programs. The results reveal that the total wall clock time is significantly reduced after optimization, showing the promise of the parallelization process in terms of accuracy and stability.
Abstract: In this paper, we present parallel programming approaches to calculate the values of the cells in the scoring matrix used in the Smith-Waterman algorithm for sequence alignment. This algorithm, well known in bioinformatics for its applications, is unfortunately time-consuming on a serial computer. We use a formulation based on the anti-diagonal structure of the data. This representation focuses on the parallelizable parts of the algorithm without changing its initial formulation. Approaching the data in this way gives us a more flexible formulation. To examine this approach, we implement it in OpenMP and CUDA C. The performance obtained demonstrates the value of the approach.
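The anti-diagonal idea can be sketched as follows: every cell (i, j) of the scoring matrix depends only on cells with a smaller i + j, so all cells on one anti-diagonal are independent of each other and could be computed in parallel. The sketch below is serial Python that only demonstrates the traversal order; the function name and the scoring parameters (match, mismatch, linear gap penalty) are assumptions made here, not values taken from the paper.

```python
def sw_score_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix one anti-diagonal at a time.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    # d = i + j indexes the anti-diagonal; cells with i >= 1 and j >= 1 start at d = 2.
    for d in range(2, n + m + 1):
        lo = max(1, d - m)  # keeps j = d - i <= m
        hi = min(n, d - 1)  # keeps j = d - i >= 1
        # The iterations of this inner loop are mutually independent:
        # each reads only anti-diagonals d - 1 and d - 2.
        for i in range(lo, hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # diagonal: align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best
```

In an OpenMP or CUDA version, the inner loop over `i` is the part that would be distributed across threads, with a synchronization point between consecutive anti-diagonals.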
Abstract: The parallelization of the diagnostics for climate research has been an important goal in the performance testing and improvement of the diagnostics for the Department of Energy's (DOE's) Accelerated Climate Modeling for Energy (ACME) project [1]. The primary mission of the ACME project is to build and test the next-generation Earth system model for current and future generations of computing systems operated by the DOE Office of Science computing facilities, including the exascale systems envisioned for the early part of the next decade. As part of the underpinning workflow environment, a diagnostics, model metrics, and intercomparison Python framework called UVC Metrics was created to aid in testing and production execution of the model. This framework builds on common methods and similar metrics to accommodate and diagnose individual component models, such as atmosphere, land, ocean, sea ice, and land ice. This paper reports on the initial parallelization of UVC Metrics for the atmosphere model component using two popular frameworks: MPI and SPARK. A timing study is presented to assess the performance of each method; significant improvement was achieved for both frameworks despite I/O contention with NFS. The advantages and disadvantages of each framework are also presented.
Funding: Supported by the Designated Funding for Winners of President's Awards of the Chinese Academy of Sciences (CAS, 2006) and by the National Natural Science Foundation of China (NSFC) under Grant Nos. 20221603 and 20706057.
Abstract: Pseudo-Particle Modeling (PPM) is a particle method proposed by Ge and Li in 1996 [Ge, W., & Li, J. (1996). Pseudo-particle approach to hydrodynamics of particle-fluid systems, in M. Kwauk & J. Li (Eds.), Proceedings of the 5th international conference on circulating fluidized bed (pp. 260-265). Beijing: Science Press] and has been used to explore the microscopic mechanism in complex particle-fluid systems. As with other particle methods, however, high computational cost remains a main obstacle to its large-scale application; therefore, parallel implementation of this method is highly desirable. In this paper, two-dimensional PPM was parallelized by spatial decomposition. The time costs of the major functions in the program were analyzed, and the program was then optimized for higher efficiency by dynamic load balancing and resetting of particle arrays. Finally, a simulation of a gas-solid fluidized bed with 102,400 solid particles and 1.8 × 10^7 pseudo-particles was performed successfully with this code, indicating its scalability in future applications.
Funding: Supported by the Major International Joint Research Project of the National Natural Science Foundation of China (No. 51210014) and the Major Program of the National Natural Science Foundation of China (No. 91547201).
Abstract: Power flow transfer (PFT) analysis under various anticipated faults, performed in advance, is important for securing power system operations. In China, PSD-BPA software is the most widely used tool for power system analysis, but its input/output interface is not easily adapted for PFT analysis, which is also difficult because of its computational intensity. To solve this issue and achieve fast and accurate PFT analysis, a modular parallelization framework is developed in this paper. It makes two major contributions. One is a set of integrated PFT analysis modules, including parameter initialization, fault setting, network integrity detection, reasonableness identification and result analysis. The other is a parallelization technique that enhances computational efficiency using a Fork/Join framework. The proposed framework has been tested and validated on the IEEE 39 bus reference power system. Furthermore, it has been applied to a practical power network with 11052 buses and 12487 branches in the Yunnan Power Grid of China, providing decision support for large-scale power system analysis.
Funding: Supported partially by the National Natural Science Foundation of China (Grant No. 60503063), the National High-Tech Research & Development Program of China (Grant No. 2006AA01Z321) and the National Basic Research Program of China (Grant No. 2006CB303103).
Abstract: Multi-view video coding (MVC) carries rich 3D information and is widely used in new visual media, such as 3DTV and free viewpoint TV (FTV). However, even with mainstream computer manufacturers migrating to multi-core processors, the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper, we demonstrate the design and implementation of the first parallel MVC system on the Cell Broadband Engine™ processor, a state-of-the-art multi-core processor. We propose an adaptive, data-driven task-dispatching algorithm that works at the frame level for MVC, and implement a parallel multi-view video decoder with a modified H.264/AVC codec on a real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management, utilization of code locality and SIMD improvement. Decoding speed, speedup and core utilization rate are reported in the experimental results.
Abstract: Due to the complex high-temperature characteristics of hydrocarbon fuel, research on the long-term working process of parallel channel structures under variable working conditions, especially under a high heat-mass ratio, has not been systematically carried out. In this paper, the heat transfer and flow characteristics of the relevant high-temperature fuels are studied using a typical engine parallel channel structure. Through numerical simulation and systematic experimental verification, the flow and heat transfer characteristics of parallel channels under typical working conditions are obtained, and the effectiveness of the high-precision calculation method is preliminarily established. The stable time required for a hot start of the regenerative cooling engine is about 50 s. The flow resistance of the parallel channel structure first increases and then decreases as the equivalence ratio (denoted Φ below) increases, with a flow resistance peak in the range Φ = 0.5 to 0.8. This is mainly caused by the coupling effect of the high-temperature physical properties, flow rate and pressure of the fuel in the parallel channels. At the same time, the cooling and heat transfer characteristics of parallel channels under some high heat-mass ratio conditions are obtained, and the main factors affecting heat transfer in parallel channels, such as improving surface roughness and strengthening heat transfer, are identified. In the experiment, when Φ is less than 0.9, local heat transfer enhancement and deterioration can be clearly observed, and the temperature rise of local structures exceeds 200℃, which poses a risk of structural damage. Therefore, the reliability of long-term parallel channel structures under high heat-mass ratio conditions should be fully considered in structural design.
Funding: Supported by the National Natural Science Foundation of China (61834005, 61772417, 61802304, 61874087, 61602377, 61634004, 61272120), the Shaanxi Province Coordination Innovation Project of Science and Technology (2016KTZDGY02-04-02), the Shaanxi Provincial Key R&D Plan (2017GY-060) and the Shaanxi International Science and Technology Cooperation Program (2018KW-006).
Abstract: The new encoding tools of high efficiency video coding (HEVC) make the interpolation operation in motion compensation (MC) more complex in exchange for better video compression, and thus impose higher requirements on the computational efficiency and control logic of the hardware architecture. A reconfigurable array processor can accommodate both computational efficiency and flexible switching of algorithms. By mining the data dependency and parallelism in the interpolation operation, this paper presents a parallelization method based on the dynamic reconfigurable array processor proposed by the project team. The number of pixels loaded from external memory is reduced significantly by reusing the data shared between the previous reference block and the current reference block. Flexible switching between variable block operations is realized with a dynamic reconfiguration mechanism. A 16×16 processor element (PE) array is used to dynamically process block sizes from 4×4 to 64×64. The experimental results show that the reference block update speed is increased by 39.9%. With an array size of 16 PEs, the number of pixels processed in parallel reaches 16.
Abstract: Purpose: The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach: For distributed genetic algorithm (GA) models, the paper proposes a method where an island's ID number is added to the header of data transferred by this island for use in fault detection. Findings: The paper shows that the processing time of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value: The study described in this paper is a new approach to increasing the sustainability of application programs using distributed GA on GPUs and MCPs.
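One way the island-ID header could be used for fault detection is sketched below. The abstract only states that the sending island's ID is added to the header of the transferred data; the packet layout, the names `Migration`, `send` and `receive`, and the two checks performed by the receiver are assumptions made here for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Migration:
    island_id: int                # header: ID of the sending island
    individuals: List[List[int]]  # payload: migrating individuals (bit strings)


def send(island_id: int, individuals: List[List[int]]) -> Migration:
    """Tag the outgoing migrants with the sender's island ID."""
    return Migration(island_id, individuals)


def receive(packet: Migration, expected_sender: int,
            num_islands: int) -> Optional[List[List[int]]]:
    """Return the migrants, or None if the header looks faulty.

    Assumed checks: the ID must be a valid island number and must match
    the neighbor this island expects to receive from.
    """
    if not 0 <= packet.island_id < num_islands:
        return None  # ID out of range, e.g. a corrupted header
    if packet.island_id != expected_sender:
        return None  # data arrived from the wrong island
    return packet.individuals
```

On a rejected packet, the receiving island would simply continue evolving with its own population for that migration step, which keeps the per-migration overhead to a pair of integer comparisons.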