Funding: Supported by the National Basic Research Program of China (No. 2012CB316502), the National High Technology Research and Development Program of China (No. 2009AA01A129), and the National Natural Science Foundation of China (No. 60921002).
Abstract: The wide acceptance of medical image processing and the accompanying data deluge require faster and more efficient systems to be built. Due to recent advances in heterogeneous architectures, there has been a resurgence of research aimed at FPGA-based as well as GPGPU-based accelerator design. This paper quantitatively analyzes the workload, computational intensity and memory performance of a single-particle 3D reconstruction application called EMAN, and parallelizes it on CUDA GPGPU architectures; it decouples the memory operations from the computing flow and orchestrates the thread-data mapping to reduce the overhead of off-chip memory operations. It then exploits the trend towards FPGA-based accelerator design by offloading computing-intensive kernels to dedicated hardware modules. Furthermore, a customized memory subsystem is designed to facilitate the decoupling and optimization of computing-dominated data access patterns. This paper evaluates the proposed accelerator design strategies by comparing them with a parallelized program on a 4-core CPU. The CUDA version on a GTX480 shows a speedup of about 6 times. The stream architecture implemented on a Xilinx Virtex LX330 FPGA achieves a reported speedup of 2.54 times. Meanwhile, measured in terms of power efficiency, the FPGA-based accelerator outperforms the 4-core CPU and the GTX480 by 7.3 times and 3.4 times, respectively.
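The abstract does not spell out how the memory operations are decoupled from the computing flow; the sketch below only illustrates the general CUDA idiom it alludes to: threads cooperatively stage a coalesced tile of global memory into shared memory, synchronize, and then compute from the on-chip copy. The kernel name, tile size and the reduction performed are hypothetical and are not taken from EMAN.

```cuda
#include <cuda_runtime.h>

// Illustrative only: stage a tile of global data into shared memory with
// coalesced loads (the "memory" phase), then compute from the on-chip copy
// (the "compute" phase). Launch with blockDim.x == TILE.
#define TILE 256

__global__ void stage_and_accumulate(const float* __restrict__ in,
                                     float* __restrict__ out, int n)
{
    __shared__ float tile[TILE];

    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        int idx = base + threadIdx.x;
        // Memory phase: one coalesced global load per thread.
        tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
        __syncthreads();

        // Compute phase: work only on the shared-memory copy.
        for (int k = 0; k < TILE; ++k)
            acc += tile[k] * tile[(k + threadIdx.x) % TILE];
        __syncthreads();
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```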
Abstract: Numerical treatment of engineering application problems often eventually results in solving systems of linear or nonlinear equations. The solution process on digital computing devices usually takes a tremendous amount of time due to the extremely large problem sizes encountered in most real-world engineering applications. Therefore, practical solvers for systems of linear and nonlinear equations based on multiple graphics processing units (GPUs) are proposed in order to accelerate the solving process. In the linear and nonlinear solvers, the preconditioned bi-conjugate gradient stabilized (PBi-CGstab) method and the inexact Newton method are used to achieve fast and stable convergence behavior. Multiple GPUs are utilized to provide the additional data storage that large-scale problems require.
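The abstract gives no implementation details of the PBi-CGstab solver. As a hedged illustration, the dominant kernel in such Krylov solvers is usually the sparse matrix-vector product, sketched below as a one-row-per-thread CSR SpMV together with the vector update used several times per iteration. The CSR layout and kernel names are assumptions, not the paper's code; a multi-GPU version would additionally partition rows across devices and exchange halo entries of x.

```cuda
#include <cuda_runtime.h>

// Hypothetical CSR sparse matrix-vector product y = A*x, one row per thread.
// This is the kernel that typically dominates each PBi-CGstab iteration.
__global__ void spmv_csr(int n_rows,
                         const int* __restrict__ row_ptr,
                         const int* __restrict__ col_idx,
                         const double* __restrict__ val,
                         const double* __restrict__ x,
                         double* __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += val[j] * x[col_idx[j]];
    y[row] = sum;
}

// Vector update y = y + alpha*x, used repeatedly inside BiCGstab.
__global__ void axpy(int n, double alpha,
                     const double* __restrict__ x, double* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += alpha * x[i];
}
```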
Funding: Supported by the National Natural Science Foundation of China (Nos. 61572508, 61272144, 61303065 and 61202121), the National High Technology Research and Development Program (863) of China (No. 2012AA010905), the Research Project of National University of Defense Technology (No. JC13-06-02), the Doctoral Fund of Ministry of Education of China (No. 20134307120028), and the Research Fund for the Doctoral Program of Higher Education of China (No. 20114307120013).
Abstract: Simulation is an important means of performance evaluation for computer architectures. At present, the serial simulation of general-purpose graphics processing unit (GPGPU) architectures is the main bottleneck for simulation speed. To address this issue, we propose intra-kernel parallelization on a multicore processor and inter-kernel parallelization on a multiple-machine platform, and apply both methods to the GPGPU-sim simulator. The intra-kernel parallelization method first parallelizes the serial simulation of multiple compute units within one cycle. It then parallelizes the timing and functional simulation to reduce the performance loss caused by synchronization between different compute units. The inter-kernel parallelization method divides the multiple kernels of a CUDA program into several groups and distributes these groups across multiple simulation hosts. Experimental results show that the intra-kernel parallelization method achieves a speedup of up to 12 with a maximum error rate of 0.0094% on a 32-core machine, and the inter-kernel parallelization method accelerates the simulation by a factor of up to 3.9 with a maximum error rate of 0.11% on four simulation hosts. The orthogonality of the two methods allows them to be combined on multiple multi-core hosts for further performance improvements.
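The intra-kernel method above parallelizes the per-cycle simulation of the compute units. A minimal host-side C++ sketch of that idea follows; the ComputeUnit type and step_one_cycle function are hypothetical stand-ins, and GPGPU-sim's real code is organized quite differently.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for one simulated compute unit (shader core).
struct ComputeUnit {
    long long cycles = 0;
    void step_one_cycle() { ++cycles; /* timing + functional model here */ }
};

// Simulate one global cycle: the per-unit work is independent within a
// cycle, so each unit advances on its own thread; the join acts as the
// end-of-cycle synchronization between units.
void simulate_cycle(std::vector<ComputeUnit>& units)
{
    std::vector<std::thread> workers;
    workers.reserve(units.size());
    for (auto& cu : units)
        workers.emplace_back([&cu] { cu.step_one_cycle(); });
    for (auto& w : workers)
        w.join();
}

int main()
{
    std::vector<ComputeUnit> units(32);   // e.g. 32 simulated compute units
    for (int cycle = 0; cycle < 1000; ++cycle)
        simulate_cycle(units);
    return 0;
}
```

A production simulator would keep a persistent thread pool rather than spawning threads every simulated cycle; the sketch favors brevity over efficiency.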
Funding: Supported by the National Natural Science Foundation of China (Nos. 40974066 and 40821062) and the National Basic Research Program of China (No. 2007CB209602).
Abstract: General-purpose graphics processing unit (GPU) computing is gradually being adopted in many fields. Its single-instruction, multiple-thread execution model is well suited to seismic numerical simulation, which involves huge quantities of data and calculation steps. In this study, we introduce a GPU-based parallel calculation method for a precise integration method (PIM) for seismic forward modeling. Compared with single-core CPU calculation, the GPU parallel implementation fully preserves the features of PIM (small bandwidth, high accuracy and the capability of modeling complex substructures) while delivering high computational efficiency, which means that high-performance GPU parallel calculation can bring seismic forward modeling closer to real seismic records.
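The abstract does not describe the precise integration method's update formula, so the sketch below only shows the generic SIMT mapping the paper relies on: each CUDA thread owns one grid point of the wavefield and applies the same update step. The second-order finite-difference stencil used here is a placeholder for the actual PIM recursion, and all names and parameters are assumptions.

```cuda
#include <cuda_runtime.h>

// Placeholder wavefield update: one thread per grid point, all threads run
// the same instruction stream (SIMT). The PIM recursion would replace the
// simple leapfrog stencil below; nx, ny and c2_dt2_dx2 are assumed inputs.
__global__ void advance_wavefield(int nx, int ny, float c2_dt2_dx2,
                                  const float* __restrict__ u_prev,
                                  const float* __restrict__ u_curr,
                                  float* __restrict__ u_next)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix <= 0 || iy <= 0 || ix >= nx - 1 || iy >= ny - 1) return;

    int id = iy * nx + ix;
    // Standard 5-point Laplacian of the current wavefield.
    float lap = u_curr[id - 1] + u_curr[id + 1]
              + u_curr[id - nx] + u_curr[id + nx] - 4.0f * u_curr[id];
    // Leapfrog time step for the scalar wave equation.
    u_next[id] = 2.0f * u_curr[id] - u_prev[id] + c2_dt2_dx2 * lap;
}
```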
Abstract: This paper describes a parallel fast convolution back-projection algorithm for radar image reconstruction. State-of-the-art general-purpose graphics processing units (GPGPUs) were utilized to accelerate the processing. The implementation achieves much better performance than conventional processing systems, with a speedup of more than 890 times on NVIDIA Tesla C1060 supercomputing cards compared to an Intel P4 2.4 GHz CPU. Images of 256×256 pixels can be reconstructed within 6.3 s, which makes real-time imaging possible. Six platforms were tested and compared. The results show that the GPGPU supercomputing system has great potential for radar image processing.
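The paper's fast convolution back-projection kernel is not reproduced here; the sketch below only illustrates the usual GPU decomposition for back-projection, with one thread per output pixel accumulating contributions over all pulses. Real convolution back-projection interpolates complex samples and applies phase compensation; this real-valued, nearest-neighbor version, along with the data layout and all names, is an assumption made for illustration.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Illustrative back-projection: one thread per image pixel. Each thread
// walks over all pulses, computes the range from the (assumed) platform
// position to its pixel, and accumulates the nearest range-bin sample.
__global__ void backproject(int width, int height, int n_pulses, int n_bins,
                            float pixel_spacing, float range_res,
                            const float* __restrict__ platform_x,     // per pulse
                            const float* __restrict__ platform_y,     // per pulse
                            const float* __restrict__ range_profiles, // n_pulses * n_bins
                            float* __restrict__ image)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    float x = px * pixel_spacing;
    float y = py * pixel_spacing;

    float acc = 0.0f;
    for (int p = 0; p < n_pulses; ++p) {
        float dx = x - platform_x[p];
        float dy = y - platform_y[p];
        float range = sqrtf(dx * dx + dy * dy);
        int bin = (int)(range / range_res + 0.5f);   // nearest range bin
        if (bin < n_bins)
            acc += range_profiles[p * n_bins + bin];
    }
    image[py * width + px] = acc;
}
```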
Abstract: With the latest advances in computing technology, a huge amount of effort has gone into simulating a range of scientific phenomena in engineering fields. One such case is the simulation of heat and mass transfer in capillary porous media, which is becoming increasingly necessary for analyzing a number of eventualities in science and engineering applications. However, the numerical solution of the heat and mass transfer equations for capillary porous media is very time consuming. This paper therefore makes use of one of the acceleration methods developed in the graphics community, the graphics processing unit (GPU), and applies it to the numerical solution of such heat and mass transfer equations. The nVidia Compute Unified Device Architecture (CUDA) programming model offers a suitable approach for applying parallel computing on the graphics processing unit. This paper demonstrates a substantial performance improvement when solving the heat and mass transfer equations for a capillary porous radially composite cylinder with the first kind of boundary conditions. The simulation is carried out using the CUDA platform on an nVidia Quadro FX 4800 graphics card. Our experimental results show a drastic overall performance enhancement when the GPU is used for the heat and mass transfer simulation, with a maximum observed speedup of more than 5-fold. Therefore, the GPU is a good strategy for accelerating heat and mass transfer simulation in porous media.
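The governing equations and discretization are not given in the abstract. As a hedged illustration only, the kernel below advances one explicit finite-difference step of 1D radial heat conduction on a uniform grid, with the first-kind (Dirichlet) boundary values simply left untouched at the two end nodes; the grid, coefficients and names are assumptions, not the paper's scheme.

```cuda
#include <cuda_runtime.h>

// Explicit step for radial conduction dT/dt = alpha * (1/r) d/dr (r dT/dr),
// discretized on a uniform radial grid r_i = i*dr. Dirichlet (first-kind)
// boundaries: T[0] and T[n-1] are prescribed and never updated here.
__global__ void radial_heat_step(int n, float dr, float dt, float alpha,
                                 const float* __restrict__ T_old,
                                 float* __restrict__ T_new)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;          // skip boundary nodes

    float r = i * dr;
    // Conservative form: (1/r) * d/dr (r dT/dr) with face-centered radii.
    float r_plus  = r + 0.5f * dr;
    float r_minus = r - 0.5f * dr;
    float flux = (r_plus  * (T_old[i + 1] - T_old[i]) -
                  r_minus * (T_old[i] - T_old[i - 1])) / (r * dr * dr);
    T_new[i] = T_old[i] + dt * alpha * flux;
}
```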
Abstract: With recent developments in computing technology, increased effort has gone into the simulation of various scientific methods and phenomena in engineering fields. One such case is the simulation of heat and mass transfer in capillary porous media, which is becoming more and more important in analysing various scenarios in engineering applications. Analysing such heat and mass transfer phenomena in a given environment requires simulating the coupled heat and mass transfer equations. However, this numerical solution process is very time consuming. This paper therefore utilizes one of the acceleration techniques developed in the graphics community, the graphics processing unit (GPU), and applies it to the numerical solution of the heat and mass transfer equations. The nVidia Compute Unified Device Architecture (CUDA) programming model provides a good method for applying parallel computing to program the graphics processing unit. This paper shows a good performance improvement when numerically solving the heat and mass transfer equations for a capillary porous composite cylinder with the second kind of boundary conditions on the GPU. The simulation is implemented using the CUDA platform on an nVidia Quadro FX 4800 graphics card. Our experimental results depict a drastic performance improvement when the GPU is used for the heat and mass transfer simulation, with a maximum observed speedup of more than 7-fold. Therefore, the GPU is a good approach for accelerating the heat and mass transfer simulation.
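Relative to the first-kind case sketched above, the main structural change for second-kind (Neumann) boundary conditions is how the end nodes are updated. A minimal assumed treatment, not taken from the paper, derives a ghost-node value from the prescribed flux q at the outer surface and reuses the interior update there.

```cuda
#include <cuda_runtime.h>

// Hypothetical Neumann (second-kind) update for the outer radial node: the
// prescribed flux gives a ghost value T_ghost = T[n-2] + 2*dr*(q/k), which
// is then used in the same explicit update as the interior nodes.
__global__ void neumann_boundary_step(int n, float dr, float dt, float alpha,
                                      float q_over_k,
                                      const float* __restrict__ T_old,
                                      float* __restrict__ T_new)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;   // single boundary node

    int i = n - 1;
    float r = i * dr;
    float T_ghost = T_old[i - 1] + 2.0f * dr * q_over_k;  // mirror node from flux
    float r_plus  = r + 0.5f * dr;
    float r_minus = r - 0.5f * dr;
    float flux = (r_plus  * (T_ghost - T_old[i]) -
                  r_minus * (T_old[i] - T_old[i - 1])) / (r * dr * dr);
    T_new[i] = T_old[i] + dt * alpha * flux;
}
```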
Funding: Project supported by the National Natural Science Foundation of China (No. 61002009), the Science and Technology Planning Project of Zhejiang Province (No. 2010C31018), and the Scientific Research Fund of Hangzhou Normal University (No. HSKQ0042), China.
Abstract: Inverse distance weighting (IDW) interpolation and viewshed are two popular algorithms for geospatial analysis. IDW interpolation assigns geographical values to unknown spatial points using values from a usually scattered set of known points, and viewshed identifies the cells in a spatial raster that can be seen by observers. Although implementations of both algorithms are available for different scales of input data, computation over a large-scale domain requires a massive number of cycles, which limits their usage. Due to the growing popularity of the graphics processing unit (GPU) for general-purpose applications, we aim to accelerate geospatial analysis via a GPU-based parallel computing approach. In this paper, we propose a generic methodological framework for geospatial analysis based on the GPU and its programming model, Compute Unified Device Architecture (CUDA), and explore how to map the inherent parallelism of IDW interpolation and viewshed onto the framework, which gives rise to high computational throughput. The CUDA-based implementations of IDW interpolation and viewshed indicate that the GPU architecture is suitable for parallelizing the algorithms of geospatial analysis. Experimental results show that the CUDA-based implementations running on the GPU achieve dataset-dependent speedups in the range of 13-33-fold for IDW interpolation and 28-925-fold for viewshed analysis. Their computation time can be reduced by an order of magnitude compared to classical sequential versions, without losing the accuracy of interpolation and visibility judgment.
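IDW maps naturally to one CUDA thread per unknown point, each looping over the known points. The minimal sketch below is a brute-force illustration under that mapping; the power parameter, coordinate layout and kernel name are assumptions, since the paper's implementation details are not given in the abstract.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// One thread per unknown point: brute-force inverse distance weighting
// z = sum(w_j * z_j) / sum(w_j) with w_j = 1 / d_j^p over all known points.
__global__ void idw_interpolate(int n_unknown, int n_known, float power,
                                const float* __restrict__ ux, const float* __restrict__ uy,
                                const float* __restrict__ kx, const float* __restrict__ ky,
                                const float* __restrict__ kz,
                                float* __restrict__ uz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_unknown) return;

    float num = 0.0f, den = 0.0f;
    for (int j = 0; j < n_known; ++j) {
        float dx = ux[i] - kx[j];
        float dy = uy[i] - ky[j];
        float d2 = dx * dx + dy * dy;
        if (d2 < 1e-12f) {            // unknown point coincides with a known point
            num = kz[j];
            den = 1.0f;
            break;
        }
        float w = 1.0f / powf(sqrtf(d2), power);
        num += w * kz[j];
        den += w;
    }
    uz[i] = num / den;
}
```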
Funding: Supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. 21225628, the Science Fund for Creative Research Groups of the National Natural Science Foundation of China under Grant No. 20821092, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA07080100, and the National Natural Science Foundation of China under Grant No. 21206167.
Abstract: A multi-scale hardware and software architecture implementing the EMMS (energy-minimization multi-scale) paradigm is proven to be effective in the simulation of a two-dimensional gas-solid suspension. General-purpose CPUs are employed for macro-scale control and optimization, and many integrated cores (MICs) operating in multiple-instruction, multiple-data mode are used for a molecular dynamics simulation of the solid particles at the meso-scale. Many-core devices operating in single-instruction, multiple-data mode, such as general-purpose graphics processing units (GPGPUs), are employed for direct numerical simulation of the fluid flow at the micro-scale using the lattice Boltzmann method. This architecture is also expected to be efficient for the multi-scale simulation of other complex systems.
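The micro-scale lattice Boltzmann component lends itself to one CUDA thread per lattice node. As a hedged illustration (the D2Q9 lattice, BGK single-relaxation collision and structure-of-arrays layout are all assumptions, not the paper's code), a collision step might look like the sketch below.

```cuda
#include <cuda_runtime.h>

// D2Q9 BGK collision: one thread per lattice node, 9 distributions per node
// stored as f[q * n_nodes + node]. tau is the relaxation time.
__constant__ float w9[9]  = { 4.f/9,  1.f/9,  1.f/9,  1.f/9, 1.f/9,
                              1.f/36, 1.f/36, 1.f/36, 1.f/36 };
__constant__ int   cx9[9] = { 0, 1, 0, -1,  0, 1, -1, -1,  1 };
__constant__ int   cy9[9] = { 0, 0, 1,  0, -1, 1,  1, -1, -1 };

__global__ void lbm_collide(int n_nodes, float tau, float* f)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= n_nodes) return;

    // Macroscopic density and velocity from the distributions.
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int q = 0; q < 9; ++q) {
        float fq = f[q * n_nodes + node];
        rho += fq;
        ux  += fq * cx9[q];
        uy  += fq * cy9[q];
    }
    ux /= rho;
    uy /= rho;

    // BGK relaxation towards the local equilibrium distribution.
    float usq = ux * ux + uy * uy;
    for (int q = 0; q < 9; ++q) {
        float cu  = cx9[q] * ux + cy9[q] * uy;
        float feq = w9[q] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        f[q * n_nodes + node] -= (f[q * n_nodes + node] - feq) / tau;
    }
}
```

A streaming step and boundary handling, omitted here, would complete one lattice Boltzmann time step.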