A multi-GPU system designed for high-speed,real-time signal processing of optical coherencetomography(OCT)is described herein.For the OCT data sampled in linear wave numbers,themaximum procesing rates reached 2.95 MHz...A multi-GPU system designed for high-speed,real-time signal processing of optical coherencetomography(OCT)is described herein.For the OCT data sampled in linear wave numbers,themaximum procesing rates reached 2.95 MHz for 1024-OCT and 1.96 MHz for 2048-OCT.Data sampled using linear wavelengths were re-sampled using a time-domain interpolation method and zero-padding interpolation method to improve image quality.The maximum processing rates for1024-OCT reached 2.16 MHz for the time-domain method and 1.26 MHz for the zero-paddingmethod.The maximum processing rates for 2048-0CT reached_1.58 MHz,and 0.68 MHz,respectively.This method is capable of high-speed,real-time processing for O CT systems.展开更多
In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularl...In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP’s effectiveness in accelerating image manipulation tasks.展开更多
Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior.These methods are well suited to parallel implementation,particularly on the sin...Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior.These methods are well suited to parallel implementation,particularly on the single-instruction multiple data(SIMD)parallel processing environments found in computer graphics processing units(GPUs).Although recent programming tools dramatically improve the ease with which GPUbased applications can be written,the programming environment still lacks the flexibility available to more traditional CPU programs.In particular,it may be difficult to develop modular and extensible programs that require variable on-device functionality with current GPU architectures.This paper describes a process of automatic code generation that overcomes these difficulties for lattice-Boltzmann simulations.It details the development of GPU-based modules for an extensible lattice-Boltzmann simulation package-LBHydra.The performance of the automatically generated code is compared to equivalent purpose written codes for both single-phase,multiphase,and multicomponent flows.The flexibility of the new method is demonstrated by simulating a rising,dissolving droplet moving through a porous medium with user generated lattice-Boltzmann models and subroutines.展开更多
Systems containing multiple graphics-processing-unit(GPU)clusters are difficult to use for real-time electroholography when using only a single spatial light modulator because the transfer of the computer-generated ho...Systems containing multiple graphics-processing-unit(GPU)clusters are difficult to use for real-time electroholography when using only a single spatial light modulator because the transfer of the computer-generated hologram data between the GPUs is bottlenecked.To overcome this bottleneck,we propose a rapid GPU packing scheme that significantly reduces the volume of the required data transfer.The proposed method uses a multi-GPU cluster system connected with a cost-effective gigabit Ethernet network.In tests,we achieved real-time electroholography of a three-dimensional(3D)video presenting a point-cloud 3D object made up of approximately 200,000 points.展开更多
Graphics processing units(GPUs)employ the single instruction multiple data(SIMD)hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow.Threads running concurrently within a war...Graphics processing units(GPUs)employ the single instruction multiple data(SIMD)hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow.Threads running concurrently within a warp may jump to different paths after conditional branches.Such divergent control flow makes some lanes idle and hence reduces the SIMD utilization of GPUs.To alleviate the waste of SIMD lanes,threads from multiple warps can be collected together to improve the SIMD lane utilization by compacting threads into idle lanes.However,this mechanism induces extra barrier synchronizations since warps have to be stalled to wait for other warps for compactions,resulting in that no warps are scheduled in some cases.In this paper,we propose an approach to reduce the overhead of barrier synchronizat ions induced by compactions,In our approach,a compaction is bypassed by warps whose threads all jump to the same path after branches.Moreover,warps waiting for a compaction can also bypass this compaction when no warps are ready for issuing.In addition,a compaction is canceled if idle lanes can not be reduced via this compaction.The experimental results demonstrate that our approach provides an average improvement of 21%over the baseline GPU for applications with massive divergent branches,while recovering the performance loss induced by compactions by 13%on average for applications with many non-divergent control flows.展开更多
Porous materials present significant advantages for absorbing radioactive isotopes in nuclear waste streams.To improve absorption efficiency in nuclear waste treatment,a thorough understanding of the diffusion-advecti...Porous materials present significant advantages for absorbing radioactive isotopes in nuclear waste streams.To improve absorption efficiency in nuclear waste treatment,a thorough understanding of the diffusion-advection process within porous structures is essential for material design.In this study,we present advancements in the volumetric lattice Boltzmann method(VLBM)for modeling and simulating pore-scale diffusion-advection of radioactive isotopes within geopolymer porous structures.These structures are created using the phase field method(PFM)to precisely control pore architectures.In our VLBM approach,we introduce a concentration field of an isotope seamlessly coupled with the velocity field and solve it by the time evolution of its particle population function.To address the computational intensity inherent in the coupled lattice Boltzmann equations for velocity and concentration fields,we implement graphics processing unit(GPU)parallelization.Validation of the developed model involves examining the flow and diffusion fields in porous structures.Remarkably,good agreement is observed for both the velocity field from VLBM and multiphysics object-oriented simulation environment(MOOSE),and the concentration field from VLBM and the finite difference method(FDM).Furthermore,we investigate the effects of background flow,species diffusivity,and porosity on the diffusion-advection behavior by varying the background flow velocity,diffusion coefficient,and pore volume fraction,respectively.Notably,all three parameters exert an influence on the diffusion-advection process.Increased background flow and diffusivity markedly accelerate the process due to increased advection intensity and enhanced diffusion capability,respectively.Conversely,increasing the porosity has a less significant effect,causing a slight slowdown of the diffusion-advection process due to the expanded pore volume.This comprehensive parametric study provides valuable insights into the kinetics of isotope uptake in porous structures,facilitating the development of porous materials for nuclear waste treatment applications.展开更多
Aiming to solve the bottleneck problem of electromagnetic scattering simulation in the scenes of extremely large-scale seas and ships,a high-frequency method by using graphics processing unit(GPU)parallel acceleration...Aiming to solve the bottleneck problem of electromagnetic scattering simulation in the scenes of extremely large-scale seas and ships,a high-frequency method by using graphics processing unit(GPU)parallel acceleration technique is proposed.For the implementation of different electromagnetic methods of physical optics(PO),shooting and bouncing ray(SBR),and physical theory of diffraction(PTD),a parallel computing scheme based on the CPU-GPU parallel computing scheme is realized to balance computing tasks.Finally,a multi-GPU framework is further proposed to solve the computational difficulty caused by the massive number of ray tubes in the ray tracing process.By using the established simulation platform,signals of ships at different seas are simulated and their images are achieved as well.It is shown that the higher sea states degrade the averaged peak signal-to-noise ratio(PSNR)of radar image.展开更多
General purpose graphic processing unit (GPU) calculation technology is gradually widely used in various fields. Its mode of single instruction, multiple threads is capable of seismic numerical simulation which has ...General purpose graphic processing unit (GPU) calculation technology is gradually widely used in various fields. Its mode of single instruction, multiple threads is capable of seismic numerical simulation which has a huge quantity of data and calculation steps. In this study, we introduce a GPU-based parallel calculation method of a precise integration method (PIM) for seismic forward modeling. Compared with CPU single-core calculation, GPU parallel calculating perfectly keeps the features of PIM, which has small bandwidth, high accuracy and capability of modeling complex substructures, and GPU calculation brings high computational efficiency, which means that high-performing GPU parallel calculation can make seismic forward modeling closer to real seismic records.展开更多
The signal processing speed of spectral domain optical coherence tomography(SD-OCT)has become a bottleneck in a lot of medical applications.Recently,a time-domain interpolation method was proposed.This method can get ...The signal processing speed of spectral domain optical coherence tomography(SD-OCT)has become a bottleneck in a lot of medical applications.Recently,a time-domain interpolation method was proposed.This method can get better signal-to-noise ratio(SNR)but much-reduced signal processing time in SD-OCT data processing as compared with the commonly used zeropadding interpolation method.Additionally,the resampled data can be obtained by a few data and coefficients in the cutoff window.Thus,a lot of interpolations can be performed simultaneously.So,this interpolation method is suitable for parallel computing.By using graphics processing unit(GPU)and the compute unified device architecture(CUDA)program model,time-domain interpolation can be accelerated significantly.The computing capability can be achieved more than 250,000 A-lines,200,000 A-lines,and 160,000 A-lines in a second for 2,048 pixel OCT when the cutoff length is L=11,L=21,and L=31,respectively.A frame SD-OCT data(400A-lines×2,048 pixel per line)is acquired and processed on GPU in real time.The results show that signal processing time of SD-OCT can befinished in 6.223 ms when the cutoff length L=21,which is much faster than that on central processing unit(CPU).Real-time signal processing of acquired data can be realized.展开更多
Organic reefs, the targets of deep-water petro- leum exploration, developed widely in Xisha area. However, there are concealed igneous rocks undersea, to which organic rocks have nearly equal wave impedance. So the ig...Organic reefs, the targets of deep-water petro- leum exploration, developed widely in Xisha area. However, there are concealed igneous rocks undersea, to which organic rocks have nearly equal wave impedance. So the igneous rocks have become interference for future explo- ration by having similar seismic reflection characteristics. Yet, the density and magnetism of organic reefs are very different from igneous rocks. It has obvious advantages to identify organic reefs and igneous rocks by gravity and magnetic data. At first, frequency decomposition was applied to the free-air gravity anomaly in Xisha area to obtain the 2D subdivision of the gravity anomaly and magnetic anomaly in the vertical direction. Thus, the dis- tribution of igneous rocks in the horizontal direction can be acquired according to high-frequency field, low-frequency field, and its physical properties. Then, 3D forward model- ing of gravitational field was carried out to establish the density model of this area by reference to physical properties of rocks based on former researches. Furthermore, 3D inversion of gravity anomaly by genetic algorithm method of the graphic processing unit (GPU) parallel processing in Xisha target area was applied, and 3D density structure of this area was obtained. By this way, we can confine the igneous rocks to the certain depth according to the density of the igneous rocks. The frequency decomposition and 3D inversion of gravity anomaly by genetic algorithm method of the GPU parallel processing proved to be a useful method for recognizing igneous rocks to its 3D geological position. So organic reefs and igneous rocks can be identified, which provide a prescient information for further exploration.展开更多
Fluid-structure interaction (FSI) problems in microchannels play a prominent role in many engineering applications. The present study is an effort toward the simulation of flow in microchannel considering FSI. The b...Fluid-structure interaction (FSI) problems in microchannels play a prominent role in many engineering applications. The present study is an effort toward the simulation of flow in microchannel considering FSI. The bottom boundary of the microchannel is simulated by size-dependent beam elements for the finite element method (FEM) based on a modified cou- ple stress theory. The lattice Boltzmann method (LBM) using the D2Q13 LB model is coupled to the FEM in order to solve the fluid part of the FSI problem. Because of the fact that the LBM generally needs only nearest neighbor information, the algorithm is an ideal candidate for parallel computing. The simulations are carried out on graphics processing units (GPUs) using computed unified device architecture (CUDA). In the present study, the governing equations are non-dimensionalized and the set of dimensionless groups is exhibited to show their effects on micro-beam displacement. The numerical results show that the displacements of the micro-beam predicted by the size-dependent beam element are smaller than those by the classical beam element.展开更多
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine velocity components (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simul...Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine velocity components (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with the Reynolds numbers up to 1 × 10^7. To improve the computation efficiency of LBM on the numerical simulations of turbulent flows, the massively parallel computing power from a graphic processing unit (GPU) with a computing unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well, compared with the results from others, with an increase of 76 times in computation efficiency. It appears that the higher the Reynolds numbers is, the smaller the Smagorinsky constant should be, if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.展开更多
Personal desktop platform with teraflops peak performance of thousands of cores is realized at the price of conventional workstations using the programmable graphics processing units(GPUs).A GPU-based parallel Euler/N...Personal desktop platform with teraflops peak performance of thousands of cores is realized at the price of conventional workstations using the programmable graphics processing units(GPUs).A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows by using NVIDIA′s Compute Unified Device Architecture(CUDA)programming model in CUDA Fortran programming language.The techniques of implementation of CUDA kernels,double-layered thread hierarchy and variety memory hierarchy are presented to form the GPU-based algorithm of Euler/Navier-Stokes equations.The resulting parallel solver is validated by a set of typical test flow cases.The numerical results show that dozens of times speedup relative to a serial CPU implementation can be achieved using a single GPU desktop platform,which demonstrates that a GPU desktop can serve as a costeffective parallel computing platform to accelerate computational fluid dynamics(CFD)simulations substantially.展开更多
A graphic processing unit (GPU)-accelerated biological species recognition method using partially connected neural evolutionary network model is introduced in this paper. The partial connected neural evolutionary netw...A graphic processing unit (GPU)-accelerated biological species recognition method using partially connected neural evolutionary network model is introduced in this paper. The partial connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural network with small inputs. The whole image is considered as the input of the neural network, so the maximal features can be kept for recognition. To speed up the recognition process of the neural network, a fast implementation of the partially connected neural network was conducted on NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and counterpart serial CPU implementation, and experiment results showed GPU implementation works effectively on both recognition rate and speed, and gained 343 speedup over its counterpart CPU implementation. Comparing to feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when testing on eight biological species.展开更多
Graphic processing units (GPUs) have been widely recognized as cost-efficient co-processors with acceptable size, weight, and power consumption. However, adopting GPUs in real-time systems is still challenging, due ...Graphic processing units (GPUs) have been widely recognized as cost-efficient co-processors with acceptable size, weight, and power consumption. However, adopting GPUs in real-time systems is still challenging, due to the lack in framework for real-time analysis. In order to guarantee real-time requirements while maintaining system utilization ~in modern heterogeneous systems, such as multicore multi-GPU systems, a novel suspension-based k-exclusion real-time locking protocol and the associated suspension-aware schedulability analysis are proposed. The proposed protocol provides a synchronization framework that enables multiple GPUs to be efficiently integrated in multicore real-time systems. Comparative evaluations show that the proposed methods improve upon the existing work in terms of schedulability.展开更多
Three-dimensional(3D)image reconstruction involves the computations of an extensive amount of data that leads to tremendous processing time.Therefore,optimization is crucially needed to improve the performance and eff...Three-dimensional(3D)image reconstruction involves the computations of an extensive amount of data that leads to tremendous processing time.Therefore,optimization is crucially needed to improve the performance and efficiency.With the widespread use of graphics processing units(GPU),parallel computing is transforming this arduous reconstruction process for numerous imaging modalities,and photoacoustic computed tomography(PACT)is not an exception.Existing works have investigated GPU-based optimization on photoacoustic microscopy(PAM)and PACT reconstruction using compute unified device architecture(CUDA)on either C++or MATLAB only.However,our study is the first that uses cross-platform GPU computation.It maintains the simplicity of MATLAB,while improves the speed through CUDA/C++−based MATLAB converted functions called MEXCUDA.Compared to a purely MATLAB with GPU approach,our cross-platform method improves the speed five times.Because MATLAB is widely used in PAM and PACT,this study will open up new avenues for photoacoustic image reconstruction and relevant real-time imaging applications.展开更多
Recently,analog visual transmission has attracted considerable attention owing to its graceful performance degradation for various wireless channels.In this study,we propose a novel analog visual communications system...Recently,analog visual transmission has attracted considerable attention owing to its graceful performance degradation for various wireless channels.In this study,we propose a novel analog visual communications system,named DVCast,in which an image denoising algorithm is used to fully utilize spatial correlation;moreover,the variable block size Discrete Cosine Transform(DCT)is used to preserve more correlation information in an image.Obviously,there is a tradeoff between system performance and computing complexity.Therefore,to improve the real-time performance of the proposed system,implementation of Block Matching with 3D filtering(BM3D)and DCT by Graphics Processing Units(GPUs)is introduced.According to DCT block size,i.e.,88,1616,and 3232,the schemes DVCast8,DVCast16,and DVCast32,respectively,are designed and implemented.Simulations show that DVCast with larger block size achieves better gain and visual quality than reference schemes.Moreover,it requires less computing time.DVCast32 outperforms conventional digital schemes by approximately 3.51 dB and achieves a 1.12 dB gain over state-of-the-art reference schemes.Furthermore,the analysis shows that DVCast can reduce overhead by at least 75%.展开更多
A non-photorealistic rendering technique is a method to show various effects different from those of realistic image generation.Of the various techniques,flow-based image abstraction displays the shape and color featu...A non-photorealistic rendering technique is a method to show various effects different from those of realistic image generation.Of the various techniques,flow-based image abstraction displays the shape and color features well and performs a stylistic visual abstraction.But real-time rendering is impossible when CPU is used because it applies various filtering and iteration methods.In this paper,we present real-time processing methods of video abstraction using open open computing language(OpenCL),technique of general-purpose computing on graphics processing units(GPGPU).Through the acceleration of general-purpose computing(GPU),16 frame-per-second(FPS)or greater is shown to process video abstraction.展开更多
基金the support from the union project of Peking University third hospital&Chinese Academy of Sciences(Grant No.7490-04,Grant No.KJZD-EW-TZ-L03)the Sichuan Youth Science&Technology Foundation(Grant No.13QNJJ0034)+1 种基金the West Light Foundation of the Chinese Academy of Sciences,the National Major Scientific Equipment program(Grant No.2012YQ120080)the National Science Foundation of China(Grant No.6118082).
文摘A multi-GPU system designed for high-speed,real-time signal processing of optical coherencetomography(OCT)is described herein.For the OCT data sampled in linear wave numbers,themaximum procesing rates reached 2.95 MHz for 1024-OCT and 1.96 MHz for 2048-OCT.Data sampled using linear wavelengths were re-sampled using a time-domain interpolation method and zero-padding interpolation method to improve image quality.The maximum processing rates for1024-OCT reached 2.16 MHz for the time-domain method and 1.26 MHz for the zero-paddingmethod.The maximum processing rates for 2048-0CT reached_1.58 MHz,and 0.68 MHz,respectively.This method is capable of high-speed,real-time processing for O CT systems.
文摘In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP’s effectiveness in accelerating image manipulation tasks.
基金performed under the auspices of the U.S.Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344support by the National Science Foundation(NSF)under Grant No.DMS-0724560,Grant No.EAR-0838541,and Grant No.EAR-0941666the Department of Energy(DOE)under Grant No.DE-EE0002764.
文摘Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior.These methods are well suited to parallel implementation,particularly on the single-instruction multiple data(SIMD)parallel processing environments found in computer graphics processing units(GPUs).Although recent programming tools dramatically improve the ease with which GPUbased applications can be written,the programming environment still lacks the flexibility available to more traditional CPU programs.In particular,it may be difficult to develop modular and extensible programs that require variable on-device functionality with current GPU architectures.This paper describes a process of automatic code generation that overcomes these difficulties for lattice-Boltzmann simulations.It details the development of GPU-based modules for an extensible lattice-Boltzmann simulation package-LBHydra.The performance of the automatically generated code is compared to equivalent purpose written codes for both single-phase,multiphase,and multicomponent flows.The flexibility of the new method is demonstrated by simulating a rising,dissolving droplet moving through a porous medium with user generated lattice-Boltzmann models and subroutines.
基金supported by the Japan Society for the Promotion of Science(JSPS)KAKENHI(Nos.18K11399 and 19H01097)the Telecommunications Advancement Foundation.
文摘Systems containing multiple graphics-processing-unit(GPU)clusters are difficult to use for real-time electroholography when using only a single spatial light modulator because the transfer of the computer-generated hologram data between the GPUs is bottlenecked.To overcome this bottleneck,we propose a rapid GPU packing scheme that significantly reduces the volume of the required data transfer.The proposed method uses a multi-GPU cluster system connected with a cost-effective gigabit Ethernet network.In tests,we achieved real-time electroholography of a three-dimensional(3D)video presenting a point-cloud 3D object made up of approximately 200,000 points.
基金the National Natural Science Foundation of China(No.61702521)the Natural Science Foundation of Tianjin(No.18JCQNJC00400)+1 种基金the Scientific Research Foundation of Civil Aviation University of China(No.2017QD12S)the Fundamental Research Funds for the Central Universities of Civil Aviation University of China(Nos.3122018C023 and 3122018C021)。
文摘Graphics processing units(GPUs)employ the single instruction multiple data(SIMD)hardware to run threads in parallel and allow each thread to maintain an arbitrary control flow.Threads running concurrently within a warp may jump to different paths after conditional branches.Such divergent control flow makes some lanes idle and hence reduces the SIMD utilization of GPUs.To alleviate the waste of SIMD lanes,threads from multiple warps can be collected together to improve the SIMD lane utilization by compacting threads into idle lanes.However,this mechanism induces extra barrier synchronizations since warps have to be stalled to wait for other warps for compactions,resulting in that no warps are scheduled in some cases.In this paper,we propose an approach to reduce the overhead of barrier synchronizat ions induced by compactions,In our approach,a compaction is bypassed by warps whose threads all jump to the same path after branches.Moreover,warps waiting for a compaction can also bypass this compaction when no warps are ready for issuing.In addition,a compaction is canceled if idle lanes can not be reduced via this compaction.The experimental results demonstrate that our approach provides an average improvement of 21%over the baseline GPU for applications with massive divergent branches,while recovering the performance loss induced by compactions by 13%on average for applications with many non-divergent control flows.
基金supported as part of the Center for Hierarchical Waste Form Materials,an Energy Frontier Research Center funded by the U.S.Department of Energy,Office of Science,Basic Energy Sciences under Award No.DE-SC0016574.
文摘Porous materials present significant advantages for absorbing radioactive isotopes in nuclear waste streams.To improve absorption efficiency in nuclear waste treatment,a thorough understanding of the diffusion-advection process within porous structures is essential for material design.In this study,we present advancements in the volumetric lattice Boltzmann method(VLBM)for modeling and simulating pore-scale diffusion-advection of radioactive isotopes within geopolymer porous structures.These structures are created using the phase field method(PFM)to precisely control pore architectures.In our VLBM approach,we introduce a concentration field of an isotope seamlessly coupled with the velocity field and solve it by the time evolution of its particle population function.To address the computational intensity inherent in the coupled lattice Boltzmann equations for velocity and concentration fields,we implement graphics processing unit(GPU)parallelization.Validation of the developed model involves examining the flow and diffusion fields in porous structures.Remarkably,good agreement is observed for both the velocity field from VLBM and multiphysics object-oriented simulation environment(MOOSE),and the concentration field from VLBM and the finite difference method(FDM).Furthermore,we investigate the effects of background flow,species diffusivity,and porosity on the diffusion-advection behavior by varying the background flow velocity,diffusion coefficient,and pore volume fraction,respectively.Notably,all three parameters exert an influence on the diffusion-advection process.Increased background flow and diffusivity markedly accelerate the process due to increased advection intensity and enhanced diffusion capability,respectively.Conversely,increasing the porosity has a less significant effect,causing a slight slowdown of the diffusion-advection process due to the expanded pore volume.This comprehensive parametric study provides valuable insights into the kinetics of isotope uptake in porous structures,facilitating the development of porous materials for nuclear waste treatment applications.
基金supported by the Opening Foundation of the Agile and Intelligence Computing Key Laboratory of Sichuan Province under Grant No.H23004the Chengdu Municipal Science and Technology Bureau Technological Innovation R&D Project(Key Project)under Grant No.2024-YF08-00106-GX.
文摘Aiming to solve the bottleneck problem of electromagnetic scattering simulation in the scenes of extremely large-scale seas and ships,a high-frequency method by using graphics processing unit(GPU)parallel acceleration technique is proposed.For the implementation of different electromagnetic methods of physical optics(PO),shooting and bouncing ray(SBR),and physical theory of diffraction(PTD),a parallel computing scheme based on the CPU-GPU parallel computing scheme is realized to balance computing tasks.Finally,a multi-GPU framework is further proposed to solve the computational difficulty caused by the massive number of ray tubes in the ray tracing process.By using the established simulation platform,signals of ships at different seas are simulated and their images are achieved as well.It is shown that the higher sea states degrade the averaged peak signal-to-noise ratio(PSNR)of radar image.
基金supported by the National Natural Science Foundation of China (Nos 40974066 and 40821062)National Basic Research Program of China (No 2007CB209602)
文摘General purpose graphic processing unit (GPU) calculation technology is gradually widely used in various fields. Its mode of single instruction, multiple threads is capable of seismic numerical simulation which has a huge quantity of data and calculation steps. In this study, we introduce a GPU-based parallel calculation method of a precise integration method (PIM) for seismic forward modeling. Compared with CPU single-core calculation, GPU parallel calculating perfectly keeps the features of PIM, which has small bandwidth, high accuracy and capability of modeling complex substructures, and GPU calculation brings high computational efficiency, which means that high-performing GPU parallel calculation can make seismic forward modeling closer to real seismic records.
基金supported by National High Technology R&D project of China(2008AA02Z422)The Instrument Developing Project of The Chinese Academy of Sciences,Institute of Optics and Electronic,Chinese Academy of Sciences.
文摘The signal processing speed of spectral domain optical coherence tomography(SD-OCT)has become a bottleneck in a lot of medical applications.Recently,a time-domain interpolation method was proposed.This method can get better signal-to-noise ratio(SNR)but much-reduced signal processing time in SD-OCT data processing as compared with the commonly used zeropadding interpolation method.Additionally,the resampled data can be obtained by a few data and coefficients in the cutoff window.Thus,a lot of interpolations can be performed simultaneously.So,this interpolation method is suitable for parallel computing.By using graphics processing unit(GPU)and the compute unified device architecture(CUDA)program model,time-domain interpolation can be accelerated significantly.The computing capability can be achieved more than 250,000 A-lines,200,000 A-lines,and 160,000 A-lines in a second for 2,048 pixel OCT when the cutoff length is L=11,L=21,and L=31,respectively.A frame SD-OCT data(400A-lines×2,048 pixel per line)is acquired and processed on GPU in real time.The results show that signal processing time of SD-OCT can befinished in 6.223 ms when the cutoff length L=21,which is much faster than that on central processing unit(CPU).Real-time signal processing of acquired data can be realized.
基金financially supported by the National Natural Science Foundation of China (No.41174085)
文摘Organic reefs, the targets of deep-water petro- leum exploration, developed widely in Xisha area. However, there are concealed igneous rocks undersea, to which organic rocks have nearly equal wave impedance. So the igneous rocks have become interference for future explo- ration by having similar seismic reflection characteristics. Yet, the density and magnetism of organic reefs are very different from igneous rocks. It has obvious advantages to identify organic reefs and igneous rocks by gravity and magnetic data. At first, frequency decomposition was applied to the free-air gravity anomaly in Xisha area to obtain the 2D subdivision of the gravity anomaly and magnetic anomaly in the vertical direction. Thus, the dis- tribution of igneous rocks in the horizontal direction can be acquired according to high-frequency field, low-frequency field, and its physical properties. Then, 3D forward model- ing of gravitational field was carried out to establish the density model of this area by reference to physical properties of rocks based on former researches. Furthermore, 3D inversion of gravity anomaly by genetic algorithm method of the graphic processing unit (GPU) parallel processing in Xisha target area was applied, and 3D density structure of this area was obtained. By this way, we can confine the igneous rocks to the certain depth according to the density of the igneous rocks. The frequency decomposition and 3D inversion of gravity anomaly by genetic algorithm method of the GPU parallel processing proved to be a useful method for recognizing igneous rocks to its 3D geological position. So organic reefs and igneous rocks can be identified, which provide a prescient information for further exploration.
文摘Fluid-structure interaction (FSI) problems in microchannels play a prominent role in many engineering applications. The present study is an effort toward the simulation of flow in microchannel considering FSI. The bottom boundary of the microchannel is simulated by size-dependent beam elements for the finite element method (FEM) based on a modified cou- ple stress theory. The lattice Boltzmann method (LBM) using the D2Q13 LB model is coupled to the FEM in order to solve the fluid part of the FSI problem. Because of the fact that the LBM generally needs only nearest neighbor information, the algorithm is an ideal candidate for parallel computing. The simulations are carried out on graphics processing units (GPUs) using computed unified device architecture (CUDA). In the present study, the governing equations are non-dimensionalized and the set of dimensionless groups is exhibited to show their effects on micro-beam displacement. The numerical results show that the displacements of the micro-beam predicted by the size-dependent beam element are smaller than those by the classical beam element.
基金supported by College of William and Mary,Virginia Institute of Marine Science for the study environment
文摘Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine velocity components (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with the Reynolds numbers up to 1 × 10^7. To improve the computation efficiency of LBM on the numerical simulations of turbulent flows, the massively parallel computing power from a graphic processing unit (GPU) with a computing unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well, compared with the results from others, with an increase of 76 times in computation efficiency. It appears that the higher the Reynolds numbers is, the smaller the Smagorinsky constant should be, if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
基金supported by the National Natural Science Foundation of China (No.11172134)the Funding of Jiangsu Innovation Program for Graduate Education (No.CXLX13_132)
文摘Personal desktop platform with teraflops peak performance of thousands of cores is realized at the price of conventional workstations using the programmable graphics processing units(GPUs).A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows by using NVIDIA′s Compute Unified Device Architecture(CUDA)programming model in CUDA Fortran programming language.The techniques of implementation of CUDA kernels,double-layered thread hierarchy and variety memory hierarchy are presented to form the GPU-based algorithm of Euler/Navier-Stokes equations.The resulting parallel solver is validated by a set of typical test flow cases.The numerical results show that dozens of times speedup relative to a serial CPU implementation can be achieved using a single GPU desktop platform,which demonstrates that a GPU desktop can serve as a costeffective parallel computing platform to accelerate computational fluid dynamics(CFD)simulations substantially.
基金National Natural Science Foundation of China (No. 60975084)Natural Science Foundation of Fujian Province,China (No.2011J05159)
文摘A graphic processing unit (GPU)-accelerated biological species recognition method using partially connected neural evolutionary network model is introduced in this paper. The partial connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural network with small inputs. The whole image is considered as the input of the neural network, so the maximal features can be kept for recognition. To speed up the recognition process of the neural network, a fast implementation of the partially connected neural network was conducted on NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and counterpart serial CPU implementation, and experiment results showed GPU implementation works effectively on both recognition rate and speed, and gained 343 speedup over its counterpart CPU implementation. Comparing to feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when testing on eight biological species.
基金supported by the National Natural Science Foundation of China under Grant No.61003032/F020207
文摘Graphic processing units (GPUs) have been widely recognized as cost-efficient co-processors with acceptable size, weight, and power consumption. However, adopting GPUs in real-time systems is still challenging, due to the lack in framework for real-time analysis. In order to guarantee real-time requirements while maintaining system utilization ~in modern heterogeneous systems, such as multicore multi-GPU systems, a novel suspension-based k-exclusion real-time locking protocol and the associated suspension-aware schedulability analysis are proposed. The proposed protocol provides a synchronization framework that enables multiple GPUs to be efficiently integrated in multicore real-time systems. Comparative evaluations show that the proposed methods improve upon the existing work in terms of schedulability.
基金supported in part by the Career Catalyst Research Grant from the Susan G.Komen Foundationthe Clinical and Translational Science Pilot Study Award from the National Institutes of Health.
文摘Three-dimensional(3D)image reconstruction involves the computations of an extensive amount of data that leads to tremendous processing time.Therefore,optimization is crucially needed to improve the performance and efficiency.With the widespread use of graphics processing units(GPU),parallel computing is transforming this arduous reconstruction process for numerous imaging modalities,and photoacoustic computed tomography(PACT)is not an exception.Existing works have investigated GPU-based optimization on photoacoustic microscopy(PAM)and PACT reconstruction using compute unified device architecture(CUDA)on either C++or MATLAB only.However,our study is the first that uses cross-platform GPU computation.It maintains the simplicity of MATLAB,while improves the speed through CUDA/C++−based MATLAB converted functions called MEXCUDA.Compared to a purely MATLAB with GPU approach,our cross-platform method improves the speed five times.Because MATLAB is widely used in PAM and PACT,this study will open up new avenues for photoacoustic image reconstruction and relevant real-time imaging applications.
基金the National Nature Science Foundation of China(Nos.61601128,61762053)the Science and Technology Plan Funding of Jiangxi Province of China(No.20151BBE50076)+1 种基金the Research Foundations of Education Bureau of Jiangxi Province(Nos.GJJ151001,GJJ150984)the Open Project Funding of Key Laboratory of Jiangxi Province for Numerical Simulation and Emulation Techniques,China.
文摘Recently,analog visual transmission has attracted considerable attention owing to its graceful performance degradation for various wireless channels.In this study,we propose a novel analog visual communications system,named DVCast,in which an image denoising algorithm is used to fully utilize spatial correlation;moreover,the variable block size Discrete Cosine Transform(DCT)is used to preserve more correlation information in an image.Obviously,there is a tradeoff between system performance and computing complexity.Therefore,to improve the real-time performance of the proposed system,implementation of Block Matching with 3D filtering(BM3D)and DCT by Graphics Processing Units(GPUs)is introduced.According to DCT block size,i.e.,88,1616,and 3232,the schemes DVCast8,DVCast16,and DVCast32,respectively,are designed and implemented.Simulations show that DVCast with larger block size achieves better gain and visual quality than reference schemes.Moreover,it requires less computing time.DVCast32 outperforms conventional digital schemes by approximately 3.51 dB and achieves a 1.12 dB gain over state-of-the-art reference schemes.Furthermore,the analysis shows that DVCast can reduce overhead by at least 75%.
文摘A non-photorealistic rendering technique is a method to show various effects different from those of realistic image generation.Of the various techniques,flow-based image abstraction displays the shape and color features well and performs a stylistic visual abstraction.But real-time rendering is impossible when CPU is used because it applies various filtering and iteration methods.In this paper,we present real-time processing methods of video abstraction using open open computing language(OpenCL),technique of general-purpose computing on graphics processing units(GPGPU).Through the acceleration of general-purpose computing(GPU),16 frame-per-second(FPS)or greater is shown to process video abstraction.