Storm surge is often the marine disaster that poses the greatest threat to life and property in coastal areas. Accurate and timely issuance of storm surge warnings, so that appropriate countermeasures can be taken, is an important means of reducing storm surge-related losses. Numerical models are central to storm surge forecasting. To further improve forecast performance, we developed a numerical storm surge forecast model based on an unstructured spherical centroidal Voronoi tessellation (SCVT) grid. The model solves the shallow water equations in vector-invariant form and is discretized on an Arakawa C grid. The SCVT grid not only describes coastline information better but also avoids rigid transitions, and it achieves better global consistency by generating high-resolution grids in key areas through transition refinement. In addition, the simulation is accelerated with OpenACC-based GPU acceleration to meet the timeliness requirements of operational ensemble forecasting: simulating one day in the coastal waters of China takes only 37 s. The newly developed storm surge model was applied to simulate typhoon-induced storm surges in the coastal waters of China. Hindcast experiments on representative typhoon-induced storm surge events indicate that the model can reasonably simulate the distribution characteristics of storm surges. The simulated maximum storm surges and their occurrence times are consistent with observations at representative tide gauge stations, with mean absolute errors of 3.5 cm and 0.6 h, respectively, showing high accuracy and good application prospects.
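The skill metrics reported above (mean absolute error of the simulated surge peak and of its occurrence time) are computed from paired model/observation records at the tide gauges. A minimal sketch, with purely illustrative station values (not the paper's data):

```python
# Hedged sketch: MAE of simulated surge peaks (cm) and peak times (h)
# against tide gauge observations. Station values below are invented.
def mean_absolute_error(simulated, observed):
    """MAE over paired values; inputs are equal-length sequences."""
    assert len(simulated) == len(observed)
    return sum(abs(s - o) for s, o in zip(simulated, observed)) / len(simulated)

# (peak surge in cm, peak time in hours since a reference) per station
sim_peaks = [112.0, 86.0, 240.0]
obs_peaks = [108.0, 90.0, 243.0]
sim_times = [14.0, 20.5, 9.0]
obs_times = [14.5, 20.0, 8.0]

mae_peak = mean_absolute_error(sim_peaks, obs_peaks)  # cm
mae_time = mean_absolute_error(sim_times, obs_times)  # h
```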
Optical coherence tomography (OCT) imaging technology has significant advantages for in situ, noninvasive monitoring of biological tissues. However, it still faces challenges in data processing speed, image quality, and three-dimensional (3D) visualization. OCT technology, especially functional imaging techniques such as optical coherence tomography angiography (OCTA), requires a long acquisition time and produces large data volumes. Despite the substantial increase in the acquisition speed of swept-source optical coherence tomography (SS-OCT), data processing remains a significant challenge. Additionally, during in situ acquisition, image artifacts resulting from interface reflections or strong reflections from biological tissues and culturing containers hinder data visualization and further analysis. First, a customized frequency-domain filter with anti-banding suppression parameters was designed to suppress artifact noise. This study then proposed a graphics processing unit (GPU)-based real-time data processing pipeline for SS-OCT, achieving a measured line-processing rate of 800 kHz for fast, high-quality 3D data visualization. Furthermore, GPU-based real-time data processing for CC-OCTA was integrated to acquire dynamic information. Moreover, a vascular-like network chip was prepared using extrusion-based 3D printing and sacrificial materials, with the sacrificial material printed at the desired vascular network locations and then removed to form the vascular-like network. OCTA imaging was used to monitor the progression of sacrificial material removal and vascular-like network formation. GPU-based OCT thus enables real-time processing and visualization with artifact suppression, making it particularly suitable for in situ, noninvasive longitudinal monitoring of 3D-bioprinted tissue and vascular-like networks in microfluidic chips.
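The anti-banding idea behind the customized frequency-domain filter can be illustrated generically: banding artifacts concentrate energy in specific spectral bins, so zeroing (or attenuating) those bins removes the banding while keeping the rest of the signal. The sketch below uses a naive DFT on a toy 1-D signal; the paper's actual filter parameters and implementation are not reproduced here:

```python
# Hedged sketch of frequency-domain banding suppression: a constant signal
# plus a sinusoidal "banding" component; suppressing the banding bin (and
# its conjugate mirror) recovers the clean signal.
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

N, band_bin = 64, 8
signal = [1.0 + 0.5 * math.cos(2 * math.pi * band_bin * n / N) for n in range(N)]
spec = dft(signal)
for k in (band_bin, N - band_bin):  # suppress the banding frequency and its mirror
    spec[k] = 0j
filtered = [v.real for v in idft(spec)]
```

A production pipeline would use a GPU FFT and shaped (rather than hard-zero) suppression to avoid ringing.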
This paper presents a comprehensive exploration of the integration of the Internet of Things (IoT), big data analysis, cloud computing, and Artificial Intelligence (AI), which has led to an unprecedented era of connectivity. We delve into the emerging trend of machine learning on embedded devices, which enables tasks in resource-limited environments. However, the widespread adoption of machine learning raises significant privacy concerns, necessitating the development of privacy-preserving techniques. One such technique, secure multi-party computation (MPC), allows collaborative computations without exposing private inputs. Despite its potential, complex protocols and communication interactions hinder performance, especially on resource-constrained devices. Efforts to enhance efficiency have been made, but scalability remains a challenge. Given the success of GPUs in deep learning, leveraging embedded GPUs, such as those offered by NVIDIA, emerges as a promising solution. We therefore propose an Embedded GPU-based Secure Two-party Computation (EG-STC) framework for AI systems. To the best of our knowledge, this work represents the first endeavor to fully implement machine learning model training based on secure two-party computation on an embedded GPU platform. Our experimental results demonstrate the effectiveness of EG-STC. On an embedded GPU with a power draw of 5 W, our implementation achieved a secure two-party matrix multiplication throughput of 5881.5 kilo-operations per millisecond (kops/ms), with an energy efficiency ratio of 1176.3 kops/ms/W. Furthermore, leveraging our EG-STC framework, we achieved an overall time acceleration ratio of 5–6 times compared with solutions running on server-grade CPUs. Our solution also exhibited a reduced runtime, requiring only 60% to 70% of the runtime of previously best-known methods on the same platform. In summary, our research contributes to the advancement of secure and efficient machine learning implementations on resource-constrained embedded devices, paving the way for broader adoption of AI technologies in various applications.
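The abstract does not detail EG-STC's protocol, but secure two-party multiplication in this setting is commonly built on additive secret sharing with Beaver triples, which is the operation a GPU would batch into matrix form. A minimal single-machine simulation of one such multiplication (no networking, no GPU; the modulus and share layout are illustrative assumptions, not the paper's):

```python
# Hedged sketch: additive secret sharing + a Beaver triple to multiply two
# private values. z0 + z1 mod P reconstructs x*y without either party
# seeing the other's input. Trusted-dealer triple generation is assumed.
import random

P = 2_147_483_647  # public prime modulus (illustrative choice)

def share(v):
    """Split v into two additive shares mod P."""
    r = random.randrange(P)
    return r, (v - r) % P

def beaver_triple():
    a, b = random.randrange(P), random.randrange(P)
    return share(a), share(b), share(a * b % P)

def secure_mul(x_sh, y_sh):
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # masked differences d = x - a and e = y - b are safe to open publicly
    d = (x_sh[0] - a0 + x_sh[1] - a1) % P
    e = (y_sh[0] - b0 + y_sh[1] - b1) % P
    z0 = (d * e + d * b0 + e * a0 + c0) % P  # party 0 adds the public d*e term
    z1 = (d * b1 + e * a1 + c1) % P
    return z0, z1

x, y = 1234, 5678
z_sh = secure_mul(share(x), share(y))
product = sum(z_sh) % P
```

Secure matrix multiplication applies the same identity element-wise with matrix-valued triples, which is what makes it GPU-friendly.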
In recent years, graphics processing units (GPUs) have been applied to accelerate Monte Carlo (MC) simulations for proton dose calculation in radiotherapy. Nonetheless, current GPU platforms, such as Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), suffer from cross-platform limitations or a relatively high programming barrier. The Taichi toolkit, which was developed to overcome these difficulties, has been successfully applied to high-performance numerical computation. Based on the class II condensed history simulation scheme with various proton-nucleus interactions, we developed a GPU-accelerated MC engine for proton transport using the Taichi toolkit. Dose distributions in homogeneous and heterogeneous geometries were calculated for 110, 160, and 200 MeV protons and compared with those obtained by full MC simulations using TOPAS. The gamma passing rates were greater than 0.99 and 0.95 with criteria of 2 mm/2% and 1 mm/1%, respectively, in all benchmark tests. Moreover, the calculation speed was at least 5800 times that of TOPAS, and the number of lines of code was approximately 10 times smaller than that of CUDA or OpenCL implementations. Our study provides a highly accurate, efficient, and easy-to-use proton dose calculation engine for fast prototyping, beamlet calculation, and educational purposes.
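The gamma passing rate quoted above is the standard dose-validation metric: a reference point passes if some nearby test point agrees within combined dose-difference and distance-to-agreement tolerances. A 1-D global-gamma sketch (the paper's analysis is 3-D; this is a simplified illustration):

```python
# Hedged sketch: 1-D global gamma analysis. A point passes when
# min over test points of (dose_diff/dose_crit)^2 + (distance/dist_crit)^2 <= 1.
def gamma_pass_rate(ref, test, dx, dose_crit, dist_crit):
    """Fraction of reference points with gamma <= 1.
    ref/test: dose samples on the same uniform grid, spacing dx (mm).
    dose_crit: fraction of the reference maximum (global normalization)."""
    max_ref = max(ref)
    passed = 0
    for i, d_ref in enumerate(ref):
        gamma_sq = min(
            ((d_test - d_ref) / (dose_crit * max_ref)) ** 2
            + ((j - i) * dx / dist_crit) ** 2
            for j, d_test in enumerate(test)
        )
        if gamma_sq <= 1.0:
            passed += 1
    return passed / len(ref)

# identical distributions must pass everywhere (2 mm / 2% criteria)
doses = [0.1, 0.5, 1.0, 0.8, 0.2]
rate = gamma_pass_rate(doses, doses, dx=1.0, dose_crit=0.02, dist_crit=2.0)
```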
A computational fluid dynamics (CFD) solver for a GPU/CPU heterogeneous-architecture parallel computing platform is developed to simulate incompressible flows on billion-level grids. To solve the Poisson equation, the conjugate gradient method is used as the basic solver, and a Chebyshev method combined with a Jacobi sub-preconditioner is used as the preconditioner. The developed CFD solver shows good parallel efficiency, exceeding 90% in the weak-scalability test when the number of grid points allocated to each GPU card is greater than 208³. In the acceleration test, a simulation with 1040³ grid points on 125 GPU cards runs 203.6 times faster than on the same number of CPU cores. The developed solver is then tested on a two-dimensional lid-driven cavity flow and a three-dimensional Taylor-Green vortex flow. The results are consistent with previous results in the literature.
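The solver's Poisson step is preconditioned conjugate gradient. The serial skeleton below shows PCG with a plain Jacobi (diagonal) preconditioner only; the paper's Chebyshev-plus-Jacobi combination and the GPU parallelization are not reproduced:

```python
# Hedged sketch: Jacobi-preconditioned conjugate gradient for a small SPD
# system (dense list-of-lists), illustrating the algorithm the solver
# accelerates on GPUs.
def pcg(A, b, tol=1e-10, max_iter=200):
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = b[:]                                      # r = b - A*x0 with x0 = 0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

# 1-D Poisson-like SPD system
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 2.0, 2.0]
x = pcg(A, b)
```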
We proposed an improved graphics processing unit (GPU) acceleration approach for three-dimensional structural topology optimization using the element-free Galerkin (EFG) method. This approach can effectively eliminate race conditions under parallelization. We established a structural topology optimization model by combining the EFG method with the solid isotropic microstructures with penalization model. We explored in detail the GPU parallel algorithms for assembling the stiffness matrix, solving the discrete equations, analyzing sensitivity, and updating design variables. We also proposed a node pair-wise method for assembling the stiffness matrix and a node-wise method for sensitivity analysis to eliminate race conditions during parallelization. Furthermore, we investigated the effects of the thread block size, the number of degrees of freedom, and the convergence error of the preconditioned conjugate gradient (PCG) method on GPU computing performance. Finally, three numerical examples demonstrated the validity of the proposed approach and showed significant acceleration of structural topology optimization. To reduce the cost of the optimization calculation, we recommend appropriate values for the thread block size and the PCG convergence error.
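The race-condition issue that the node-wise scheme addresses can be shown in miniature. Element-wise ("scatter") assembly has many elements adding into the same nodal entry, which races under parallel execution; node-wise ("gather") assembly gives each node exclusive ownership of its own entry. A serial sketch demonstrating that the two orderings produce identical sums (the actual EFG assembly is more involved):

```python
# Hedged sketch: scatter vs gather assembly of a nodal quantity.
# elements: (node_i, node_j, contribution) triples.
elements = [(0, 1, 2.0), (1, 2, 3.0), (0, 2, 5.0)]
n_nodes = 3

# scatter (element-wise): each element writes into shared entries --
# this is where concurrent GPU threads would race on the same location
scatter = [0.0] * n_nodes
for i, j, v in elements:
    scatter[i] += v
    scatter[j] += v

# gather (node-wise): one thread per node sums only the contributions that
# touch that node, so no two threads ever write the same location
gather = [sum(v for i, j, v in elements if n in (i, j)) for n in range(n_nodes)]
```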
In this paper, stochastic global optimization algorithms, specifically a genetic algorithm and simulated annealing, are used to calibrate a dynamic option pricing model under stochastic volatility to market prices by adopting a hybrid programming approach. The performance of the dynamic option pricing model under the obtained optimal parameters is also discussed. To enhance throughput and reduce latency, a heterogeneous hybrid programming approach on the GPU was adopted, emphasizing a data-parallel implementation of the dynamic option pricing model on a GPU-based system. The compute-intensive segments of the pricing algorithms were offloaded to the GPU as OpenCL kernels. The GPU approach was found to reduce latency significantly, by up to 541 times compared with a parallel implementation on the CPU, cutting the computation time from 46.24 minutes to 5.12 seconds.
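The calibration loop pairs a stochastic search with a pricing function: propose parameters, price, compare against market quotes, accept or reject. A toy simulated-annealing sketch with a one-parameter stand-in pricing function (the actual stochastic-volatility model, parameter set, and GPU offload are not reproduced here):

```python
# Hedged sketch: simulated annealing calibrating one parameter (sigma) of a
# toy pricing function to synthetic "market" quotes. model_price is an
# invented stand-in, not the paper's pricing model.
import math
import random

random.seed(42)

def model_price(strike, sigma):
    return sigma * math.exp(-strike / 100.0)

strikes = [80, 100, 120]
market = [model_price(k, 0.3) for k in strikes]  # synthetic data, true sigma = 0.3

def cost(sigma):
    """Sum of squared pricing errors against the market quotes."""
    return sum((model_price(k, sigma) - m) ** 2 for k, m in zip(strikes, market))

sigma, temp = 1.0, 1.0
best_sigma, best_cost = sigma, cost(sigma)
for step in range(2000):
    cand = sigma + random.gauss(0.0, 0.1)          # local proposal
    delta = cost(cand) - cost(sigma)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        sigma = cand                                # accept downhill, or uphill with prob e^(-delta/T)
    if cost(sigma) < best_cost:
        best_sigma, best_cost = sigma, cost(sigma)
    temp *= 0.995                                   # geometric cooling schedule
```

In the GPU version, the expensive part, pricing the whole quote surface per candidate, is what gets evaluated in parallel.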
We implemented accurate free-form deformation (FFD) in terms of triangular Bezier surfaces as matrix multiplications in CUDA and rendered the results via OpenGL. Experimental results show that the proposed algorithm is more efficient than the previous GPU acceleration algorithm and tessellation shader algorithms.
Acquiring a set of features that emphasize the differences between normal data points and outliers can drastically facilitate the task of identifying outliers. In our work, we present a novel non-parametric evaluation criterion for filter-based feature selection with an eye towards the final goal of outlier detection. The proposed method seeks the subset of features that represents the inherent characteristics of the normal dataset while forcing outliers to stand out, making them more easily distinguished by outlier detection algorithms. Experimental results on real datasets show the advantage of our feature selection algorithm compared with popular and state-of-the-art methods. We also show that the proposed algorithm is able to overcome the small sample space problem and perform well on highly imbalanced datasets. Furthermore, due to the highly parallelizable nature of the feature selection, we implement the algorithm on a graphics processing unit (GPU) to gain significant speedup over the serial version. The benefits of the GPU implementation are two-fold: its performance scales very well in terms of the number of features as well as the number of data points.
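The flavor of such a filter criterion, scoring each feature independently by how strongly known-odd points deviate from the normal cloud, can be sketched as follows. This is a generic mean-shift-over-spread score invented for illustration, not the paper's non-parametric criterion:

```python
# Hedged sketch: rank features by |mean deviation of candidate outliers|
# normalized by the normal data's standard deviation in that feature.
def feature_outlier_scores(normal, candidates):
    """Higher score = feature makes the candidates stand out more."""
    n_feat = len(normal[0])
    scores = []
    for f in range(n_feat):
        col = [row[f] for row in normal]
        mu = sum(col) / len(col)
        var = sum((v - mu) ** 2 for v in col) / len(col)
        std = var ** 0.5 or 1.0  # guard against zero spread
        dev = sum(abs(row[f] - mu) for row in candidates) / len(candidates)
        scores.append(dev / std)
    return scores

normal = [[0.0, 5.0], [0.2, 5.1], [-0.1, 4.9], [0.1, 5.0]]
outliers = [[3.0, 5.05]]  # deviates strongly only in feature 0
scores = feature_outlier_scores(normal, outliers)
best_feature = scores.index(max(scores))
```

Because every feature is scored independently, this structure maps directly onto one-GPU-thread-per-feature parallelism.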
Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization), one of the most widely used software packages in this field, implements a Bayesian algorithm for cryo-EM structure determination. Many researchers have devoted effort to improving the performance of RELION to keep up with the analysis of ever-increasing dataset volumes. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for targeted optimization. We propose several optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, accelerated computation of symmetries, and memory affinity optimization. Experimental results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline model analysis to understand the effectiveness of our optimizations.
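The roofline model mentioned above bounds a kernel's attainable performance by whichever is lower: the machine's compute peak, or memory bandwidth times the kernel's arithmetic intensity. A one-function sketch with illustrative hardware numbers:

```python
# Hedged sketch of the roofline bound: attainable GFLOP/s is capped by
# min(compute peak, arithmetic intensity * memory bandwidth).
def attainable_gflops(flops, bytes_moved, peak_gflops, mem_bw_gbps):
    intensity = flops / bytes_moved            # FLOPs per byte
    return min(peak_gflops, intensity * mem_bw_gbps)

# illustrative memory-bound kernel: 2 FLOPs per 8 bytes -> intensity 0.25
perf = attainable_gflops(flops=2e9, bytes_moved=8e9,
                         peak_gflops=1000.0, mem_bw_gbps=200.0)
```

A kernel landing well below its roofline signals optimization headroom; one sitting on the bandwidth roof needs data-movement reduction rather than more FLOP throughput.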
In smartphones, vehicles, and wearable devices, GPS sensors are ubiquitous and collect a large amount of valuable spatial data from the real world. Given a set of weighted points and a rectangle r in space, a maximizing range sum (MaxRS) query finds the position of r that maximizes the total weight of the points covered by r (i.e., the range sum). It has a wide spectrum of applications in spatial crowdsourcing, facility location, and traffic monitoring. Most of the existing research focuses on Euclidean space; however, in real life, a user's moving route is constrained by the road network, and existing MaxRS query algorithms for road networks are inefficient. In this paper, we propose a novel GPU-accelerated algorithm, GAM, to tackle MaxRS queries in road networks efficiently in two phases. In phase 1, we partition the entire road network into many small cells by a grid, theoretically prove the correctness of parallel query results obtained by grid shifting, and propose an effective multi-grained pruning technique by which the majority of cells can be pruned without further checking. In phase 2, we design a GPU-friendly storage structure, the cell-based road network (CRN), and a two-level parallel framework to compute the final result in the remaining cells. Finally, we conduct extensive experiments on two real-world road networks, and the experimental results demonstrate that GAM is on average one order of magnitude faster than state-of-the-art competitors, with a maximum speedup of about 55 times.
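For the Euclidean version of the query, a small brute-force baseline makes the MaxRS definition concrete. It relies on a standard observation: an optimal placement of a closed axis-aligned rectangle exists whose left edge passes through some point's x-coordinate and whose bottom edge through some point's y-coordinate, so only those candidate anchors need testing. This sketch ignores road-network constraints and all of GAM's pruning:

```python
# Hedged sketch: brute-force MaxRS over candidate anchors.
# points: (x, y, weight) triples; the rectangle is w wide and h tall.
def max_rs(points, w, h):
    xs = sorted({x for x, y, wt in points})
    ys = sorted({y for x, y, wt in points})
    best = 0.0
    for x0 in xs:
        for y0 in ys:
            weight = sum(wt for x, y, wt in points
                         if x0 <= x <= x0 + w and y0 <= y <= y0 + h)
            best = max(best, weight)
    return best

points = [(0, 0, 1.0), (1, 1, 2.0), (5, 5, 3.0), (1.5, 0.5, 1.0)]
best = max_rs(points, w=2.0, h=2.0)
```

GAM's grid partitioning in effect distributes this candidate enumeration across cells so most cells can be discarded by weight bounds before any anchor is tested.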
As an important feature of autumn, scenes with large numbers of falling leaves are common in movies and games. However, simulating such scenes authentically and efficiently is a challenge for computer graphics. This paper proposes a GPU-based approach for simulating the falling motion of many leaves in real time. First, we use a motion-synthesis-based method to analyze the falling motion of the leaves, which enables us to describe complex falling trajectories using low-dimensional features. Second, we transmit a primitive-motion trajectory dataset together with the low-dimensional features of the falling leaves to video memory, allowing us to execute the appropriate calculations on the GPU.
In view of the frequent occurrence of floods due to climate change, and the fact that flood simulations require a large calculation domain with complex land types, this paper proposes an optimized non-uniform grid model combined with a high-resolution model based on graphics processing unit (GPU) acceleration to simulate the surface water flow process. For the grid division, the topographic gradient change is taken as the control variable, and different optimization criteria are designed for different land types. In the numerical model, the Godunov-type method is adopted for the spatial discretization, the TVD-MUSCL and Runge-Kutta methods are used to improve the model's spatial and temporal accuracy, and the simulation time is reduced by leveraging GPU acceleration. The model is applied to idealized and real-world case studies. The results show that the numerical model based on a non-uniform grid has good stability. In the urban inundation simulation, taking approximately 40%–50% of the urban average topographic gradient change as the threshold for the non-uniform grid division optimizes both calculation efficiency and accuracy. In this case, the calculation efficiency of the non-uniform grid based on the optimized parameters is 2–3 times that of the uniform grid, and the approach can be adopted for actual flood simulation over large-scale areas.
Phase field simulation has been actively studied as a powerful method to investigate microstructural evolution during solidification. However, performing phase field simulations at large length and time scales remains a great challenge. Here, graphics processing unit (GPU) computing is used in the phase field simulation, greatly accelerating the calculation. The results show that the computation with a GPU is about 36 times faster than that with a single central processing unit (CPU) core, demonstrating the feasibility of GPU-accelerated phase field simulation on a desktop computer. The GPU-accelerated strategy will bring new opportunities for the application of phase field simulation.
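The workload that benefits so strongly from the GPU is a stencil update applied to every grid point each time step. A 1-D explicit-Euler sketch of an Allen-Cahn-type phase-field equation illustrates the pattern (the solidification model in the paper is more elaborate; the parameters here are illustrative):

```python
# Hedged sketch: one explicit Euler step of the 1-D Allen-Cahn equation
# d(phi)/dt = eps^2 * phi_xx + phi - phi^3, zero-flux boundaries.
# This per-point stencil update is exactly what GPUs parallelize.
def allen_cahn_step(phi, dt, dx, eps):
    n = len(phi)
    new = phi[:]
    for i in range(n):
        left = phi[max(i - 1, 0)]
        right = phi[min(i + 1, n - 1)]
        lap = (left - 2 * phi[i] + right) / dx ** 2
        new[i] = phi[i] + dt * (eps ** 2 * lap + phi[i] - phi[i] ** 3)
    return new

# sharp interface between the two phases phi = -1 and phi = +1
phi = [-1.0] * 10 + [1.0] * 10
for _ in range(100):
    phi = allen_cahn_step(phi, dt=0.01, dx=0.1, eps=0.05)
```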
Traditional gradient-domain seamless image cloning is a time-consuming task, requiring the solution of Poisson's equations whenever the shape or position of the cloned region changes. Recently, a more efficient alternative, the mean-value coordinates (MVCs) based approach, was proposed to interpolate interior pixels by a weighted combination of values along the boundary. However, this approach cannot faithfully preserve the gradient in the cloning region. In this paper, we introduce harmonic cloning, which uses harmonic coordinates (HCs) instead of MVCs in image cloning. Benefiting from the non-negativity and interior locality of HCs, our interpolation generates a more accurate harmonic field across the cloned region, preserving results of as high a quality as Poisson cloning. Furthermore, with optimizations and an implementation on a graphics processing unit (GPU), we demonstrate that, compared with the method using MVCs, harmonic cloning achieves better quality while retaining real-time performance.
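The "harmonic field" at the heart of this method is the solution of Laplace's equation with the boundary mismatch as fixed data; interior values are then a smooth, bounded blend of the boundary. A tiny Jacobi-relaxation sketch of harmonic interpolation on a grid (illustrative only; the paper's HC construction and GPU implementation are not reproduced):

```python
# Hedged sketch: Jacobi relaxation of the discrete Laplace equation.
# Interior cells are repeatedly replaced by the average of their four
# neighbours while boundary ("fixed") cells keep their values.
def harmonic_fill(grid, fixed, iters=500):
    rows, cols = len(grid), len(grid[0])
    g = [row[:] for row in grid]
    for _ in range(iters):
        new = [row[:] for row in g]
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if (r, c) not in fixed:
                    new[r][c] = 0.25 * (g[r-1][c] + g[r+1][c]
                                        + g[r][c-1] + g[r][c+1])
        g = new
    return g

# 5x5 patch: boundary value 0 on three sides, 1 on the right edge
grid = [[0.0] * 5 for _ in range(5)]
for r in range(5):
    grid[r][4] = 1.0
fixed = {(r, c) for r in range(5) for c in range(5) if r in (0, 4) or c in (0, 4)}
result = harmonic_fill(grid, fixed)
```

The interior stays strictly between the boundary extremes (the discrete maximum principle), which is the boundedness property that makes harmonic interpolation artifact-free.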
Solute transport simulations are important in water pollution events. This paper introduces a finite volume Godunov-type model for solving a 4×4 matrix form of the hyperbolic conservation laws consisting of the 2D shallow water equations and transport equations. The model adopts the Harten-Lax-van Leer-contact (HLLC) approximate Riemann solver to calculate the cell interface fluxes. It deals well with changes in the dry and wet interfaces over realistic complex terrain and has a strong shock-capturing ability. Using monotonic upstream-centred scheme for conservation laws (MUSCL) linear reconstruction with finite slope and the Runge-Kutta time integration method achieves second-order accuracy. At the same time, the introduction of graphics processing unit (GPU)-accelerated computing technology greatly increases the computing speed. The model is validated against multiple benchmarks, and the results are in good agreement with analytical solutions and other published numerical predictions. In the third test case, the GPU and central processing unit (CPU) versions of the model take 3.865 s and 13.865 s, respectively, indicating that the GPU version increases the calculation speed by a factor of 3.6. In the fourth test case, comparing the GPU-based numerical model with the traditional CPU-based model, the calculation efficiency of the GPU model on grids of different resolutions is 9.8–44.6 times higher than that of the CPU model. The model therefore has better potential than previous models for large-scale simulation of solute transport in water pollution incidents, and can provide a reliable theoretical basis and strong data support for rapid assessment and early warning of water pollution accidents.
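The interface-flux computation is the per-edge kernel that the GPU evaluates in parallel. The sketch below implements the simpler two-wave HLL flux for the 1-D shallow water equations; the paper's HLLC solver adds a contact wave (needed to transport the solute sharply) on top of this same structure:

```python
# Hedged sketch: HLL approximate Riemann flux for the 1-D shallow water
# equations with state U = (h, h*u). Simplified relative to the paper's
# HLLC solver, which restores the contact wave.
import math

G = 9.81  # gravitational acceleration, m/s^2

def hll_flux(hL, uL, hR, uR):
    def phys_flux(h, u):
        return (h * u, h * u * u + 0.5 * G * h * h)
    cL, cR = math.sqrt(G * hL), math.sqrt(G * hR)  # gravity wave speeds
    sL = min(uL - cL, uR - cR)
    sR = max(uL + cL, uR + cR)
    FL, FR = phys_flux(hL, uL), phys_flux(hR, uR)
    if sL >= 0:
        return FL
    if sR <= 0:
        return FR
    UL, UR = (hL, hL * uL), (hR, hR * uR)
    return tuple((sR * fl - sL * fr + sL * sR * (ur - ul)) / (sR - sL)
                 for fl, fr, ul, ur in zip(FL, FR, UL, UR))

# still water on both sides: zero mass flux, pure hydrostatic pressure term
flux = hll_flux(2.0, 0.0, 2.0, 0.0)
```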
In this study, a computational framework from the field of artificial intelligence was applied to computational fluid dynamics (CFD). This framework, initially proposed by Google's AI department, is called TensorFlow. An improved CFD model based on this framework was developed with a high-order difference method: a constrained interpolation profile (CIP) scheme serves as the base flow solver for the advection term in the Navier-Stokes equations, and a preconditioned conjugate gradient (PCG) method was implemented in the model to solve the Poisson equation. New features including convolution, vectorization, and graphics processing unit (GPU) acceleration were implemented to raise the computational efficiency. The model was tested on several benchmark cases and shows good performance. Compared with our former CIP-based model, the present TensorFlow-based model also shows significantly higher computational efficiency in large-scale computation. The results indicate that TensorFlow could be a promising framework for CFD models due to its computational acceleration and convenience for programming.
Funding (storm surge forecast model): the National Natural Science Foundation of China under contract No. 42076214.
Funding (OCT monitoring of 3D bioprinting): supported by the National Key Research and Development Program of China (Nos. 2022YFA1104600 and 2022YFA1200208), the National Natural Science Foundation of China (No. 31927801), and the Key Research and Development Foundation of Zhejiang Province (No. 2022C01123).
Funding (EG-STC framework): supported in part by the Major Science and Technology Demonstration Project of the Jiangsu Provincial Key R&D Program under Grant No. BE2023025, in part by the National Natural Science Foundation of China under Grant No. 62302238, in part by the Natural Science Foundation of Jiangsu Province under Grant No. BK20220388, in part by the Natural Science Research Project of Colleges and Universities in Jiangsu Province under Grant No. 22KJB520004, and in part by the China Postdoctoral Science Foundation under Grant No. 2022M711689.
Funding (Taichi-based proton MC engine): supported by the National Natural Science Foundation of China (Nos. 11735003, 11975041, and 11961141004).
Funding: Supported by the National Natural Science Foundation of China (NSFC) Basic Science Center Program for 'Multiscale Problems in Nonlinear Mechanics' (Grant No. 11988102) and an NSFC project (Grant No. 11972038).
Abstract: A computational fluid dynamics (CFD) solver for a GPU/CPU heterogeneous-architecture parallel computing platform is developed to simulate incompressible flows on billion-level grid points. To solve the Poisson equation, the conjugate gradient method is used as the basic solver, and a Chebyshev method combined with a Jacobi sub-preconditioner is used as the preconditioner. The developed CFD solver shows good parallel efficiency, exceeding 90% in the weak-scalability test when the number of grid points allocated to each GPU card is greater than 208³. In the acceleration test, a simulation with 1040³ grid points on 125 GPU cards runs 203.6x faster than on the same number of CPU cores. The developed solver is then tested on a two-dimensional lid-driven cavity flow and the three-dimensional Taylor-Green vortex flow. The results are consistent with previous results in the literature.
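The preconditioned conjugate gradient iteration at the heart of such Poisson solvers is compact. Below is a minimal sketch with a plain Jacobi (diagonal) preconditioner on a small dense SPD system; the paper's solver layers a Chebyshev method on top of this, which is not reproduced here:

```python
def pcg(A, b, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient with a Jacobi (diagonal)
    preconditioner, for small dense SPD systems stored as lists of lists."""
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # M^{-1} for Jacobi

    x = [0.0] * n
    r = b[:]                                   # r = b - A*0
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) < tol * tol:
            break
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

x = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(x)  # ~[0.0909, 0.6364], i.e. [1/11, 7/11]
```

On a GPU the dot products become reductions and the matrix-vector product a stencil kernel, which is where the reported 90%+ weak-scaling efficiency comes from.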
Funding: This work is supported by the National Natural Science Foundation of China (Nos. 51875493, 51975503, and 11802261). The financial support to the first author is gratefully acknowledged.
Abstract: We propose an improved graphics processing unit (GPU) acceleration approach for three-dimensional structural topology optimization using the element-free Galerkin (EFG) method. The approach effectively eliminates race conditions under parallelization. We establish a structural topology optimization model by combining the EFG method with the solid isotropic microstructures with penalization (SIMP) model. We explore in detail the GPU parallel algorithms for assembling the stiffness matrix, solving the discrete equations, analyzing sensitivity, and updating the design variables. We also propose a node pair-wise method for assembling the stiffness matrix and a node-wise method for sensitivity analysis to eliminate race conditions during parallelization. Furthermore, we investigate the effects of the thread block size, the number of degrees of freedom, and the convergence tolerance of the preconditioned conjugate gradient (PCG) solver on GPU computing performance. Finally, three numerical examples demonstrate the validity of the proposed approach and show significant acceleration of structural topology optimization. To reduce the cost of the optimization calculation, we recommend appropriate choices of the thread block size and the PCG convergence tolerance.
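The SIMP model referred to above interpolates material stiffness from a density design variable and penalizes intermediate densities. A minimal sketch of the interpolation and its derivative (the per-node quantity a node-wise sensitivity pass computes; parameter values are illustrative):

```python
def simp_young(x, p=3.0, e0=1.0, e_min=1e-9):
    """SIMP interpolation of Young's modulus for a density variable x in
    [0, 1]: E(x) = E_min + x^p (E_0 - E_min), with penalization exponent p
    (p = 3 is a common choice; E_min keeps void elements non-singular)."""
    return e_min + x ** p * (e0 - e_min)

def simp_sensitivity(x, p=3.0, e0=1.0, e_min=1e-9):
    """dE/dx, the ingredient of element/node sensitivities in the update."""
    return p * x ** (p - 1) * (e0 - e_min)

print(simp_young(1.0))   # ~1.0: a solid element recovers E0
print(simp_young(0.5))   # ~0.125: intermediate densities are heavily penalized
```

Because each density's contribution is independent, both functions map naturally onto one GPU thread per node, provided the assembly avoids write conflicts, which is what the node pair-wise scheme addresses.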
Abstract: In this paper, stochastic global optimization algorithms, specifically a genetic algorithm and simulated annealing, are used to calibrate a dynamic option pricing model under stochastic volatility to market prices by adopting a hybrid programming approach. The performance of the dynamic option pricing model under the obtained optimal parameters is also discussed. To enhance throughput and reduce latency, a heterogeneous hybrid programming approach on the GPU was adopted, emphasizing a data-parallel implementation of the dynamic option pricing model on a GPU-based system. Kernel offloading of the compute-intensive segments of the pricing algorithms to the GPU was done in OpenCL. The GPU approach was found to reduce latency significantly, running up to 541 times faster than a parallel implementation on the CPU and cutting the computation time from 46.24 minutes to 5.12 seconds.
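The abstract does not state which stochastic-volatility model is used, so purely as an illustration of why such pricers are data-parallel, here is a plain Monte Carlo pricer for a European call under the Heston model (Euler scheme with full truncation); every name and parameter here is an assumption, not the paper's model:

```python
import math, random

def heston_call_mc(s0, k, r, t, v0, kappa, theta, sigma, rho,
                   n_paths=4000, n_steps=50, seed=0):
    """Monte Carlo price of a European call under Heston stochastic
    volatility. Each path is simulated independently, which is exactly
    what makes a data-parallel GPU mapping (one thread per path) natural."""
    random.seed(seed)
    dt = t / n_steps
    payoff_sum = 0.0
    for _ in range(n_paths):
        s, v = s0, v0
        for _ in range(n_steps):
            z1 = random.gauss(0.0, 1.0)
            z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0)
            vp = max(v, 0.0)  # full truncation keeps the variance usable
            s *= math.exp((r - 0.5 * vp) * dt + math.sqrt(vp * dt) * z1)
            v += kappa * (theta - vp) * dt + sigma * math.sqrt(vp * dt) * z2
        payoff_sum += max(s - k, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths

price = heston_call_mc(100.0, 100.0, 0.02, 1.0, 0.04, 1.5, 0.04, 0.3, -0.7)
print(price)
```

In an OpenCL port, the inner path loop becomes the kernel body and the payoff average a reduction, which is the structure of the offloading the paper describes.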
Funding: Supported by the National Natural Science Foundation of China (Nos. 61170138 and 61472349).
Abstract: We implemented accurate free-form deformation (FFD) in terms of triangular Bézier surfaces as matrix multiplications in CUDA and rendered the results via OpenGL. Experimental results show that the proposed algorithm is more efficient than the previous GPU acceleration algorithm and tessellation shader algorithms.
Abstract: Acquiring a set of features that emphasize the differences between normal data points and outliers can drastically facilitate the task of identifying outliers. In our work, we present a novel non-parametric evaluation criterion for filter-based feature selection with an eye towards the final goal of outlier detection. The proposed method seeks the subset of features that represent the inherent characteristics of the normal dataset while forcing outliers to stand out, making them more easily distinguished by outlier detection algorithms. Experimental results on real datasets show the advantage of our feature selection algorithm compared with popular and state-of-the-art methods. We also show that the proposed algorithm is able to overcome the small sample space problem and perform well on highly imbalanced datasets. Furthermore, due to the highly parallelizable nature of the feature selection, we implement the algorithm on a graphics processing unit (GPU) to gain significant speedup over the serial version. The benefits of the GPU implementation are two-fold, as its performance scales very well in terms of the number of features, as well as the number of data points.
Funding: Supported by the National Key R&D Program of China (2020YFB1506703), the National Natural Science Foundation of China (Grant No. 62072018), and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2019A12).
Abstract: Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization) implements a Bayesian algorithm for cryo-EM structure determination and is one of the most widely used software packages in this field. Many researchers have devoted effort to improving the performance of RELION to keep up with the analysis of ever-increasing dataset volumes. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for targeted optimization. We propose several performance optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, acceleration of the symmetry computation, and memory affinity optimization. The experimental results show that our optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline model analysis to understand the effectiveness of our optimizations.
Funding: This work was supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2019YFB2101902, the National Natural Science Foundation of China under Grant Nos. U19A2059 and 62102119, and the CCF-Baidu Open Fund (CCF-BAIDU) under Grant No. OF2021011.
Abstract: GPS sensors are ubiquitous in smart phones, vehicles, and wearable devices, and they collect a great deal of valuable spatial data from the real world. Given a set of weighted points and a rectangle r in space, a maximizing range sum (MaxRS) query finds the position of r that maximizes the total weight of the points covered by r (i.e., the range sum). It has a wide spectrum of applications in spatial crowdsourcing, facility location, and traffic monitoring. Most of the existing research focuses on Euclidean space; however, in real life, a user's moving route is constrained by the road network, and the existing MaxRS query algorithms for road networks are inefficient. In this paper, we propose a novel GPU-accelerated algorithm, GAM, to answer MaxRS queries in road networks efficiently in two phases. In phase 1, we partition the entire road network into many small cells with a grid and theoretically prove the correctness of parallel query results obtained by grid shifting, and we then propose an effective multi-grained pruning technique by which the majority of cells can be pruned without further checking. In phase 2, we design a GPU-friendly storage structure, the cell-based road network (CRN), and a two-level parallel framework to compute the final result over the remaining cells. Finally, we conduct extensive experiments on two real-world road networks, and the experimental results demonstrate that GAM is on average one order of magnitude faster than state-of-the-art competitors, with a maximum speedup of about 55 times.
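To make the MaxRS problem definition concrete, here is a brute-force Euclidean baseline (not the paper's GAM algorithm, which works on road networks): an optimal placement always exists with the rectangle's left and bottom edges touching input points, so it suffices to try those candidate positions.

```python
def max_range_sum(points, w, h):
    """Brute-force MaxRS in the plane: for each candidate placement whose
    left edge x and bottom edge y come from input point coordinates, sum
    the weights of covered points and keep the best total. O(n^3)."""
    best = 0.0
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    for x in xs:
        for y in ys:
            total = sum(wt for (px, py, wt) in points
                        if x <= px <= x + w and y <= py <= y + h)
            best = max(best, total)
    return best

pts = [(0, 0, 1.0), (1, 1, 2.0), (5, 5, 4.0), (1.5, 0.5, 1.5)]
print(max_range_sum(pts, 2.0, 2.0))  # 4.5: the 2x2 window covers the three clustered points
```

GAM's grid partitioning and pruning exist precisely to avoid this cubic enumeration; the baseline only illustrates what the query computes.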
Funding: Supported by the National High-tech Research and Development Program of China (No. 2013AA013903).
Abstract: As an important feature of autumn, scenes with large numbers of falling leaves are common in movies and games. However, it is a challenge for computer graphics to simulate such scenes in an authentic and efficient manner. This paper proposes a GPU-based approach for simulating the falling motion of many leaves in real time. Firstly, we use a motion-synthesis-based method to analyze the falling motion of the leaves, which enables us to describe complex falling trajectories using low-dimensional features. Secondly, we transmit a primitive-motion trajectory dataset together with the low-dimensional features of the falling leaves to video memory, allowing us to execute the appropriate calculations on the GPU.
Funding: This work was supported by the Shaanxi International Science and Technology Cooperation and Exchange Program (Grant No. 2017KW-014), the National Natural Science Foundation of China (Grant No. 51609199), and the National Key Research and Development Program of China (Grant No. 2016YFC0402704).
Abstract: In view of the frequent occurrence of floods due to climate change, and the fact that a large calculation domain with complex land types is required for flood simulations, this paper proposes an optimized non-uniform grid model combined with a high-resolution model based on graphics processing unit (GPU) acceleration to simulate the surface water flow process. For the grid division, the topographic gradient change is taken as the control variable, and different optimization criteria are designed for different land types. In the numerical model, a Godunov-type method is adopted for the spatial discretization, the TVD-MUSCL and Runge-Kutta methods are used to improve the model's spatial and temporal accuracy, and the simulation time is reduced by leveraging GPU acceleration. The model is applied to idealized and real-world case studies. The results show that the numerical model based on a non-uniform grid has good stability. In the simulation of urban inundation, taking approximately 40%–50% of the average urban topographic gradient change as the threshold for the non-uniform grid division optimizes both the calculation efficiency and the accuracy. In this case, the calculation efficiency of the non-uniform grid based on the optimized parameters is 2–3 times that of the uniform grid, and the approach can be adopted for actual flood simulation in large-scale areas.
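The TVD-MUSCL scheme named above limits the reconstructed slope in each cell so that no new extrema appear. A minimal sketch of the minmod limiter and the resulting face values (illustrative, not the paper's solver):

```python
def minmod(a, b):
    """Minmod slope limiter: the smaller-magnitude slope when signs agree,
    zero at extrema; this is what keeps a MUSCL scheme TVD."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b

def muscl_faces(u, i, dx=1.0):
    """Limited linear reconstruction of cell i: values at its left and
    right faces from the cell averages u[i-1], u[i], u[i+1]."""
    slope = minmod((u[i] - u[i - 1]) / dx, (u[i + 1] - u[i]) / dx)
    return u[i] - 0.5 * dx * slope, u[i] + 0.5 * dx * slope

u = [0.0, 1.0, 4.0, 5.0]
print(muscl_faces(u, 2))  # (3.5, 4.5): limited by the smaller slope, 1.0
```

The Godunov-type flux is then evaluated from these face values, one pair per cell interface, which is an embarrassingly parallel pattern well suited to the GPU acceleration the paper uses.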
Funding: Supported by the China Postdoctoral Science Foundation (Grant No. 2013M540772) and the Young Scientists Fund of the National Natural Science Foundation of China (Grant Nos. 61203233, 51101124, and 51101125).
Abstract: Phase field simulation has been actively studied as a powerful method to investigate microstructural evolution during solidification. However, it remains a great challenge to perform phase field simulations on large length and time scales. Graphics processing unit (GPU) computing is applied here to the phase field simulation, greatly accelerating the calculation. The results show that the computation with a GPU is about 36 times faster than that with a single central processing unit (CPU) core, demonstrating the feasibility of GPU-accelerated phase field simulation on a desktop computer. The GPU-accelerated strategy opens a new opportunity for the application of phase field simulation.
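The abstract does not specify the phase field formulation; as a generic illustration of why such simulations parallelize so well, here is one explicit step of a 1-D Allen-Cahn equation with a double-well potential (every coefficient here is an assumed placeholder):

```python
def allen_cahn_step(phi, dt=0.05, dx=1.0, eps2=1.0, m=1.0):
    """One explicit Euler step of the 1-D Allen-Cahn phase field equation
    d(phi)/dt = M * (eps^2 * laplacian(phi) - (phi^3 - phi)),
    with copied-edge (zero-flux) boundaries. Each cell reads only its two
    neighbors, so the update maps directly onto one GPU thread per cell."""
    n = len(phi)
    out = phi[:]
    for i in range(n):
        left = phi[max(i - 1, 0)]
        right = phi[min(i + 1, n - 1)]
        lap = (left - 2.0 * phi[i] + right) / (dx * dx)
        out[i] = phi[i] + dt * m * (eps2 * lap - (phi[i] ** 3 - phi[i]))
    return out

phi = [1.0] * 5
print(allen_cahn_step(phi))  # a uniform phase (phi = 1) is stationary
```

The 36x speedup the paper reports comes from running exactly this kind of local stencil update over millions of cells concurrently instead of serially.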
Funding: Supported in part by the National Natural Science Foundation of China (No. 60903037) and the National Basic Research Program (973) of China (No. 2009CB320803).
Abstract: Traditional gradient-domain seamless image cloning is a time-consuming task, requiring Poisson's equation to be solved whenever the shape or position of the cloned region changes. Recently, a more efficient alternative, the mean-value coordinates (MVCs)-based approach, was proposed to interpolate interior pixels from a weighted combination of values along the boundary. However, this approach cannot faithfully preserve the gradient in the cloning region. In this paper, we introduce harmonic cloning, which uses harmonic coordinates (HCs) instead of MVCs in image cloning. Benefiting from the non-negativity and interior locality of HCs, our interpolation generates a more accurate harmonic field across the cloned region, preserving results of as high a quality as Poisson cloning. Furthermore, with optimizations and an implementation on a graphics processing unit (GPU), we demonstrate that, compared with the method using MVCs, harmonic cloning gains better quality while retaining real-time performance.
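Harmonic coordinates have no closed form (they are solved for numerically), but the MVC baseline the paper compares against does, and it shows what "a weighted combination of values along the boundary" means. A sketch of the standard mean-value weight formula for a point inside a polygon:

```python
import math

def mean_value_coords(poly, x):
    """Mean-value coordinates of an interior point x w.r.t. a closed polygon:
    w_i = (tan(a_{i-1}/2) + tan(a_i/2)) / |v_i - x|, then normalized, where
    a_i is the signed angle at x spanned by edge (v_i, v_{i+1})."""
    n = len(poly)
    d = [math.hypot(vx - x[0], vy - x[1]) for vx, vy in poly]
    ang = []
    for i in range(n):
        j = (i + 1) % n
        u = (poly[i][0] - x[0], poly[i][1] - x[1])
        v = (poly[j][0] - x[0], poly[j][1] - x[1])
        cross = u[0] * v[1] - u[1] * v[0]
        dot = u[0] * v[0] + u[1] * v[1]
        ang.append(math.atan2(cross, dot))
    w = [(math.tan(ang[i - 1] / 2) + math.tan(ang[i] / 2)) / d[i]
         for i in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(mean_value_coords(square, (0.5, 0.5)))  # each weight ~0.25 by symmetry
```

An interior pixel is then interpolated as the weight-sum of boundary values; harmonic cloning replaces these weights with harmonic coordinates, which stay non-negative and local to the interior.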
Funding: Project supported by the National Natural Science Foundation of China (Nos. 52009104 and 52079106), the Shaanxi Provincial Department of Water Resources Project (No. 2017slkj-14), and the Shaanxi Provincial Department of Science and Technology Project (No. 2017JQ3043), China.
Abstract: Solute transport simulations are important in water pollution events. This paper introduces a finite-volume Godunov-type model for solving a 4×4 matrix form of the hyperbolic conservation laws consisting of the 2D shallow water equations and transport equations. The model adopts the Harten-Lax-van Leer-contact (HLLC) approximate Riemann solver to calculate the cell-interface fluxes. It deals well with changes at wet/dry interfaces over real, complex terrain and has a strong shock-capturing ability. Using monotonic upstream-centred scheme for conservation laws (MUSCL) linear reconstruction with limited slopes together with Runge-Kutta time integration achieves second-order accuracy. At the same time, the introduction of graphics processing unit (GPU)-accelerated computing greatly increases the computing speed. The model is validated against multiple benchmarks, and the results are in good agreement with analytical solutions and other published numerical predictions. In the third test case, the GPU and central processing unit (CPU) versions of the model take 3.865 s and 13.865 s, respectively, indicating that the GPU model increases the calculation speed by a factor of 3.6. In the fourth test case, comparing the GPU-based numerical model with the traditional CPU-based model, the calculation efficiency of the GPU model on grids of different resolutions is 9.8–44.6 times higher than that of the CPU model. The model therefore has better potential than previous models for large-scale simulation of solute transport in water pollution incidents, and it can provide a reliable theoretical basis and strong data support for the rapid assessment of, and early warning for, water pollution accidents.
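The second-order Runge-Kutta time integration typically paired with MUSCL reconstruction is the two-stage strong-stability-preserving (SSP) scheme. A minimal sketch, with a scalar decay problem standing in for the shallow-water right-hand side:

```python
def ssp_rk2_step(u, rhs, dt):
    """One step of the two-stage SSP Runge-Kutta scheme (Heun's method):
    u1 = u + dt*L(u);  u_next = (u + u1 + dt*L(u1)) / 2.
    Paired with a TVD spatial reconstruction this gives overall 2nd order."""
    u1 = [ui + dt * ri for ui, ri in zip(u, rhs(u))]
    u2 = [ui + dt * ri for ui, ri in zip(u1, rhs(u1))]
    return [0.5 * (ui + vi) for ui, vi in zip(u, u2)]

# Scalar test problem du/dt = -u: one step from u = 1 with dt = 0.1
decay = lambda u: [-ui for ui in u]
print(ssp_rk2_step([1.0], decay, 0.1))  # ~[0.905], i.e. 1 - dt + dt^2/2
```

In the actual solver, `rhs` would be the HLLC flux divergence over all cells, evaluated twice per step; both stages are fully parallel over cells, which is why the scheme fits GPUs well.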
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 51679212 and 51979245).
Abstract: In this study, a computational framework from the field of artificial intelligence is applied to computational fluid dynamics (CFD). This framework, initially proposed by Google's AI department, is called TensorFlow. An improved CFD model based on this framework was developed with a high-order difference method, the constrained interpolation profile (CIP) scheme, as the base flow solver for the advection term in the Navier-Stokes equations, and the preconditioned conjugate gradient (PCG) method was implemented in the model to solve the Poisson equation. New features including convolution, vectorization, and graphics processing unit (GPU) acceleration were implemented to raise the computational efficiency. The model was tested on several benchmark cases and shows good performance. Compared with our former CIP-based model, the present TensorFlow-based model also shows significantly higher computational efficiency in large-scale computation. The results indicate that TensorFlow could be a promising framework for CFD models owing to its computational acceleration capability and convenience for programming.
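The key observation behind expressing a CFD solver in a deep-learning framework is that finite-difference stencils are small convolutions. A pure-Python sketch of the 5-point Jacobi iteration for the Poisson equation (a simpler stand-in for the PCG solver; in TensorFlow the interior update would become a single conv op running on the GPU):

```python
def jacobi_poisson(f, u, dx, iters):
    """Jacobi iterations for the 2-D Poisson equation lap(u) = f on the
    interior of a grid with fixed (Dirichlet) boundary values. The 5-point
    stencil update of every interior cell is a small convolution."""
    ny, nx = len(u), len(u[0])
    for _ in range(iters):
        new = [row[:] for row in u]
        for i in range(1, ny - 1):
            for j in range(1, nx - 1):
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1]
                                    - dx * dx * f[i][j])
        u = new
    return u

# Laplace problem (f = 0): top boundary held at 1, all other boundaries at 0
n = 5
u0 = [[1.0] * n] + [[0.0] * n for _ in range(n - 1)]
f = [[0.0] * n for _ in range(n)]
u = jacobi_poisson(f, u0, dx=1.0, iters=200)
print(u[2][2])  # interior value settles between the boundary extremes
```

Because every interior cell is updated from the same fixed stencil, frameworks built around tensor ops and convolutions can evaluate the whole sweep as one vectorized GPU kernel, which is the efficiency gain the paper reports.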