Abstract: This study examines optimization techniques for GPU-based parallel programming models, which are pivotal for advancing high-performance computing (HPC). Emphasizing the transition of GPUs from graphics-centric processors to versatile computing units, it covers the optimization of memory access, thread management, algorithmic design, and data structures. These optimizations are critical for exploiting the parallel processing capabilities of GPUs, and the study addresses both theoretical frameworks and practical implementations. By integrating strategies such as memory coalescing, dynamic scheduling, and parallel algorithmic transformations, this research aims to significantly raise computational efficiency and throughput. The findings underscore the potential of optimized GPU programming to transform computational tasks across various domains and point to a pathway toward greater processing power and efficiency in HPC environments. The paper contributes to the academic discourse on GPU optimization and provides actionable insights for developers, fostering advances in computational science and technology.
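To make the memory coalescing point concrete, the sketch below counts how many memory segments a single warp touches under different access strides. It is a simplified model, not taken from the paper: the 32-thread warp, 4-byte elements, and 128-byte segments are assumed values, and `warp_transactions` is a hypothetical helper name.

```python
def warp_transactions(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count distinct memory segments touched by one warp.

    Assumed model: thread t reads element t * stride_elems, and every
    distinct segment_bytes-aligned segment costs one transaction.
    """
    segments = {
        (t * stride_elems * elem_bytes) // segment_bytes
        for t in range(warp_size)
    }
    return len(segments)


# Neighbouring threads read neighbouring words: one segment serves the warp.
coalesced = warp_transactions(stride_elems=1)
# Each thread jumps a full 32-element row: every thread hits its own segment.
strided = warp_transactions(stride_elems=32)
```

Under these assumptions the strided pattern needs one transaction per thread, which is why layouts that let consecutive threads read consecutive addresses are usually preferred.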
Funding: This study was financially supported by the National Key Research and Development Program of China (Grant Nos. 2017YFE0111400 and 2016YCF1401505), the National Natural Science Foundation of China (Grant Nos. 41576179 and 51639004), and the China Postdoctoral Science Foundation (Grant No. 2020M670746).
Abstract: The ice resistance on a ship hull affects the safety of the hull structure and the maneuvering performance of the ship in ice-covered regions. In this paper, the discrete element method (DEM) is adopted to simulate the interaction between level ice and a ship hull. The level ice is modeled with 3D bonded spherical elements, taking into account the buoyancy and drag force of the water. The parallel bonding approach and a de-bonding criterion are adopted to model the freezing and breakage of level ice. The ship hull is constructed from rigid triangular elements. To improve computational efficiency, a GPU-based parallel algorithm was developed for the DEM simulations. During the interaction between the ship hull and level ice, the ice cover breaks into small blocks when the interparticle stress approaches the bonding strength. The global ice resistance on the hull is calculated from the contacts between ice elements and hull elements during navigation. The influences of ice thickness and navigation speed on the dynamic ice force are analyzed with respect to the breakage mechanism of the ice cover. The Lindqvist and Riska formulas for ice resistance on a ship hull are used to validate the DEM simulation. The comparison shows that the DEM result lies between those of the Lindqvist and Riska formulas. The proposed DEM is therefore an effective approach for determining the ice resistance on a ship hull. This work can aid hull structure design and navigation operations in ice-covered regions.
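The de-bonding step can be illustrated with a minimal sketch. It assumes that "approaches the bonding strength" means "reaches or exceeds it", that stress is simply force divided by the bond's circular cross-section, and that bending moments are ignored; the `Bond` class, `update_bond` function, and all parameter names are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Bond:
    i: int                # index of the first sphere
    j: int                # index of the second sphere
    radius: float         # radius of the bond disc [m]
    intact: bool = True


def update_bond(bond, normal_force, shear_force, tensile_strength, shear_strength):
    """Break the bond once either stress reaches its strength.

    normal_force > 0 denotes tension. Stresses are force / bond area;
    moment contributions are left out of this simplified check.
    """
    area = math.pi * bond.radius ** 2
    normal_stress = normal_force / area
    shear_stress = abs(shear_force) / area
    if normal_stress >= tensile_strength or shear_stress >= shear_strength:
        bond.intact = False
    return bond.intact


bond = Bond(i=0, j=1, radius=0.1)
still_bonded = update_bond(bond, normal_force=10.0, shear_force=0.0,
                           tensile_strength=1000.0, shear_strength=1000.0)
broken = not update_bond(bond, normal_force=100.0, shear_force=0.0,
                         tensile_strength=1000.0, shear_strength=1000.0)
```

Because each bond is checked independently of the others, a loop over bonds of this kind maps naturally onto one GPU thread per bond.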
Funding: This paper was supported by the National Natural Science Foundation of China (Grant No. 61472289) and the National Key Research and Development Project (2016YFC0106305). We would also like to thank the anonymous reviewers for their valuable and constructive comments.
Abstract: Hardware/software (HW/SW) partitioning is an essential step in hardware/software co-design. For large problems, it is difficult to achieve both good solution quality and short running time. This paper presents an efficient GPU-based parallel tabu search algorithm (GPTS) for HW/SW partitioning. A single GPU kernel for compacting the neighborhood is proposed, which in theory reduces the number of GPU global memory accesses. A kernel fusion strategy is further proposed to reduce the number of GPU global memory accesses in GPTS. To further minimize the transfer overhead between CPU and GPU, an optimized transfer strategy for GPU-based tabu evaluation is proposed, which takes into account the case in which the candidates do not satisfy the given constraint. Experiments show that GPTS outperforms state-of-the-art tabu search work and is competitive with other methods for HW/SW partitioning. The proposed parallelization is significant given that it targets an ordinary GPU platform.
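For readers unfamiliar with tabu search in this setting, here is a sequential CPU sketch. It is not the GPTS algorithm: it assumes a common formulation in which each task has a software time, a hardware time, and a hardware area, and the goal is to minimize total time under an area limit. The function name, parameters, and example data are illustrative assumptions.

```python
import random


def tabu_search(sw_time, hw_time, hw_area, area_limit, iters=200, tenure=3, seed=0):
    """Minimise total time subject to total hardware area <= area_limit.

    x[i] == 1 places task i in hardware, 0 keeps it in software.
    """
    rng = random.Random(seed)
    n = len(sw_time)

    def cost(x):
        return sum(hw_time[i] if x[i] else sw_time[i] for i in range(n))

    def area(x):
        return sum(hw_area[i] for i in range(n) if x[i])

    current = [0] * n                  # all-software start uses no area
    best, best_cost = current[:], cost(current)
    tabu_until = [0] * n               # iteration until which flipping i is tabu

    for it in range(1, iters + 1):
        candidates = []
        for i in range(n):             # neighbourhood: flip a single task
            neighbour = current[:]
            neighbour[i] ^= 1
            if area(neighbour) > area_limit:
                continue               # discard infeasible candidates
            c = cost(neighbour)
            aspirated = c < best_cost  # a tabu move is allowed if it beats the best
            if tabu_until[i] >= it and not aspirated:
                continue
            candidates.append((c, rng.random(), i, neighbour))
        if not candidates:
            continue                   # wait for tabu entries to expire
        c, _, i, current = min(candidates)
        tabu_until[i] = it + tenure
        if c < best_cost:
            best, best_cost = current[:], c
    return best, best_cost


example_area = [5, 4, 3, 2]
best, best_cost = tabu_search([10, 8, 6, 4], [2, 3, 1, 1], example_area, area_limit=8)
```

The inner loop that scores every single-flip neighbour is the part that evaluates independent candidates, which is the kind of work a GPU implementation would distribute across threads.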
Funding: Supported by Research Funding of Huddersfield University: GPU-based High Performance Computing for Signal Processing (No. 1008/REU117).
Abstract: The sense of being within a three-dimensional (3D) space and interacting with virtual 3D objects in a computer-generated virtual environment (VE) often requires essential image, vision, and sensor signal processing techniques such as differentiating and denoising. This paper describes novel implementations of Gaussian filtering for characteristic signal extraction and of wavelet-based image denoising algorithms that run on the graphics processing unit (GPU). Significant acceleration over standard CPU implementations is obtained by exploiting the data parallelism of modern programmable graphics hardware, and the CPU is freed up to run other computations, such as artificial intelligence (AI) and physics, more efficiently. The proposed GPU-based Gaussian filtering can extract surface information from a real object and provide its material features for rendering and illumination. The wavelet-based signal denoising for large digital images realized in this project provided better realism for VE visualization without sacrificing the real-time and interactive performance of an application.
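The data parallelism mentioned here is easy to see in a one-dimensional Gaussian filter, sketched below on the CPU. The kernel radius, the border clamping, and the function names are assumptions made for illustration and are not details from the paper.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_filter_1d(signal, sigma=1.0, radius=3):
    """Smooth a 1D signal; indices are clamped at the borders."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):  # each output sample depends only on the input
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


flat = gaussian_filter_1d([2.0] * 10)
spike = gaussian_filter_1d([0.0] * 5 + [1.0] + [0.0] * 5)
```

Every output sample is computed from the input alone, so on a GPU each iteration of the outer loop could be assigned to its own thread.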
Funding: This work was supported by the National Natural Science Foundation of China under Grant No. 61472111, the Zhejiang Provincial Natural Science Foundation of China under Grant No. LQ13F020016, the Foundation of Zhejiang Educational Committee under Grant No. Y201224034, and the Scientific Research Start Foundation of Hangzhou Dianzi University under Grant No. KYS225613032.
Abstract: Previous collision detection methods for virtual disassembly mainly detect collisions at discrete time intervals and use oriented bounding boxes to speed up the process. However, these discrete methods cannot guarantee that no penetration occurs while the components move. Moreover, components that have become embedded in each other cannot be separated in the subsequent process. To solve these problems, we propose an approach for real-time collision handling that utilizes the computational power of modern GPUs. First, we present a novel GPU-based collision handling framework for virtual disassembly. Second, we use continuous collision detection based on collision streams to guarantee that no collision is missed. Finally, we introduce a triangle intersection detection algorithm to handle the case in which collisions cannot be detected because the components are embedded in each other in the initial configuration. The experimental results show that our method improves the overall performance of collision detection and achieves real-time simulation.
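The difference between discrete and continuous detection can be shown with a deliberately simplified case: two spheres moving linearly over one time step. The paper works with triangle meshes and collision streams, so this is only an analogy; `sphere_toi` and the example values are hypothetical.

```python
import math


def sphere_toi(p0, v0, r0, p1, v1, r1):
    """Earliest t in [0, 1] at which two linearly moving spheres touch, or None.

    p0, p1 are start positions; v0, v1 are displacements over the step.
    """
    dp = [a - b for a, b in zip(p0, p1)]
    dv = [a - b for a, b in zip(v0, v1)]
    r = r0 + r1
    a = sum(x * x for x in dv)
    b = 2.0 * sum(x * y for x, y in zip(dp, dv))
    c = sum(x * x for x in dp) - r * r
    if c <= 0.0:
        return 0.0          # already overlapping at the start of the step
    if a == 0.0:
        return None         # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None         # the paths never come within r of each other
    t = (-b - math.sqrt(disc)) / (2.0 * a)
    return t if 0.0 <= t <= 1.0 else None


# A small sphere passes straight through a static one within a single step.
start, move = (-5.0, 0.0, 0.0), (10.0, 0.0, 0.0)
end = tuple(s + m for s, m in zip(start, move))
origin, rest = (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
overlap_at_start = math.dist(start, origin) < 1.0
overlap_at_end = math.dist(end, origin) < 1.0
toi = sphere_toi(start, move, 0.5, origin, rest, 0.5)
```

Sampling only the start and end of the step reports no overlap, while solving for the time of impact finds the contact inside the step; this is the tunnelling problem that continuous methods are meant to rule out.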
Funding: Supported by the Laoshan Laboratory (Grant No. LSKJ202202000), the National Natural Science Foundation of China (Grant Nos. 12032002 and U22A20256), and the Natural Science Foundation of Beijing (Grant No. L212023).
Abstract: Vehicle wading is a complex fluid-structure interaction (FSI) problem that has recently attracted great attention from the automotive industry, especially for electric vehicles. As a meshless Lagrangian particle method, smoothed particle hydrodynamics (SPH) is one of the most suitable candidates for simulating vehicle wading because of its inherent advantages in modeling free-surface flows, splash, and moving interfaces. Nevertheless, the unavoidable query for neighboring particles within the support domain incurs considerable computational cost and thus limits its application in large-scale 3D simulations. In this work, a GPU-based SPH method with an adaptive spatial sort technique is developed for simulations of vehicle wading. In addition, a fast, easy-to-implement particle generator is presented for isotropic initialization of the complex vehicle geometry with optimal interpolation properties. A comparative study of vehicle wading through a puddle, between the GPU-based SPH method and two commercial software packages, is used to verify the capability of the GPU-based SPH method in terms of convergence, kinematic characteristics, and computing performance. Finally, different vehicle speeds, water depths, and puddle widths are tested to investigate vehicle wading numerically. The results demonstrate that the adaptive spatial sort technique significantly improves the computing performance of the GPU-based SPH method and makes it a competitive tool for studying large-scale 3D FSI problems, including vehicle wading. Some useful findings on the critical vehicle speed, water depth, and boundary wall effect are also reported.
Funding: This work was supported in part by the National Natural Science Foundation of China (No. 61632020, UI936209) and the Beijing National Science Foundation (No. 4192067).
Abstract: The advent of CUDA-enabled GPUs makes it possible to provide cloud applications with high-performance data security services. Unfortunately, recent studies have shown that GPU-based applications are also susceptible to side-channel attacks. These published works studied the side-channel vulnerabilities of GPU-based AES implementations by taking advantage of the cache sharing among multiple threads or the high parallelism of GPUs. Therefore, for GPU-based bitsliced cryptographic implementations, which are immune to the cache-based attacks referred to above, only a power analysis method based on the high parallelism of GPUs may be effective. However, the leakage model used in that power analysis is not efficient in practice. In light of this, we investigate the electromagnetic (EM) side-channel vulnerabilities of a GPU-based bitsliced AES implementation from the perspectives of bit-level parallelism and thread-level parallelism, in order to make the best use of the localization effect of EM leakage under parallelism. Specifically, we propose efficient multi-bit and multi-thread combinational analysis techniques based on the intrinsic properties of bitsliced ciphers and on the effect of the multi-thread parallelism of GPUs, respectively. The experimental results show that the proposed combinational analysis methods perform better than non-combinational and intuitive ones. Our research suggests that multi-thread leakages can be used to improve attacks if they are not synchronous in the time domain.
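The bit-level parallelism of a bitsliced cipher can be sketched as follows. The code transposes a batch of bytes into eight bit-planes and applies a key XOR to all lanes at once; Python integers stand in for machine registers, and the function names are hypothetical. It illustrates the data layout only and is not the AES implementation or the attack studied in the paper.

```python
def bitslice(byte_values):
    """Transpose bytes into 8 bit-planes; bit t of plane b is bit b of byte t."""
    planes = [0] * 8
    for t, value in enumerate(byte_values):
        for b in range(8):
            planes[b] |= ((value >> b) & 1) << t
    return planes


def unbitslice(planes, count):
    """Inverse of bitslice for the first `count` lanes."""
    return [
        sum(((planes[b] >> t) & 1) << b for b in range(8))
        for t in range(count)
    ]


def add_round_key_bitsliced(planes, key_byte, count):
    """XOR the same key byte into all `count` lanes, one plane at a time."""
    all_lanes = (1 << count) - 1
    return [
        plane ^ (all_lanes if (key_byte >> b) & 1 else 0)
        for b, plane in enumerate(planes)
    ]


data = list(range(64))
planes = bitslice(data)
keyed = unbitslice(add_round_key_bitsliced(planes, 0x5A, len(data)), len(data))
```

Since every step is a bitwise operation on whole planes, no data-dependent table lookup occurs, which is the property behind the immunity to cache-based attacks noted in the abstract.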