942 articles found
Volumetric lattice Boltzmann method for pore-scale mass diffusion-advection process in geopolymer porous structures (cited by: 1)
1
Authors: Xiaoyu Zhang, Zirui Mao, Floyd W. Hilty, Yulan Li, Agnes Grandjean, Robert Montgomery, Hans-Conrad zur Loye, Huidan Yu, Shenyang Hu. Source: Journal of Rock Mechanics and Geotechnical Engineering (SCIE, CSCD), 2024, No. 6, pp. 2126-2136 (11 pages)
Porous materials present significant advantages for absorbing radioactive isotopes in nuclear waste streams. To improve absorption efficiency in nuclear waste treatment, a thorough understanding of the diffusion-advection process within porous structures is essential for material design. In this study, we present advancements in the volumetric lattice Boltzmann method (VLBM) for modeling and simulating pore-scale diffusion-advection of radioactive isotopes within geopolymer porous structures. These structures are created using the phase field method (PFM) to precisely control pore architectures. In our VLBM approach, we introduce a concentration field of an isotope seamlessly coupled with the velocity field and solve it by the time evolution of its particle population function. To address the computational intensity inherent in the coupled lattice Boltzmann equations for velocity and concentration fields, we implement graphics processing unit (GPU) parallelization. Validation of the developed model involves examining the flow and diffusion fields in porous structures. Remarkably, good agreement is observed for both the velocity field from VLBM and multiphysics object-oriented simulation environment (MOOSE), and the concentration field from VLBM and the finite difference method (FDM). Furthermore, we investigate the effects of background flow, species diffusivity, and porosity on the diffusion-advection behavior by varying the background flow velocity, diffusion coefficient, and pore volume fraction, respectively. Notably, all three parameters exert an influence on the diffusion-advection process. Increased background flow and diffusivity markedly accelerate the process due to increased advection intensity and enhanced diffusion capability, respectively. Conversely, increasing the porosity has a less significant effect, causing a slight slowdown of the diffusion-advection process due to the expanded pore volume. This comprehensive parametric study provides valuable insights into the kinetics of isotope uptake in porous structures, facilitating the development of porous materials for nuclear waste treatment applications.
Keywords: Volumetric lattice Boltzmann method (VLBM); Phase field method (PFM); Pore-scale diffusion-advection; Nuclear waste treatment; Porous media flow; graphics processing unit (GPU) parallelization
Optimization of a precise integration method for seismic modeling based on graphic processing unit (cited by: 2)
2
Authors: Jingyu Li, Genyang Tang, Tianyue Hu. Source: Earthquake Science (CSCD), 2010, No. 4, pp. 387-393 (7 pages)
General purpose graphic processing unit (GPU) calculation technology is increasingly used in various fields. Its single instruction, multiple threads mode is well suited to seismic numerical simulation, which involves a huge quantity of data and calculation steps. In this study, we introduce a GPU-based parallel calculation method of a precise integration method (PIM) for seismic forward modeling. Compared with CPU single-core calculation, GPU parallel calculation fully keeps the features of PIM, which has small bandwidth, high accuracy and capability of modeling complex substructures, and GPU calculation brings high computational efficiency, which means that high-performing GPU parallel calculation can make seismic forward modeling closer to real seismic records.
Keywords: precise integration method; seismic modeling; general purpose GPU; graphic processing unit
TIME-DOMAIN INTERPOLATION ON GRAPHICS PROCESSING UNIT (cited by: 1)
3
Authors: XIQI LI, GUOHUA SHI, YUDONG ZHANG. Source: Journal of Innovative Optical Health Sciences (SCIE, EI, CAS), 2011, No. 1, pp. 89-95 (7 pages)
The signal processing speed of spectral domain optical coherence tomography (SD-OCT) has become a bottleneck in many medical applications. Recently, a time-domain interpolation method was proposed. This method achieves a better signal-to-noise ratio (SNR) with much-reduced signal processing time in SD-OCT data processing compared with the commonly used zero-padding interpolation method. Additionally, the resampled data can be obtained from a few data points and coefficients in the cutoff window. Thus, many interpolations can be performed simultaneously, which makes this interpolation method suitable for parallel computing. By using a graphics processing unit (GPU) and the compute unified device architecture (CUDA) programming model, time-domain interpolation can be accelerated significantly. The computing capability reaches more than 250,000 A-lines, 200,000 A-lines, and 160,000 A-lines per second for 2,048-pixel OCT when the cutoff length is L=11, L=21, and L=31, respectively. A frame of SD-OCT data (400 A-lines × 2,048 pixels per line) is acquired and processed on the GPU in real time. The results show that signal processing of SD-OCT can be finished in 6.223 ms when the cutoff length is L=21, which is much faster than on a central processing unit (CPU). Real-time signal processing of acquired data can be realized.
Keywords: Optical coherence tomography; real-time signal processing; graphics processing unit; GPU; CUDA
The inversion of density structure by graphic processing unit (GPU) and identification of igneous rocks in Xisha area (cited by: 1)
4
Authors: Lei Yu, Jian Zhang, Wei Lin, Rongqiang Wei, Shiguo Wu. Source: Earthquake Science, 2014, No. 1, pp. 117-125 (9 pages)
Organic reefs, the targets of deep-water petroleum exploration, developed widely in Xisha area. However, there are concealed igneous rocks undersea, to which organic rocks have nearly equal wave impedance. The igneous rocks therefore interfere with future exploration because they have similar seismic reflection characteristics. Yet the density and magnetism of organic reefs are very different from those of igneous rocks, so gravity and magnetic data have obvious advantages for distinguishing organic reefs from igneous rocks. First, frequency decomposition was applied to the free-air gravity anomaly in Xisha area to obtain the 2D subdivision of the gravity anomaly and magnetic anomaly in the vertical direction. Thus, the distribution of igneous rocks in the horizontal direction can be acquired according to the high-frequency field, the low-frequency field, and its physical properties. Then, 3D forward modeling of the gravitational field was carried out to establish the density model of this area by reference to physical properties of rocks based on former research. Furthermore, 3D inversion of the gravity anomaly by a genetic algorithm method with graphic processing unit (GPU) parallel processing was applied in the Xisha target area, and the 3D density structure of this area was obtained. In this way, the igneous rocks can be confined to a certain depth according to their density. The frequency decomposition and 3D inversion of the gravity anomaly by a genetic algorithm method with GPU parallel processing proved to be a useful method for locating igneous rocks in their 3D geological position. Organic reefs and igneous rocks can thus be identified, which provides prescient information for further exploration.
Keywords: Xisha area; Organic reefs and igneous rocks; Frequency decomposition of potential field; 3D inversion of the graphic processing unit (GPU) parallel processing
Graphical Processing Unit Based Time-Parallel Numerical Method for Ordinary Differential Equations (cited by: 1)
5
Authors: Sumathi Lakshmiranganatha, Suresh S. Muknahallipatna. Source: Journal of Computer and Communications, 2020, No. 2, pp. 39-63 (25 pages)
On-line transient stability analysis of a power grid is crucial in determining whether the power grid will traverse to a steady-state stable operating point after a disturbance. The transient stability analysis involves computing, in near real-time, the solutions of the algebraic equations modeling the grid network and the ordinary differential equations modeling the dynamics of the electrical components of the grid, such as synchronous generators, exciters, and governors. In this research, we investigate the use of a time-parallel approach, in particular an implementation of the Parareal algorithm on a Graphical Processing Unit using Compute Unified Device Architecture, to compute solutions of ordinary differential equations. The numerical solution accuracy and computation time of the Parareal algorithm executing on the GPU are demonstrated on the single machine infinite bus test system. Two types of dynamic model of the single synchronous generator, namely the classical and detailed models, are studied. The numerical solutions of the ordinary differential equations computed by the Parareal algorithm are compared to those computed using the modified Euler's method, demonstrating the accuracy of the Parareal algorithm executing on GPU. Simulations are performed with varying numerical integration time steps, and the suitability of the Parareal algorithm for computing near real-time solutions of ordinary differential equations is presented. Speedups of 25× and 31× are achieved with the Parareal algorithm for the classical and detailed dynamic models of the synchronous generator, respectively, compared to the sequential modified Euler's method. The weak scaling efficiency of the Parareal algorithm when required to solve a large number of ordinary differential equations at each time step, due to the increase in sequential computations and associated memory transfer latency between the CPU and GPU, is discussed.
Keywords: Time-Parallel; Differential Equation; Numerical Integration; graphic processing unit
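The Parareal idea named in the abstract above can be illustrated with a small sketch. This is not the authors' GPU implementation: it is a plain serial Python illustration for a scalar ODE, and the function names, the window count, and the step counts are all assumptions made for the example. Both the coarse and the fine propagator use the modified Euler (Heun) predictor-corrector step; the per-window fine solves are the part a GPU would run concurrently.

```python
def heun_step(f, t, y, h):
    """One modified Euler (Heun) predictor-corrector step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    return y + 0.5 * h * (k1 + k2)


def propagate(f, t0, y0, t1, steps):
    """Integrate from t0 to t1 using `steps` Heun steps (scalar state)."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y = heun_step(f, t, y, h)
        t += h
    return y


def parareal(f, t0, t_end, y0, windows, iterations,
             coarse_steps=1, fine_steps=50):
    """Return the Parareal estimates at the window boundaries (hypothetical helper)."""
    ts = [t0 + (t_end - t0) * n / windows for n in range(windows + 1)]

    def coarse(n, y):
        return propagate(f, ts[n], y, ts[n + 1], coarse_steps)

    def fine(n, y):
        return propagate(f, ts[n], y, ts[n + 1], fine_steps)

    # Initial guess: one cheap sequential sweep with the coarse propagator.
    u = [y0]
    for n in range(windows):
        u.append(coarse(n, u[n]))

    for _ in range(iterations):
        # Fine solves are independent per window: the parallelizable part.
        fine_old = [fine(n, u[n]) for n in range(windows)]
        coarse_old = [coarse(n, u[n]) for n in range(windows)]
        # Sequential correction sweep: U_{n+1} = G(U_n_new) + F(U_n_old) - G(U_n_old).
        new_u = [y0]
        for n in range(windows):
            new_u.append(coarse(n, new_u[n]) + fine_old[n] - coarse_old[n])
        u = new_u
    return u
```

With as many iterations as windows, the result matches the sequential fine solution up to rounding, so any practical speedup depends on converging in far fewer iterations than windows.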
Simulation of fluid-structure interaction in a microchannel using the lattice Boltzmann method and size-dependent beam element on a graphics processing unit
6
Authors: Vahid Esfahanian, Esmaeil Dehdashti, Amir Mehdi Dehrouye-Semnani. Source: Chinese Physics B (SCIE, EI, CAS, CSCD), 2014, No. 8, pp. 389-395 (7 pages)
Fluid-structure interaction (FSI) problems in microchannels play a prominent role in many engineering applications. The present study is an effort toward the simulation of flow in a microchannel considering FSI. The bottom boundary of the microchannel is simulated by size-dependent beam elements for the finite element method (FEM) based on a modified couple stress theory. The lattice Boltzmann method (LBM) using the D2Q13 LB model is coupled to the FEM in order to solve the fluid part of the FSI problem. Because the LBM generally needs only nearest-neighbor information, the algorithm is an ideal candidate for parallel computing. The simulations are carried out on graphics processing units (GPUs) using compute unified device architecture (CUDA). In the present study, the governing equations are non-dimensionalized and the set of dimensionless groups is exhibited to show their effects on micro-beam displacement. The numerical results show that the displacements of the micro-beam predicted by the size-dependent beam element are smaller than those by the classical beam element.
Keywords: fluid-structure interaction; graphics processing unit; lattice Boltzmann method; size-dependent beam element
Multi-relaxation-time lattice Boltzmann simulations of lid driven flows using graphics processing unit
7
Authors: Chenggong LI, J. P. Y. MAA. Source: Applied Mathematics and Mechanics (English Edition) (SCIE, EI, CSCD), 2017, No. 5, pp. 707-722 (16 pages)
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine velocity components (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computation efficiency of LBM on the numerical simulations of turbulent flows, the massively parallel computing power from a graphic processing unit (GPU) with a computing unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well, compared with the results from others, with an increase of 76 times in computation efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be, if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
Keywords: large eddy simulation (LES); multi-relaxation-time (MRT); lattice Boltzmann equation (LBE); two-dimensional nine velocity components (D2Q9); Smagorinsky model; graphic processing unit (GPU); computing unified device architecture (CUDA)
Compute Unified Device Architecture Implementation of Euler/Navier-Stokes Solver on Graphics Processing Unit Desktop Platform for 2-D Compressible Flows
8
Authors: Zhang Jiale, Chen Hongquan. Source: Transactions of Nanjing University of Aeronautics and Astronautics (EI, CSCD), 2016, No. 5, pp. 536-545 (10 pages)
A personal desktop platform with teraflops peak performance and thousands of cores can be realized at the price of a conventional workstation using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows by using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The techniques of implementing CUDA kernels, the double-layered thread hierarchy, and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
Keywords: graphics processing unit (GPU); GPU parallel computing; compute unified device architecture (CUDA) Fortran; finite volume method (FVM); acceleration
Graphic Processing Unit-Accelerated Neural Network Model for Biological Species Recognition
9
Authors: 温程璐, 潘伟, 陈晓熹, 祝青园. Source: Journal of Donghua University (English Edition) (EI, CAS), 2012, No. 1, pp. 5-8 (4 pages)
A graphic processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural networks with small inputs. The whole image is considered as the input of the neural network, so the maximal features can be kept for recognition. To speed up the recognition process of the neural network, a fast implementation of the partially connected neural network was conducted on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and the counterpart serial CPU implementation. The experimental results showed that the GPU implementation works effectively in both recognition rate and speed, and gained a speedup of 343 over its counterpart CPU implementation. Compared with a feature-based recognition method on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on eight biological species.
Keywords: graphic processing unit (GPU); compute unified device architecture (CUDA); neural network; species recognition
Graphic Processing Unit-Accelerated Mutual Information-Based 3D Image Rigid Registration
10
Authors: 李冠华, 欧宗瑛, 苏铁明, 韩军. Source: Transactions of Tianjin University (EI, CAS), 2009, No. 5, pp. 375-380 (6 pages)
Mutual information (MI)-based image registration is effective in registering medical images, but it is computationally expensive. This paper accelerates MI-based image registration by dividing the computation of mutual information into spatial transformation and histogram-based calculation, and performing 3D spatial transformation and trilinear interpolation on a graphic processing unit (GPU). The 3D floating image is downloaded to the GPU as a flat 3D texture, and then fetched and interpolated for each new voxel location in a fragment shader. The transformed results are rendered to textures by using the frame buffer object (FBO) extension, and then read back to the main memory for the remaining computation on the CPU. Experimental results show that the GPU-accelerated method can achieve a speedup of about an order of magnitude with a better registration result compared with the software implementation on a single-core CPU.
Keywords: graphic processing unit; 3D image; registration; speedup; mutual information; basis; rigid; linear interpolation
Parallel Image Processing: Taking Grayscale Conversion Using OpenMP as an Example
11
Authors: Bayan AlHumaidan, Shahad Alghofaily, Maitha Al Qhahtani, Sara Oudah, Naya Nagy. Source: Journal of Computer and Communications, 2024, No. 2, pp. 1-10 (10 pages)
In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP's effectiveness in accelerating image manipulation tasks.
Keywords: Parallel Computing; Image Processing; OpenMP; Parallel Programming; High Performance Computing; GPU (graphic processing unit)
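The row-wise workload distribution described in the abstract above can be sketched as follows. The paper uses OpenMP; this sketch swaps in Python's `concurrent.futures` thread pool purely to show the decomposition, and because of the interpreter lock it should not be expected to reproduce the reported speedups. The luminance weights (0.299, 0.587, 0.114, the ITU-R BT.601 values) and all function names are assumptions, since the abstract does not state which formula was used.

```python
from concurrent.futures import ThreadPoolExecutor


def to_gray(pixel):
    """Weighted luminance of one (r, g, b) pixel; BT.601 weights are assumed."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def gray_rows(image, start, stop):
    """Convert rows [start, stop) of an image stored as rows of (r, g, b) tuples."""
    return [[to_gray(p) for p in row] for row in image[start:stop]]


def grayscale_parallel(image, workers=4):
    """Split the rows into contiguous chunks, one chunk per worker task."""
    rows = len(image)
    chunk = max(1, -(-rows // workers))  # ceiling division
    bounds = [(s, min(s + chunk, rows)) for s in range(0, rows, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of `bounds`, so rows stay in order.
        parts = list(pool.map(lambda b: gray_rows(image, b[0], b[1]), bounds))
    return [row for part in parts for row in part]
```

Contiguous row chunks mirror a static loop schedule: every pixel costs the same, so equal-sized chunks keep the load balanced.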
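Real-time generation method for target images under multi-light-source illumination
12
Authors: 张玉双, 谢晓钢, 苏华, 王锐, 张飞舟. Source: 强激光与粒子束 (CAS, CSCD, PKU Core), 2024, No. 6, pp. 41-47 (7 pages)
Owing to constraints such as geographic location, the sun, and the atmospheric environment, actual images of space targets cannot be obtained under all attitudes and illumination conditions, in particular under the combined action of laser, sunlight, and background light. A real-time method for generating target images under multi-light-source illumination is proposed. Based on the idea of texture mapping from computer graphics, the method uses modern graphics-card programming techniques and frame buffer object features, and implements efficient computation of target brightness values and realism enhancement under multiple light sources on the GPU (Graphics Processing Unit) side using a shader language. It adopts the open-source 3D graphics engine OSG (OpenSceneGraph) to support 3D model files in multiple formats and to improve compatibility with the domestic Kylin operating system and commonly used battlefield situation display software. Simulation experiments verify the effectiveness and superiority of the method.
Keywords: multiple light sources; image generation; GPU programming; OSG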
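NM-SpMM: A semi-structured sparse matrix multiplication algorithm for domestic heterogeneous vector processors
13
Authors: 姜晶菲, 何源宏, 许金伟, 许诗瑶, 钱希福. Source: 计算机工程与科学 (CSCD, PKU Core), 2024, No. 7, pp. 1141-1150 (10 pages)
Deep neural networks have achieved excellent results in natural language processing, computer vision, and other fields. As the scale of data processed by intelligent applications grows and large models develop rapidly, the demands on the inference performance of deep neural networks keep rising, and N:M semi-structured sparsification has become one of the key techniques for balancing compute demand against application quality. The domestic heterogeneous vector processor FT-M7032 offers considerable room for exploiting data parallelism and instruction parallelism in intelligent model processing. To handle the diversity of sparsity patterns in N:M semi-structured sparse model computation, a flexibly configurable sparse matrix multiplication algorithm for FT-M7032, NM-SpMM, is proposed. NM-SpMM designs an efficient compressed offset address sparse encoding format, COA, which prevents the semi-structured parameter configuration from affecting memory access and computation on sparse data. Based on COA encoding, NM-SpMM applies fine-grained optimizations to sparse matrix computations of different dimensions. Experimental results on a single FT-M7032 core show that NM-SpMM achieves a speedup of 1.73 to 21.00 times over dense matrix multiplication, and a speedup of 0.04 to 1.04 times over an NVIDIA V100 GPU using the cuSPARSE sparse computing library.
Keywords: deep neural network; graphics processing unit; vector processor; sparse matrix multiplication; pipeline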
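For readers unfamiliar with N:M semi-structured sparsity, the following sketch shows the general idea: within every group of M consecutive weights, only N are kept, and each kept value is stored with its offset inside the group. This is a generic illustration and not the COA format from the paper, whose layout the abstract does not describe. The magnitude-based selection rule and the function names are assumptions.

```python
def prune_n_m(row, n, m):
    """Keep the n largest-magnitude entries in each group of m weights.

    Returns (values, offsets); offsets are positions inside each group.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    values, offsets = [], []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Rank positions by magnitude (ties keep the lower index), take n,
        # then restore positional order so offsets are ascending.
        ranked = sorted(range(m), key=lambda j: -abs(group[j]))[:n]
        for j in sorted(ranked):
            values.append(group[j])
            offsets.append(j)
    return values, offsets


def sparse_dot(values, offsets, x, n, m):
    """Dot product of the pruned row with a dense vector x."""
    total = 0
    for k, (v, off) in enumerate(zip(values, offsets)):
        group = k // n  # every group contributes exactly n stored values
        total += v * x[group * m + off]
    return total
```

Because every group stores exactly N values, the position of any stored value is computable from its index, which is what makes the pattern regular enough for vector hardware.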
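GPU-integrated method for computing quasi-monolayer covering approximation sets
14
Authors: 吴正江, 吕成功, 王梦松. Source: 计算机工程 (CAS, CSCD, PKU Core), 2024, No. 5, pp. 71-82 (12 pages)
The quasi-monolayer covering rough set is a rough set model that matches set-valued information systems and offers high quality and high efficiency. Computing quasi-monolayer covering approximation sets involves a large number of computationally intensive but logically simple operations. A matrix representation of quasi-monolayer covering approximation sets is therefore proposed so that the strong computing performance of the graphics processing unit (GPU) can accelerate the computation. To this end, Boolean matrices are used to represent the elements of the quasi-monolayer covering approximation space, Boolean matrix operators corresponding to set operations are introduced, matrix representations of the quasi-monolayer covering rough approximation sets (DE, DA, DE0, and DA0) are proposed, and a matrix-based quasi-monolayer covering approximation set algorithm (M_SMC) is designed. Corresponding theorems prove that the matrix representation of quasi-monolayer covering approximation sets is equivalent to the original definition. However, M_SMC consumes excessive memory in its matrix storage and computation steps. To deploy the algorithm on GPUs with limited video memory, the matrix storage and computation steps are optimized, and a batched matrix-based quasi-monolayer covering approximation set algorithm (BM_SMC) is proposed. Experimental results on 10 datasets show that the GPU-integrated BM_SMC algorithm improves computational efficiency by 2.16 to 11.3 times over BM_SMC running on the central processing unit (CPU) alone. BM_SMC can make full use of the GPU under limited storage and effectively improves the efficiency of computing quasi-monolayer covering approximation sets.
Keywords: quasi-monolayer covering approximation set; set-valued information system; matrix representation; GPU acceleration; batch processing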
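TEB: An efficient SpMV storage format based on matrix decomposition and reconstruction on GPU
15
Authors: 王宇华, 张宇琪, 何俊飞, 徐悦竹, 崔环宇. Source: 计算机科学与探索 (CSCD, PKU Core), 2024, No. 4, pp. 1094-1108 (15 pages)
Sparse matrix-vector multiplication (SpMV) is a crucial computation in science and engineering. The CSR (compressed sparse row) format is one of the most commonly used sparse matrix storage formats. When implementing parallel SpMV on graphics processing unit (GPU) platforms, CSR stores only the nonzero elements of the sparse matrix, which avoids the redundant computation caused by zero padding and saves storage space, but it suffers from load imbalance, which wastes computing resources. To address this problem, storage formats that have performed well in recent years are studied, and a row-by-row decomposition and reorganization storage format, TEB (threshold-exchange order block), is proposed. The format uses a heuristic threshold selection algorithm to determine a suitable splitting threshold and combines it with a reordering-based row merging algorithm to reconstruct and decompose the sparse matrix so that the numbers of nonzero elements in the blocks are as close as possible. Then, in combination with CUDA (compute unified device architecture) thread techniques, an inter-sub-block parallel SpMV algorithm based on the TEB storage format is proposed, which allocates computing resources reasonably and resolves the load imbalance, thereby improving the efficiency of parallel SpMV. To verify the effectiveness of the TEB format, experiments were conducted on an NVIDIA Tesla V100 platform. The results show that, compared with the PBC (partition-block-CSR), AMF-CSR (adaptive multi-row folding of CSR), CSR-Scalar (compressed sparse row-scalar), and CSR5 (compressed sparse row 5) storage formats, TEB improves SpMV time performance by 3.23, 5.83, 2.33, and 2.21 times on average, respectively, and floating-point performance by 3.36, 5.95, 2.29, and 2.13 times on average, respectively.
Keywords: sparse matrix-vector multiplication (SpMV); reordering; CSR format; load balancing; storage format; graphics processing unit (GPU)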
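The load imbalance that the abstract above attributes to CSR can be seen in a small sketch of the baseline format. This shows plain CSR with one row per unit of work, in the spirit of a one-thread-per-row kernel; it is not the TEB format, and the helper names are assumptions for illustration.

```python
def dense_to_csr(dense):
    """Build CSR arrays (values, col_idx, row_ptr) from a dense row-major matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x; each outer iteration is the work one thread would do for one row."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def row_nnz(row_ptr):
    """Nonzeros per row; a wide spread here means unbalanced per-row work."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
```

When `row_nnz` varies widely, rows with few nonzeros finish early and sit idle while long rows are still being processed, which is the imbalance that block-based regrouping aims to even out.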
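GNNSched: A GPU-oriented task scheduling framework for graph neural network inference (cited by: 1)
16
Authors: 孙庆骁, 刘轶, 杨海龙, 王一晴, 贾婕, 栾钟治, 钱德沛. Source: 计算机工程与科学 (CSCD, PKU Core), 2024, No. 1, pp. 1-11 (11 pages)
Because of frequent GPU memory accesses, graph neural networks (GNNs) often have low resource utilization when running on GPUs. Existing inference frameworks do not account for the irregularity of GNN inputs, so applying them directly to co-locate GNN inference tasks may exceed GPU memory capacity and cause tasks to fail. For GNN inference tasks, the memory footprint of concurrent tasks must be analyzed in advance based on their input characteristics to ensure that the concurrent tasks can be co-located on the GPU successfully. In addition, inference tasks submitted in multi-tenant scenarios urgently need flexible scheduling strategies to meet the quality-of-service requirements of concurrent inference tasks. To solve these problems, GNNSched is proposed, which efficiently manages the co-located execution of GNN inference tasks on GPUs. Specifically, GNNSched organizes concurrent inference tasks into a queue and estimates the memory footprint of each task at operator granularity using a cost function. GNNSched implements multiple scheduling strategies to generate task groups, which are iteratively submitted to the GPU for concurrent execution. Experimental results show that GNNSched can meet the quality-of-service requirements of concurrent GNN inference tasks and reduce the response latency of inference tasks.
Keywords: graph neural network; graphics processing unit; inference framework; task scheduling; estimation model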
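Design of an embedded real-time ship detection system for wide-swath SAR images (cited by: 2)
17
Authors: 陆天宇, 徐湛, 崔红元, 龚昊, 王琤. Source: 计算机工程与应用 (CSCD, PKU Core), 2024, No. 1, pp. 301-309 (9 pages)
Real-time ship detection is needed for the wide-swath images produced by real-time imaging with spaceborne or airborne high-resolution synthetic aperture radar (SAR), but traditional embedded systems based on FPGA+DSP can hardly perform both SAR imaging processing and AI-based real-time ship detection in wide-swath SAR images. An embedded real-time ship detection system for wide-swath SAR images based on a 3U VPX FPGA+GPU architecture is therefore designed. A ship detection model based on YOLOv5s is proposed and made lightweight using a scaling-factor control method based on an L2-norm sparsity penalty; the lightweight ship detection model has 47.39% fewer parameters and 18.67% less computation, with an average detection precision of 0.968. The lightweight model is applied to the embedded real-time ship detection system, and, for the typical application scenario of 10 km × 10 km wide-swath images, system software is designed and developed based on multithreading and GPU-based many-core parallel computing. Functional verification and performance evaluation on public SAR datasets show that the system can meet the requirements of real-time ship detection in wide-swath SAR images of different resolutions.
Keywords: synthetic aperture radar (SAR); YOLOv5s; lightweight; graphics processing unit (GPU); real-time ship detection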
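A survey of GPU acceleration techniques for deep learning in privacy-preserving computing environments
18
Authors: 秦智翔, 杨洪伟, 郝萌, 何慧, 张伟哲. Source: 信息安全研究 (CSCD, PKU Core), 2024, No. 7, pp. 586-593 (8 pages)
As deep learning technology continues to develop, the training time of neural network models keeps growing, and using GPU computing to accelerate neural network training has become a key technique. In addition, the importance of data privacy has driven the development of privacy-preserving computing. This survey first introduces the concepts of deep learning and GPU computing, along with two privacy-preserving computing techniques, secure multi-party computation and homomorphic encryption, and then discusses GPU acceleration techniques for deep learning in plaintext and privacy-preserving computing environments. For the plaintext environment, it introduces two basic parallel training modes for deep learning, data parallelism and model parallelism, analyzes two different memory optimization techniques, recomputation and GPU memory swapping, and introduces gradient compression in distributed neural network training. For the privacy-preserving computing environment, it introduces GPU acceleration techniques for deep learning in two scenarios, secure multi-party computation and homomorphic encryption. Finally, it briefly analyzes the similarities and differences between GPU-accelerated deep learning methods in the two environments.
Keywords: deep learning; GPU computing; privacy-preserving computing; secure multi-party computation; homomorphic encryption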
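A survey of thread synchronization in GPU parallel programming
19
Authors: 高岚, 赵雨晨, 张伟功, 王晶, 钱德沛. Source: 软件学报 (EI, CSCD, PKU Core), 2024, No. 2, pp. 1028-1047 (20 pages)
Parallel computing has become the mainstream trend. In parallel computing systems, synchronization is one of the key designs and is crucial to fully exploiting hardware performance. In recent years, the GPU (graphic processing unit), as the most widely used accelerator, has developed rapidly, and many applications place higher demands on GPU thread synchronization. However, existing GPU systems struggle to efficiently support the complex thread synchronization found in real applications. Although researchers have proposed many methods for supporting GPU thread synchronization and made considerable progress, the unique architecture and parallel model of GPUs mean that research on GPU thread synchronization still faces many challenges. This survey classifies thread synchronization in GPU parallel programming according to synchronization purpose and granularity. On this basis, focusing on the expression and execution of GPU thread synchronization, it first analyzes and summarizes the key problems and challenges: synchronization is hard to express efficiently, error-prone, and inefficient to execute. Then, by synchronization granularity, it reviews recent academic and industrial research on competitive and cooperative GPU thread synchronization from two angles, methods for expressing synchronization and methods for optimizing performance, and analyzes and summarizes existing approaches. Finally, it points out future research trends and prospects for GPU thread synchronization and suggests possible research directions, providing a reference for researchers in the field.
Keywords: general-purpose graphics processing unit (GPGPU); parallel programming; thread synchronization; performance optimization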
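A multi-frequency processing method for underwater target transfer functions based on the graphics processing unit
20
Authors: 钱浩然, 王斌. Source: 舰船科学技术 (PKU Core), 2024, No. 14, pp. 153-157 (5 pages)
To increase the speed of computing broadband echoes from underwater targets, this paper proposes a GPU-based solution for fast multi-frequency computation of the scattering transfer function. In contrast to traditional algorithms that compute one frequency point at a time, the fast CUDA algorithm exploits the relative independence of target strength at each frequency point and, based on the hardware characteristics of the GPU, computes the scattered sound field across the band simultaneously, which significantly improves computational efficiency. Taking a submersible vehicle model as an example, the computation speed of the target scattering transfer function is compared for models with different numbers of mesh elements. Simulation results show that, compared with traditional serial CPU computation, the fast CUDA algorithm achieves a speedup of more than 80, effectively increasing computation speed.
Keywords: planar element method; graphics processing unit; compute unified device architecture; parallel computing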