3,751 articles found
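Research on a Cardiac-Assisted Diagnosis System Based on GPU Visualization Technology
1
Authors: 陈宇珂, 吴效明, 杨荣骞, 欧陕兴, 郑理华. 《医疗卫生装备》, CAS, 2011, No. 10, pp. 16-18 (3 pages)
Objective: To achieve accurate segmentation and three-dimensional (3D) visualization of cardiac tomographic images on GPUs, and to complete the design of a cardiac-assisted diagnosis system. Methods: Clinical experts' diagnostic experience, prior features of cardiac CT images, and image segmentation algorithm models were combined, and GPU parallel data processing was used to segment cardiac structures and visualize them in 3D. Results: Accurate, fast, and robust segmentation and 3D visualization of cardiac CT image sequences were achieved, and a preliminary cardiac-assisted diagnosis system based on GPU visualization was implemented. Conclusion: The study makes full use of the parallel computing power of the graphics processing unit (GPU) to solve problems in medical image processing and segmentation, improving program run-time efficiency and the user experience.
Keywords: expert system; heart; dual-source CT; CUDA; GPUs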
Efficient Concurrent L1-Minimization Solvers on GPUs (Cited by: 1)
2
Authors: Xinyue Chu, Jiaquan Gao, Bo Sheng. 《Computer Systems Science & Engineering》, SCIE, EI, 2021, No. 9, pp. 305-320 (16 pages)
Given that the concurrent L1-minimization (L1-min) problem is often required in some real applications, we investigate how to solve it in parallel on GPUs in this paper. First, we propose a novel self-adaptive warp implementation of the matrix-vector multiplication (Ax) and a novel self-adaptive thread implementation of the matrix-vector multiplication (A^T x), respectively, on the GPU. The vector-operation and inner-product decision trees are adopted to choose the optimal vector-operation and inner-product kernels for vectors of any size. Second, based on the above proposed kernels, the iterative shrinkage-thresholding algorithm is utilized to present two concurrent L1-min solvers from the perspective of the streams and the thread blocks on a GPU, and their performance is optimized by using newer GPU features such as the shuffle instruction and the read-only data cache. Finally, we design a concurrent L1-min solver on multiple GPUs. The experimental results have validated the high effectiveness and good performance of our proposed methods.
Keywords: concurrent L1-minimization problem; dense matrix-vector multiplication; fast iterative shrinkage-thresholding algorithm; CUDA; GPUs
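As a rough illustration of the iterative shrinkage-thresholding algorithm named in the abstract above, the following pure-Python sketch runs plain ISTA on the CPU for the objective 0.5*||Ax - b||^2 + lam*||x||_1. It is not the authors' GPU solver: the function names (`soft_threshold`, `matvec`, `matvec_t`, `ista`), the fixed step size, and the dense list-of-lists matrix are all assumptions made for this example. The two products Ax and A^T x that the paper accelerates appear here as `matvec` and `matvec_t`.

```python
def soft_threshold(v, t):
    """Shrink every entry of v toward zero by t (proximal operator of t*||.||_1)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(A, x):
    """Dense product A x, with A stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matvec_t(A, y):
    """Dense product A^T y without forming the transpose."""
    n_cols = len(A[0])
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(n_cols)]


def ista(A, b, lam, step, iters=200):
    """Plain ISTA with a fixed step size (hypothetical helper, CPU only)."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        grad = matvec_t(A, residual)  # gradient of 0.5*||Ax - b||^2
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], lam * step)
    return x


# With A = I the minimiser is simply the soft-thresholded right-hand side.
print(ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0, iters=5))
```

For convergence in general, the step size should not exceed the reciprocal of the largest eigenvalue of A^T A; the example uses the identity matrix so that a step of 1.0 is safe.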
Accelerating the discontinuous Galerkin method for seismic wave propagation simulations using multiple GPUs with CUDA and MPI (Cited by: 3)
3
Authors: Dawei Mu, Po Chen, Liqiang Wang. 《Earthquake Science》, 2013, No. 6, pp. 377-393 (17 pages)
We have successfully ported an arbitrary high-order discontinuous Galerkin method for solving the three-dimensional isotropic elastic wave equation on unstructured tetrahedral meshes to multiple Graphics Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) of NVIDIA and Message Passing Interface (MPI), and obtained a speedup factor of about 28.3 for the single-precision version of our codes and a speedup factor of about 14.9 for the double-precision version. The GPU used in the comparisons is NVIDIA Tesla C2070 Fermi, and the CPU used is Intel Xeon W5660. To effectively overlap inter-process communication with computation, we separate the elements on each subdomain into inner and outer elements, and complete the computation on outer elements and fill the MPI buffer first. While the MPI messages travel across the network, the GPU performs computation on inner elements and all other calculations that do not use information of outer elements from neighboring subdomains. A significant portion of the speedup also comes from a customized matrix-matrix multiplication kernel, which is used extensively throughout our program. Preliminary performance analysis on our parallel GPU codes shows favorable strong and weak scalabilities.
Keywords: seismic wave propagation; discontinuous Galerkin method; GPU
An Approach to Parallelization of SIFT Algorithm on GPUs for Real-Time Applications (Cited by: 4)
4
Authors: Raghu Raj Prasanna Kumar, Suresh Muknahallipatna, John McInroy. 《Journal of Computer and Communications》, 2016, No. 17, pp. 18-50 (33 pages)
The Scale Invariant Feature Transform (SIFT) algorithm is a widely used computer vision algorithm that detects and extracts local feature descriptors from images. SIFT is computationally intensive, making it infeasible for a single-threaded implementation to extract local feature descriptors for high-resolution images in real time. In this paper, an approach to parallelization of the SIFT algorithm is demonstrated using NVIDIA's Graphics Processing Unit (GPU). The parallelization design for SIFT on GPUs is divided into two stages: (a) algorithm design, consisting of generic design strategies that focus on data, and (b) implementation design, consisting of architecture-specific design strategies that focus on optimally using GPU resources for maximum occupancy. Increasing memory latency hiding, eliminating branches, and data blocking achieve a significant decrease in average computational time. Furthermore, it is observed via Paraver tools that our approach to parallelization, while optimizing for maximum occupancy, allows the GPU to execute the memory-bound SIFT algorithm at optimal levels.
Keywords: Scale Invariant Feature Transform (SIFT); parallel computing; GPU; GPU occupancy; portable parallel programming; CUDA
Performance Prediction Based on Statistics of Sparse Matrix-Vector Multiplication on GPUs (Cited by: 1)
5
Authors: Ruixing Wang, Tongxiang Gu, Ming Li. 《Journal of Computer and Communications》, 2017, No. 6, pp. 65-83 (19 pages)
Sparse matrix-vector multiplication (SpMV) is one of the most essential operations in linear algebra, and predicting its performance on GPUs has received increasing attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they did not fully consider the matrix structure, so the execution time predicted by their model tends to be inaccurate for general sparse matrices. To address this problem, we propose two new, similar models that take the structure of the matrices into account and make the performance prediction more accurate. In addition, we use the new models to predict the execution time of SpMV for the CSR-V, CSR-S, ELL, and JAD sparse matrix storage formats on the CUDA platform. Our experimental results show that the prediction accuracy of our models is on average 1.69 times better than that of Guo and Wang's model for most general matrices.
Keywords: sparse matrix-vector multiplication; performance prediction; GPU; normal distribution; uniform distribution
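For readers unfamiliar with the CSR family of formats mentioned in the abstract above, here is a minimal CPU sketch of SpMV over CSR arrays. It assumes the conventional `row_ptr` / `col_idx` / `values` layout; these names and the helper `row_nnz` are ours and are not taken from the paper. On a GPU, a scalar-style kernel would typically assign one thread to each iteration of the outer loop, so the per-row nonzero counts returned by `row_nnz` are the kind of structural statistic that influences running time.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def row_nnz(row_ptr):
    """Number of stored nonzeros in each row (hypothetical helper)."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
print(row_nnz(row_ptr))  # [2, 1, 2]
```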
Implementation of a Particle Accelerator Beam Dynamics Code on Multi-Node GPUs
6
Authors: Zhicong Liu, Ji Qiang. 《Journal of Software Engineering and Applications》, 2019, No. 9, pp. 321-338 (18 pages)
Particle accelerators play an important role in a wide range of scientific discoveries and industrial applications. The self-consistent multi-particle simulation based on the particle-in-cell (PIC) method has been used to study charged particle beam dynamics inside those accelerators. However, the PIC simulation is time-consuming and needs to use modern parallel computers for high-resolution applications. In this paper, we implemented a parallel beam dynamics PIC code on multi-node hybrid architecture computers with multiple Graphics Processing Units (GPUs). We used two methods to parallelize the PIC code on multiple GPUs and observed that the replication method is a better choice for moderate problem sizes and current computer hardware, while the domain decomposition method might be a better choice for large problem sizes and more advanced computer hardware that allows direct communication among multiple GPUs. Using the multi-node hybrid architectures at Oak Ridge Leadership Computing Facility (OLCF), the optimized GPU PIC code achieves a reasonable parallel performance and scales up to 64 GPUs with 16 million particles.
Keywords: particle accelerator; particle-in-cell; GPU; parallel beam dynamics simulation
Real-Time Scheduling Using GPUs--Advanced and More Accurate Proof of Feasibility
7
Authors: Peter Fodrek, L'udovit Farkas, Michal Blahol, Martin Foltin, Juraj Hn'it, Tomas Murgas. 《通讯和计算机(中英文版)》, 2012, No. 8, pp. 863-871 (9 pages)
Keywords: real-time scheduling; GPU; graphics processor; DDR memory; proof; evaluation report; scheduling subsystem; Linux
PELLR: A Permutated ELLPACK-R Format for SpMV on GPUs
8
Authors: Zhiqi Wang, Tongxiang Gu. 《Journal of Computer and Communications》, 2020, No. 4, pp. 44-58 (15 pages)
Sparse matrix-vector multiplication (SpMV) is unavoidable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), highly efficient formats for SpMV need to be constructed. The performance of SpMV is mainly determined by the storage format of the sparse matrix. Based on the idea of the JAD format, this paper improves the ELLPACK-R format and reduces the waiting time between different threads in a warp, achieving a speedup of about 1.5 in our experiments. Compared with other formats, such as CSR, ELL, and BiELL, our format gives the best SpMV performance on over 70 percent of the test matrices. We propose a parameter-based method to analyze the performance impact of different formats. In addition, a formula is constructed to count the computation and the number of iterations.
Keywords: SpMV; GPU; storage format; high performance
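The sketch below illustrates, on the CPU and in pure Python, the general ELLPACK-R idea that the abstract above builds on: rows are padded to a common width, and a separate row-length array lets each row stop early. It also shows why ordering rows by length, which is the JAD-style idea the abstract refers to, can reduce idle time within a warp. This is not the PELLR format itself; the function names, the row-major padded layout, and the cost model in `warp_steps` (each warp takes as many steps as its longest row) are simplifying assumptions for illustration.

```python
def to_ellpack_r(dense):
    """Build padded value/column arrays plus per-row lengths from a dense matrix."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    row_len = [len(r) for r in rows]
    width = max(row_len, default=0)
    cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    vals = [[float(v) for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    return vals, cols, row_len


def spmv_ellpack_r(vals, cols, row_len, x):
    """y = A x; row i reads only its first row_len[i] slots and skips the padding."""
    y = []
    for i, length in enumerate(row_len):
        acc = 0.0
        for k in range(length):
            acc += vals[i][k] * x[cols[i][k]]
        y.append(acc)
    return y


def warp_steps(row_len, warp=32):
    """Assumed cost model: a warp is busy for as long as its longest row."""
    return sum(max(row_len[i:i + warp]) for i in range(0, len(row_len), warp))


vals, cols, row_len = to_ellpack_r([[1, 0, 2], [0, 3, 0], [4, 0, 5]])
print(spmv_ellpack_r(vals, cols, row_len, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]

lengths = [1, 4, 1, 4]
print(warp_steps(lengths, warp=2))                        # 8
print(warp_steps(sorted(lengths, reverse=True), warp=2))  # 5
```

In the toy cost model, grouping rows of similar length into the same warp lowers the total step count from 8 to 5. A real permuted format must also map the results back to the original row order.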
Acceleration of Points to Convex Region Correspondence Pose Estimation Algorithm on GPUs for Real-Time Applications
9
Authors: Raghu Raj P. Kumar, Suresh S. Muknahallipatna, John E. McInroy. 《Journal of Computer and Communications》, 2016, No. 17, pp. 1-17 (18 pages)
In our previous work, a novel algorithm to perform robust pose estimation was presented. The pose was estimated using correspondences between points on the object and regions on the image. The laboratory experiments conducted in the previous work showed that the accuracy of the estimated pose was over 99% for position and 84% for orientation. However, for larger objects, the algorithm requires a high number of points to achieve the same accuracy. The need for more points makes the algorithm computationally intensive and therefore infeasible for real-time computer vision applications. In this paper, the algorithm is parallelized to run on NVIDIA GPUs. The results indicate that even for objects having more than 2000 points, the algorithm can estimate the pose in real time for each frame of high-resolution videos.
Keywords: pose estimation; parallel computing; GPU; CUDA; real time; image processing
Kohn–Sham time-dependent density functional theory with Tamm–Dancoff approximation on massively parallel GPUs
10
Authors: Inkoo Kim, Daun Jeong, Won-Joon Son, Hyung-Jin Kim, Young Min Rhee, Yongsik Jung, Hyeonho Choi, Jinkyu Yim, Inkook Jang, Dae Sin Kim. 《npj Computational Materials》, SCIE, EI, CSCD, 2023, No. 1, pp. 1556-1567 (12 pages)
We report a high-performance multi graphics processing unit (GPU) implementation of the Kohn–Sham time-dependent density functional theory (TDDFT) within the Tamm–Dancoff approximation. Our algorithm on massively parallel computing systems using multiple parallel models in tandem scales optimally with material size, considerably reducing the computational wall time. A benchmark TDDFT study was performed on a green fluorescent protein complex composed of 4353 atoms with 40,518 atomic orbitals represented by Gaussian-type functions, demonstrating the effect of distant protein residues on the excitation. As the largest molecule attempted to date to the best of our knowledge, the proposed strategy demonstrated reasonably high efficiencies up to 256 GPUs on a custom-built state-of-the-art GPU computing system with Nvidia A100 GPUs. We believe that our GPU-oriented algorithms, which empower first-principles simulation for very large-scale applications, may render deeper understanding of the molecular basis of material behaviors, eventually revealing new possibilities for breakthrough designs on new material systems.
Keywords: GPUs; graphics; massive
Efficient Knowledge Graph Embedding Training Framework with Multiple GPUs (Cited by: 1)
11
Authors: Ding Sun, Zhen Huang, Dongsheng Li, Min Guo. 《Tsinghua Science and Technology》, SCIE, EI, CAS, CSCD, 2023, No. 1, pp. 167-175 (9 pages)
When training a large-scale knowledge graph embedding (KGE) model with multiple graphics processing units (GPUs), the partition-based method is necessary for parallel training. However, existing partition-based training methods suffer from low GPU utilization and high input/output (IO) overhead between the memory and disk. To address the high IO overhead, we optimized the twice partitioning with fine-grained GPU scheduling to reduce the IO overhead between the CPU memory and disk. To address the low GPU utilization caused by GPU load imbalance, we proposed balanced partitioning and dynamic scheduling methods to accelerate training in different cases. With the above methods, we proposed fine-grained partitioning KGE, an efficient KGE training framework with multiple GPUs. We conducted experiments on some knowledge graph benchmarks, and the results show that our method achieves a speedup over existing frameworks for KGE training.
Keywords: knowledge graph embedding; parallel algorithm; partitioning graph; framework; graphics processing unit (GPU)
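Analysis and Application of an Optimized Transmission Line Finite Element Method in Electromagnetic Fields
12
Authors: 方锦, 阎秀恪, 钟立国, 任自艳, 张殿海. 《东北电力技术》, 2024, No. 1, pp. 37-42 (6 pages)
To improve the solution efficiency of the transmission line model-finite element method (TLM-FEM), the solution procedures of its incident and reflection phases were optimized. In the reflection phase, an optimized relaxation method is used to accelerate the solution of the nonlinear port voltages, and the computation of the element coefficient matrices and the factorization of the global matrix are parallelized on the GPU to further improve computational efficiency. The optimized TLM-FEM was applied to the electromagnetic field computation of a single-phase transformer, and results from a self-developed C++ program were compared with those of the commercial software ANSYS, verifying the accuracy of the algorithm. A comparison of computation times for the same model with different mesh sizes shows that the proposed method can be used for electromagnetic field analysis of large electrical equipment.
Keywords: optimized relaxation method; parallel computing; transmission line method; finite element; GPUs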
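A Real-Time Method for Generating Target Images under Multi-Light-Source Illumination
13
Authors: 张玉双, 谢晓钢, 苏华, 王锐, 张飞舟. 《强激光与粒子束》, CAS, CSCD, PKU Core, 2024, No. 6, pp. 41-47 (7 pages)
Owing to constraints such as geographic location, the sun, and atmospheric conditions, actual images of space targets cannot be obtained for every attitude and lighting condition, especially under the combined action of laser, sunlight, and background light. A real-time method for generating target images under multi-light-source illumination is proposed. Based on the texture-mapping idea from computer graphics, the method uses modern graphics-card programming and frame buffer object features, with a shader language on the GPU (graphics processing unit) side, to compute target brightness values efficiently under multiple light sources and to enhance realism. The open-source 3D graphics engine OSG (OpenSceneGraph) is used to support 3D model files in multiple formats and to improve compatibility with the domestic Kylin operating system and commonly used battlefield situation display software. Simulation experiments verify the effectiveness and superiority of the method.
Keywords: multiple light sources; image generation; GPU programming; OSG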
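eMD: Large-Scale Molecular Dynamics Simulation Software Based on Heterogeneous Computing
14
Authors: 徐顺, 张宝花, 刘倩, 金钟. 《数据与计算发展前沿》, CSCD, 2024, No. 1, pp. 21-34 (14 pages)
[Objective] Heterogeneous computing has become an important part of high-performance computing, and GPU heterogeneous computing can significantly speed up compute-intensive molecular dynamics simulation applications. This paper introduces the system design of eMD, a self-developed molecular dynamics simulation package, and its heterogeneous computing applications. [Methods] We first describe the goals of eMD in terms of application functionality and computational performance, then outline the software design, including its framework, modules, and interfaces, with a focus on the software architecture design and the porting and optimization techniques for heterogeneous computing. [Results] Based on GPU heterogeneous computing, eMD can simulate large-scale systems and also provides distinctive molecular dynamics simulation algorithms and models. [Conclusions] eMD will make full use of GPU heterogeneous computing power to improve the efficiency of molecular dynamics simulation applications, supporting innovative applications of molecular modeling theory and methods and research on problems in molecular science.
Keywords: molecular dynamics; GPU heterogeneous computing; parallel computing; domestic supercomputer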
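DRM: A GPU-Parallel SpMV Storage Format Based on an Iterative Merging Strategy
15
Authors: 王宇华, 何俊飞, 张宇琪, 徐悦竹, 崔环宇. 《计算机工程与科学》, CSCD, PKU Core, 2024, No. 3, pp. 381-394 (14 pages)
Sparse matrix-vector multiplication (SpMV) is of great importance in solving linear systems and is one of the core problems in scientific computing and engineering practice; its performance depends heavily on the distribution of nonzeros in the sparse matrix. Sparse diagonal matrices are a special class of sparse matrices whose nonzero elements are densely arranged along diagonals. For such matrices, the various storage formats proposed on GPU platforms improve SpMV performance but still suffer from zero padding and load imbalance. To address these problems, the DRM storage format is proposed. It uses a fixed-threshold matrix partitioning strategy and an iterative-merging matrix reconstruction strategy to achieve little zero padding and load balance between blocks. Experiments on the NVIDIA Tesla V100 platform show that, compared with the DIA, HDC, HDIA, and DIA-Adaptive formats, the proposed format achieves speedups of 20.76, 1.94, 1.13, and 2.26 times in execution time, and improvements of 1.54, 5.28, 1.13, and 1.94 times in floating-point performance, respectively.
Keywords: GPU; SpMV; sparse diagonal matrix; zero padding; load balance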
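To make the zero-padding issue in the abstract above concrete, the following pure-Python sketch implements SpMV for the baseline DIA (diagonal) format that the paper compares against. It is not the DRM format. It assumes a square matrix and a layout in which `data[d][i]` holds the entry A[i][i + offsets[d]]; the function name and the layout convention are our choices for the example. Every stored diagonal occupies n slots, so positions that fall outside the matrix, or diagonals that are only partly filled, are padded with zeros.

```python
def spmv_dia(offsets, data, x):
    """y = A x for a square matrix stored diagonal by diagonal (DIA layout)."""
    n = len(x)
    y = [0.0] * n
    for d, off in enumerate(offsets):
        for i in range(n):
            j = i + off  # column index of the entry on this diagonal in row i
            if 0 <= j < n:
                y[i] += data[d][i] * x[j]
    return y


# Tridiagonal example:
# A = [[ 2, -1,  0],
#      [-1,  2, -1],
#      [ 0, -1,  2]]
offsets = [-1, 0, 1]
data = [
    [0.0, -1.0, -1.0],   # sub-diagonal; the first slot is padding
    [2.0, 2.0, 2.0],     # main diagonal
    [-1.0, -1.0, 0.0],   # super-diagonal; the last slot is padding
]
print(spmv_dia(offsets, data, [1.0, 2.0, 3.0]))  # [0.0, 0.0, 4.0]
```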
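A GPU-Integrated Method for Computing Semi-Monolayer Covering Approximation Sets
16
Authors: 吴正江, 吕成功, 王梦松. 《计算机工程》, CAS, CSCD, PKU Core, 2024, No. 5, pp. 71-82 (12 pages)
The semi-monolayer covering rough set is a high-quality, efficient rough set model suited to set-valued information systems. Computing semi-monolayer covering approximation sets involves many operations that are computationally intensive but logically simple. A matrix representation of these approximation sets is therefore proposed so that the computing power of the graphics processing unit (GPU) can be used to accelerate the computation. To this end, Boolean matrices are used to represent the elements of the semi-monolayer covering approximation space, Boolean matrix operators corresponding to set operations are introduced, matrix representations of the semi-monolayer covering rough approximation sets (DE, DA, DE0, and DA0) are given, and a matrix-based algorithm for semi-monolayer covering approximation sets (M_SMC) is designed. Corresponding theorems prove that the matrix representation is equivalent to the original definitions. However, M_SMC consumes excessive memory in its matrix storage and computation steps. To deploy the algorithm on GPUs with limited device memory, the matrix storage and computation steps are optimized and a batched matrix-based algorithm (BM_SMC) is proposed. Experiments on 10 datasets show that BM_SMC with a GPU improves computational efficiency by a factor of 2.16 to 11.3 over BM_SMC using only the central processing unit (CPU). BM_SMC can make full use of the GPU under limited storage and effectively improves the efficiency of computing semi-monolayer covering approximation sets.
Keywords: semi-monolayer covering approximation set; set-valued information system; matrix representation; GPU acceleration; batch processing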
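TEB: An Efficient SpMV Storage Format Based on Matrix Decomposition and Reconstruction on GPUs
17
Authors: 王宇华, 张宇琪, 何俊飞, 徐悦竹, 崔环宇. 《计算机科学与探索》, CSCD, PKU Core, 2024, No. 4, pp. 1094-1108 (15 pages)
Sparse matrix-vector multiplication (SpMV) is a crucial computation in science and engineering, and CSR (compressed sparse row) is one of the most commonly used sparse matrix storage formats. When parallel SpMV is implemented on graphics processing unit (GPU) platforms, CSR stores only the nonzero elements of the sparse matrix, which avoids the redundant computation caused by zero padding and saves storage space, but it suffers from load imbalance and wastes computing resources. To address these problems, recent well-performing storage formats were studied and a row-wise decomposition and reorganization storage format, TEB (threshold-exchangeorder block), is proposed. The format uses a heuristic threshold selection algorithm to determine a suitable split threshold and combines it with a reordering-based row merging algorithm to reconstruct and decompose the sparse matrix, so that the numbers of nonzeros in different blocks are as close as possible. Then, using CUDA (compute unified device architecture) threading, an inter-sub-block parallel SpMV algorithm based on the TEB format is proposed, which allocates computing resources sensibly and resolves load imbalance, thereby improving the efficiency of parallel SpMV. To verify the effectiveness of TEB, experiments were run on the NVIDIA Tesla V100 platform. The results show that, compared with the PBC (partition-block-CSR), AMF-CSR (adaptive multi-row folding of CSR), CSR-Scalar (compressed sparse row-scalar), and CSR5 (compressed sparse row 5) formats, TEB improves SpMV execution time by factors of 3.23, 5.83, 2.33, and 2.21 on average, and floating-point performance by factors of 3.36, 5.95, 2.29, and 2.13 on average.
Keywords: sparse matrix-vector multiplication (SpMV); reordering; CSR format; load balance; storage format; graphics processing unit (GPU)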
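Research on a Flood Forecasting System for the Pajiang (潖江) Flood Storage and Detention Area Based on GPU Parallel Computing and WebGIS
18
Authors: 陈丕翔, 叶志恒, 叶利娜, 王扬. 《广东水利水电》, 2024, No. 6, pp. 69-72, 79 (5 pages)
The numerical simulation used in flood forecasting involves a large amount of computation, and its results must be processed by several specialized software packages before they can be shown to users. This cumbersome workflow cannot meet the need for rapid response in flood control scheduling and emergency response in flood storage and detention areas. This paper proposes a flood forecasting system based on GPU parallel computing and WebGIS, aiming to improve the efficiency of flood computation, extend the forecast lead time, and visualize flood routing. The system is based on recent GPU-accelerated computing methods and uses the GPU's floating-point capability to greatly improve the efficiency of flood computation. Combined with WebGIS technology, it links the results of the hydrological and flood routing model seamlessly with water-resources base maps and presents the flood evolution process as charts, images, and animations, so that decision makers can follow the evolution of floods in the detention area directly. The system can support the operation of the Pajiang flood storage and detention area as well as flood control and emergency response.
Keywords: GPU; WebGIS; Pajiang (潖江) flood storage and detention area; flood forecasting system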
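Application and Optimization of Long Short-Term Memory Network Models in GPU Heterogeneous Computing Environments
19
Authors: 梁桂才, 梁思成, 陆莹. 《计算机应用文摘》, 2024, No. 10, pp. 37-41 (5 pages)
With the wide application of deep learning and the growing heterogeneity of computing resources, accelerating deep learning in GPU heterogeneous computing environments has become another research focus. This article discusses how to apply long short-term memory (LSTM) network models in a GPU heterogeneous computing environment and how to improve their performance through optimization strategies. First, the basic structure of the LSTM model is introduced (including gated recurrent units, dropout, Adam, and bidirectional LSTM networks). Second, a series of optimizations executed on the GPU are proposed, such as use of the cuDNN library and the design of parallel computation. Finally, experiments analyze how these optimizations differ in training time, validation-set performance, test-set performance, hyperparameters, and hardware resource usage.
Keywords: GPU heterogeneous computing; long short-term memory network; gated recurrent unit; Adam; dropout; cuDNN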
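Mixed-Precision Inference with Subgraph Fusion as the Minimum Unit
20
Authors: 崔丽群, 胡磊. 《软件导刊》, 2024, No. 6, pp. 44-52 (9 pages)
In recent years, convolutional neural networks, the most important technique in deep learning, have made contributions in image classification, object detection, speech recognition, and other fields. During this period, deep neural networks composed of multiple convolutional layers emerged and significantly improved accuracy on various tasks. However, neural network weights are usually restricted to a single precision type, which makes the network large relative to the memory available on specific hardware platforms, and single precision types such as floating point 16 and INT 8 can no longer meet the practical needs of some model inference. To this end, a mixed-precision inference algorithm is proposed that takes the subgraph as the minimum unit, determines the fusion relationships between adjacent nodes, and adds richer bit-width options. First, a floating point 16 half-precision bit configuration is added to the search space of the original single-precision quantization design, enlarging the final search space and providing more opportunities to find the optimal solution. Second, using the idea of subgraph fusion, integer linear programming assigns precision configurations to the fused subgraphs and partitions the computation graph under three constraints (model size, inference latency, and bit-width operation count), reducing the accumulated perturbation error. Finally, validation on the ResNet family shows that the accuracy loss of the proposed model relative to HAWQ V3 does not exceed 1%, while inference is faster than with other mixed-precision quantization methods: inference speed improves by 18.15% and 19.21% on ResNet18, and by 13.15% and 13.70% on ResNet50.
Keywords: subgraph fusion; mixed-precision inference; constrained optimization; GPU acceleration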