411 articles found
Parallel Image Processing: Taking Grayscale Conversion Using OpenMP as an Example
1
Authors: Bayan AlHumaidan, Shahad Alghofaily, Maitha Al Qhahtani, Sara Oudah, Naya Nagy. 《Journal of Computer and Communications》, 2024, No. 2, pp. 1-10 (10 pages)
In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, has ushered in a new era of efficiency and speed. This trend is particularly noteworthy in the field of image processing, which has seen significant advances. This parallel computing project explored parallel image processing, with a focus on the grayscale conversion of color images. Our approach integrated OpenMP into our framework to parallelize this critical image processing task. By using OpenMP, we improved the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of the project were to reduce computation time and improve overall efficiency in the grayscale conversion of color images. Using OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency as the number of cores increases. This underscores the importance of a carefully optimized parallelization strategy that considers factors such as load balancing and minimizing communication overhead. Despite these challenges, the overall scalability and efficiency achieved with parallel image processing underscore OpenMP's effectiveness in accelerating image manipulation tasks.
Keywords: Parallel Computing; Image Processing; OpenMP; Parallel Programming; High Performance Computing; GPU (graphics processing unit)
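The row-wise work distribution described in this abstract can be sketched as follows. This is a minimal illustration in Python, not the authors' OpenMP code: a thread pool stands in for OpenMP threads, the integer luminance weights (299, 587, 114 per thousand, a common BT.601-style convention) are an assumption because the abstract gives no formula, and the names `to_gray_row`, `to_gray_parallel`, and `parallel_efficiency` are hypothetical. In CPython the global interpreter lock means the sketch shows the decomposition pattern rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def to_gray_row(row):
    """Convert one row of (r, g, b) tuples to rounded integer gray levels."""
    # Integer BT.601-style weights (assumed); adding 500 rounds to nearest.
    return [(299 * r + 587 * g + 114 * b + 500) // 1000 for r, g, b in row]


def to_gray_parallel(image, workers=4):
    """Hand rows to a pool of workers; each row is independent of the others."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves row order, so the output lines up with the input.
        return list(pool.map(to_gray_row, image))


def parallel_efficiency(t_serial, t_parallel, cores):
    """Speedup divided by core count; 1.0 would be ideal scaling."""
    return (t_serial / t_parallel) / cores
```

Efficiency below 1.0 that keeps falling as `cores` grows is the kind of decline the abstract reports; the sketch only defines the metric and does not reproduce the paper's measurements.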
Optimization of a precise integration method for seismic modeling based on graphic processing unit (cited by: 2)
2
Authors: Jingyu Li, Genyang Tang, Tianyue Hu. 《Earthquake Science》 [CSCD], 2010, No. 4, pp. 387-393 (7 pages)
General-purpose graphics processing unit (GPU) computing is increasingly used in various fields. Its single-instruction, multiple-thread execution mode suits seismic numerical simulation, which involves a huge quantity of data and many calculation steps. In this study, we introduce a GPU-based parallel implementation of the precise integration method (PIM) for seismic forward modeling. Compared with single-core CPU calculation, GPU parallel calculation preserves the features of PIM, namely small bandwidth, high accuracy, and the capability of modeling complex substructures, while delivering high computational efficiency. High-performance GPU parallel calculation can therefore bring seismic forward modeling closer to real seismic records.
Keywords: precise integration method; seismic modeling; general purpose GPU; graphic processing unit
Compute Unified Device Architecture Implementation of Euler/Navier-Stokes Solver on Graphics Processing Unit Desktop Platform for 2-D Compressible Flows
3
Authors: Zhang Jiale, Chen Hongquan. 《Transactions of Nanjing University of Aeronautics and Astronautics》 [EI, CSCD], 2016, No. 5, pp. 536-545 (10 pages)
A personal desktop platform with teraflops peak performance from thousands of cores can be built at the price of a conventional workstation using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. The implementation techniques for CUDA kernels, the double-layered thread hierarchy, and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated on a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved on a single-GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
Keywords: graphics processing unit (GPU); GPU parallel computing; compute unified device architecture (CUDA) Fortran; finite volume method (FVM); acceleration
Multi-relaxation-time lattice Boltzmann simulations of lid driven flows using graphics processing unit
4
Authors: Chenggong LI, J.P.Y. MAA. 《Applied Mathematics and Mechanics (English Edition)》 [SCIE, EI, CSCD], 2017, No. 5, pp. 707-722 (16 pages)
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine velocity components (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the LBM in numerical simulations of turbulent flows, the massively parallel computing power of a graphics processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with the results of others, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a properly selected Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
Keywords: large eddy simulation (LES); multi-relaxation-time (MRT); lattice Boltzmann equation (LBE); two-dimensional nine velocity components (D2Q9); Smagorinsky model; graphic processing unit (GPU); computing unified device architecture (CUDA)
Graphic Processing Unit-Accelerated Neural Network Model for Biological Species Recognition
5
Authors: 温程璐, 潘伟, 陈晓熹, 祝青园. 《Journal of Donghua University (English Edition)》 [EI, CAS], 2012, No. 1, pp. 5-8 (4 pages)
A graphic processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper overcomes the limitation of traditional neural networks to small inputs. The whole image is taken as the input of the neural network, so the maximum number of features is kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was developed on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were used to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, achieving a speedup of 343 over the CPU implementation. Compared with a feature-based recognition method on the same task, the method also achieved an acceptable correct rate of 84.6% when tested on the eight biological species.
Keywords: graphic processing unit (GPU); compute unified device architecture (CUDA); neural network; species recognition
Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program
6
Authors: 赵夏, 马胜, 陈微, 王志英. 《Journal of Shanghai Jiaotong University (Science)》 [EI], 2016, No. 3, pp. 280-288 (9 pages)
Simulation is an important means of evaluating the performance of a computer architecture. At present, the serial simulation of general purpose graphics processing unit (GPGPU) architectures is the main bottleneck for simulation speed. To address this issue, we propose intra-kernel parallelization on a multicore processor and inter-kernel parallelization on a multiple-machine platform. We apply these two methods to the GPGPU-sim simulator. The intra-kernel parallelization method first parallelizes the serial simulation of multiple compute units in one cycle. It then parallelizes the timing and functional simulation to reduce the performance loss caused by synchronization between different compute units. The inter-kernel parallelization method divides the kernels of a CUDA program into several groups and distributes these groups across multiple simulation hosts. Experimental results show that the intra-kernel parallelization method achieves a speedup of up to 12 with a maximum error rate of 0.0094% on a 32-core machine, and the inter-kernel parallelization method accelerates the simulation by a factor of up to 3.9 with a maximum error rate of 0.11% on four simulation hosts. Because the two methods are orthogonal, they can be combined on multiple multi-core hosts for further performance improvements.
Keywords: general purpose graphics processing unit (GPGPU); multicore; intra-kernel; inter-kernel; parallel
Optimizing photoacoustic image reconstruction using cross-platform parallel computation
7
Authors: Tri Vu, Yuehang Wang, Jun Xia. 《Visual Computing for Industry, Biomedicine, and Art》, 2018, No. 1, pp. 12-17 (6 pages)
Three-dimensional (3D) image reconstruction involves computations on an extensive amount of data, which leads to long processing times. Optimization is therefore needed to improve performance and efficiency. With the widespread use of graphics processing units (GPUs), parallel computing is transforming this arduous reconstruction process for numerous imaging modalities, and photoacoustic computed tomography (PACT) is no exception. Existing works have investigated GPU-based optimization of photoacoustic microscopy (PAM) and PACT reconstruction using compute unified device architecture (CUDA) in either C++ or MATLAB only. Our study is the first to use cross-platform GPU computation. It maintains the simplicity of MATLAB while improving speed through CUDA/C++-based MATLAB converted functions called MEXCUDA. Compared with a purely MATLAB-with-GPU approach, our cross-platform method is five times faster. Because MATLAB is widely used in PAM and PACT, this study will open up new avenues for photoacoustic image reconstruction and related real-time imaging applications.
Keywords: Photoacoustic computed tomography; Graphics processing units; Parallel computation; Focal-line backprojection algorithm; MATLAB; Optical imaging
An Efficient Acceleration of Solving Heat and Mass Transfer Equations with the Second Kind Boundary Conditions in Capillary Porous Composite Cylinder Using Programmable Graphics Hardware
8
Authors: Hira Narang, Fan Wu, Abdul Rafae Mohammed. 《Journal of Computer and Communications》, 2018, No. 9, pp. 24-38 (15 pages)
With recent developments in computing technology, increased effort has gone into the simulation of various scientific methods and phenomena in engineering fields. One such case is the simulation of heat and mass transfer in capillary porous media, which is becoming more and more important in analysing various scenarios in engineering applications. Analysing such heat and mass transfer phenomena in a given environment requires simulating them, which entails solving the coupled heat and mass transfer equations. However, the numerical solution of these equations is very time-consuming. This paper therefore applies one of the acceleration techniques developed in the graphics community, which exploits a graphics processing unit (GPU), to the numerical solution of heat and mass transfer equations. The nVidia Compute Unified Device Architecture (CUDA) programming model offers a good method of applying parallel computing to program the graphics processing unit. This paper shows a good improvement in performance when the heat and mass transfer equations for a capillary porous composite cylinder with boundary conditions of the second kind are solved numerically on a GPU. The simulation is implemented using the CUDA platform on an nVidia Quadro FX 4800 graphics card. Our experimental results show a drastic performance improvement when the GPU is used to perform the heat and mass transfer simulation, with a maximum observed speedup of more than 7-fold. The GPU is therefore a good approach to accelerating heat and mass transfer simulation.
Keywords: Numerical Solution; Heat and Mass Transfer; General Purpose Graphics Processing Unit (GPGPU); CUDA
An Efficient Acceleration of Solving Heat and Mass Transfer Equations with the First Kind Boundary Conditions in Capillary Porous Radially Composite Cylinder Using Programmable Graphics Hardware
9
Authors: Hira Narang, Fan Wu, Abdul Rafae Mohammed. 《Journal of Computer and Communications》, 2019, No. 7, pp. 267-281 (15 pages)
With the latest advances in computing technology, a great deal of effort has gone into the simulation of a range of scientific phenomena in engineering fields. One such case is the simulation of heat and mass transfer in capillary porous media, which is becoming more and more necessary in analyzing a number of scenarios in science and engineering applications. However, the numerical solution of heat and mass transfer equations for capillary porous media is very time-consuming. This paper therefore applies one of the acceleration methods developed in the graphics community, which exploits a graphics processing unit (GPU), to the numerical solution of such heat and mass transfer equations. The nVidia Compute Unified Device Architecture (CUDA) programming model offers a suitable approach to applying parallel computing to applications on a graphics processing unit. This paper shows a real improvement in performance when solving the heat and mass transfer equations for a capillary porous radially composite cylinder with boundary conditions of the first kind. The simulation is carried out using the CUDA platform on an nVidia Quadro FX 4800 graphics card. Our experimental results show a drastic overall performance improvement when the GPU is used to perform the heat and mass transfer simulation, with a maximum observed speedup of more than 5-fold. The GPU is therefore a good strategy for accelerating heat and mass transfer simulation in porous media.
Keywords: Numerical Solution; Heat and Mass Transfer; General Purpose Graphics Processing Unit (GPGPU); CUDA
A Computational Comparison of Basis Updating Schemes for the Simplex Algorithm on a CPU-GPU System
10
Authors: Nikolaos Ploskas, Nikolaos Samaras. 《American Journal of Operations Research》, 2013, No. 6, pp. 497-505 (9 pages)
The computation of the basis inverse is the most time-consuming step in simplex-type algorithms. This inverse does not have to be computed from scratch at every iteration; updating schemes can be applied to accelerate the calculation. In this paper, we perform a computational comparison in which the basis inverse is computed with five different updating schemes. We then propose a parallel implementation of two updating schemes on a CPU-GPU system using the MATLAB and CUDA environments. Finally, a computational study on randomly generated fully dense linear programs is presented to establish the practical value of the GPU-based implementation.
Keywords: Simplex Algorithm; Basis Inverse; Graphics Processing Unit; MATLAB; Compute Unified Device Architecture
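The idea of updating the basis inverse rather than recomputing it can be sketched as follows. The abstract does not name the five schemes it compares, so this is only one classical option, the product-form (eta) update, written in plain Python rather than MATLAB or CUDA; the function name `update_basis_inverse` and its argument names are hypothetical. It assumes the entering column replaces basis column `r` and costs O(n^2) operations instead of the O(n^3) of a fresh inversion.

```python
def update_basis_inverse(b_inv, entering_col, r):
    """Return the inverse of the basis after column r is replaced by entering_col."""
    n = len(b_inv)
    # alpha = B^{-1} a_q: the entering column expressed in the current basis.
    alpha = [sum(b_inv[i][k] * entering_col[k] for k in range(n)) for i in range(n)]
    if alpha[r] == 0:
        raise ValueError("pivot element is zero; the new basis would be singular")
    new_inv = [row[:] for row in b_inv]
    # Pivot row: scale by 1 / alpha_r.
    new_inv[r] = [v / alpha[r] for v in b_inv[r]]
    # Other rows: subtract alpha_i times the new pivot row.
    for i in range(n):
        if i != r:
            new_inv[i] = [b_inv[i][j] - alpha[i] * new_inv[r][j] for j in range(n)]
    return new_inv
```

Each row update is independent of the others once the pivot row is known, which is one reason such schemes are candidates for GPU parallelization.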
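A Survey of Thread Synchronization in GPU Parallel Programming
11
Authors: 高岚, 赵雨晨, 张伟功, 王晶, 钱德沛. 《软件学报》 [EI, CSCD, PKU Core], 2024, No. 2, pp. 1028-1047 (20 pages)
Parallel computing has become the mainstream trend. In parallel computing systems, synchronization is one of the key design issues and is critical to making full use of hardware performance. In recent years, the GPU (graphics processing unit), the most widely used accelerator, has developed rapidly, and many applications place increasingly high demands on GPU thread synchronization. However, existing GPU systems struggle to support efficiently the complex thread synchronization found in real applications. Although researchers have proposed many methods for supporting GPU thread synchronization and have made considerable progress, the GPU's unique architecture and parallel model mean that research on GPU thread synchronization still faces many challenges. This survey classifies thread synchronization in GPU parallel programming by the purpose and granularity of synchronization. On this basis, focusing on how GPU thread synchronization is expressed and executed, it first analyzes and summarizes the key problems and challenges: synchronization is hard to express efficiently, is error-prone, and executes inefficiently. Then, organized by synchronization granularity, it reviews recent academic and industrial research on competitive and cooperative synchronization of GPU threads from two angles, synchronization expression methods and performance optimization methods, and analyzes and summarizes the existing approaches. Finally, it points out future research trends and prospects for GPU thread synchronization and suggests possible research directions, providing a reference for researchers in the field.
Keywords: general-purpose graphics processing unit (GPGPU); parallel programming; thread synchronization; performance optimization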
Developing Extensible Lattice-Boltzmann Simulators for General-Purpose Graphics-Processing Units
12
Authors: Stuart D.C. Walsh, Martin O. Saar. 《Communications in Computational Physics》 [SCIE], 2013, No. 3, pp. 867-879 (13 pages)
Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior. These methods are well suited to parallel implementation, particularly on the single-instruction multiple-data (SIMD) parallel processing environments found in computer graphics processing units (GPUs). Although recent programming tools dramatically improve the ease with which GPU-based applications can be written, the programming environment still lacks the flexibility available to more traditional CPU programs. In particular, it may be difficult to develop modular and extensible programs that require variable on-device functionality with current GPU architectures. This paper describes a process of automatic code generation that overcomes these difficulties for lattice-Boltzmann simulations. It details the development of GPU-based modules for an extensible lattice-Boltzmann simulation package, LBHydra. The performance of the automatically generated code is compared with equivalent purpose-written codes for single-phase, multiphase, and multicomponent flows. The flexibility of the new method is demonstrated by simulating a rising, dissolving droplet moving through a porous medium with user-generated lattice-Boltzmann models and subroutines.
Keywords: Lattice-Boltzmann methods; graphics processing units; computational fluid dynamics
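GPU-Based Algorithm Optimization of the LBM Streaming Module
13
Authors: 黄斌, 柳安军, 潘景山, 田敏, 张煜, 朱光慧. 《计算机工程》 [CAS, CSCD, PKU Core], 2024, No. 2, pp. 232-238 (7 pages)
The lattice Boltzmann method (LBM) is a computational fluid dynamics method that works at the mesoscopic simulation scale. It sets up a large number of discrete lattice points during computation, which makes it well suited to parallelization. A graphics processing unit (GPU) contains a large number of arithmetic logic units and is suited to large-scale parallel computing, so designing a parallel LBM algorithm for the GPU can improve computational efficiency. However, in the streaming module of the LBM algorithm, the computation at each lattice point requires communication with other lattice points, which creates strong data dependencies. This paper proposes a GPU-based optimization strategy for the LBM streaming module. First, the implementation logic of the streaming step is analyzed and, through model dimensionality reduction, the three-dimensional model is decomposed by velocity component into multiple two-dimensional models, reducing model complexity. Next, the differences in lattice-point data before and after the streaming computation are analyzed, the communication pattern of the streaming module is found through data localization, and the data exchange modes between lattice points are classified. Finally, the classified exchange modes are used to partition the discrete two-dimensional models into regions, and a new data communication scheme is designed that removes the effect of the data dependencies and fully parallelizes the streaming module. Tests of the parallel algorithm show a speedup of 1.92 on a grid of size 1.3×10^8, indicating good parallel performance. Compared with an algorithm whose streaming module is not parallelized, the proposed optimization strategy improves parallel computing efficiency by 30%.
Keywords: high-performance computing; lattice Boltzmann method; graphics processing unit; parallel optimization; data rearrangement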
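The data dependency in the streaming step that this abstract describes can be illustrated with a small sketch. It is not the paper's region-partitioning scheme: it assumes the standard two-dimensional D2Q9 velocity set, periodic boundaries, and a double-buffered "pull" layout, and the names `D2Q9` and `stream` are hypothetical. Reading from the old array and writing to a new one is the textbook way to make every lattice-point update independent, at the cost of a second copy of the data.

```python
# D2Q9 lattice velocities (cx, cy): rest, four axis directions, four diagonals.
D2Q9 = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
        (1, 1), (-1, 1), (-1, -1), (1, -1)]


def stream(f, nx, ny):
    """One streaming step on a periodic nx-by-ny grid; f[i][y][x] is direction i."""
    new_f = []
    for i, (cx, cy) in enumerate(D2Q9):
        # Pull scheme: each target cell reads from its upstream neighbour,
        # so every (i, y, x) write is independent and could map to one GPU thread.
        new_f.append([[f[i][(y - cy) % ny][(x - cx) % nx] for x in range(nx)]
                      for y in range(ny)])
    return new_f
```

An in-place variant would overwrite values that neighbouring cells still need to read, which is the dependency the abstract refers to.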
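An OpenCL-Accelerated Algorithm for Image Median Filtering on Heterogeneous Platforms
14
Authors: 肖诗洋, 王镭, 杜莹, 肖汉. 《河北大学学报(自然科学版)》 [CAS, PKU Core], 2024, No. 1, pp. 92-103 (12 pages)
Image noise lowers the signal-to-noise ratio and quality of an image, and denoising is one of the important steps in image processing. This paper proposes a fast parallel denoising algorithm for image median filtering based on the Open Computing Language (OpenCL) architecture. The characteristics of the OpenCL architecture and the median filtering workflow are introduced. Based on the concurrent structure of the graphics processing unit (GPU), the median filtering module is parallelized and optimized, reducing algorithmic complexity. Data access efficiency is improved by fully activating the work-groups and work-items in the NDRange index space, and the kernel work-group configuration parameters are tuned, which realizes parallel processing of the median filter. Experimental results show that, with image quality unchanged, the parallel median filtering algorithm under the OpenCL architecture on an NVIDIA GPU platform achieves speedups of 29.74, 17.29, and 1.15 over the CPU-based serial algorithm, the Open Multi-Processing (OpenMP) parallel algorithm, and the Compute Unified Device Architecture (CUDA) parallel algorithm, respectively. This verifies the effectiveness of the algorithm and its portability across platforms, and the algorithm basically meets the real-time processing requirements of applications.
Keywords: median filtering; salt-and-pepper noise; graphics processing unit; Open Computing Language; parallel algorithm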
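The median filtering workflow mentioned in this abstract can be sketched serially as follows. This is an illustrative Python version, not the paper's OpenCL kernel: the 3x3 window size and the edge-replication border rule are assumptions (the abstract specifies neither), and the name `median_filter_3x3` is hypothetical.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Apply a 3x3 median filter to a 2D list of gray levels (edges replicated)."""
    h, w = len(image), len(image[0])

    def clamp(v, hi):
        return max(0, min(v, hi))

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Each output pixel depends only on its own 3x3 window, which is
            # why one pixel can map to one work-item in a GPU kernel.
            window = [image[clamp(y + dy, h - 1)][clamp(x + dx, w - 1)]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median_low(window)
    return out
```

For example, a single salt pixel of value 255 in an otherwise uniform patch of 10s is replaced by 10, because it never reaches the middle of any sorted window.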
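Research on a GPU-Based Clipping Algorithm for Real-Scene 3D Models
15
Authors: 马东岭, 李铭通, 朱悦凯. 《山东建筑大学学报》, 2024, No. 1, pp. 108-116 (9 pages)
As a mainstream accelerator for high-performance computing, the graphics processing unit (GPU) is increasingly used for parallel computing in many fields, and exploiting its parallel computing capability can greatly improve the efficiency of traditional algorithms. This article studies the application of GPU multithreaded computing and Compute Unified Device Architecture (CUDA) technology to the clipping of real-scene 3D models and proposes a GPU-based clipping algorithm for such models. The work includes designing a multi-level index structure based on face topology to find duplicate intersection points quickly within a thread; proposing a lightweight polygon triangulation method to streamline the algorithm; and applying several optimization strategies to further improve performance without degrading the quality of the clipped mesh. The results show that, depending on model size and the number of clipping operations, the proposed method achieves a speedup of up to 13.93 over the traditional algorithm for a single clip and up to 35.85 for multiple clips, significantly improving clipping efficiency.
Keywords: graphics processing unit; real-scene 3D model; triangular mesh clipping; parallel computing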
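Research on the Technological Evolution and Development Strategy of Artificial Intelligence Chips
16
Authors: 宋艳飞, 孙佳琪, 王睿哲. 《信息技术与标准化》, 2024, No. 6, pp. 94-98 (5 pages)
To accelerate the race for leadership in artificial intelligence technology, and focusing on the needs of the artificial intelligence chip industry, this paper describes the technical characteristics, technology roadmaps, and development trends of artificial intelligence chips, and analyzes the deep coupling between artificial intelligence chips and the industrial ecosystem of advanced manufacturing, software, and complete-system applications. Based on the characteristics of artificial intelligence chip technology and industrial development, it puts forward strategic recommendations to strengthen overall coordination, unleash market advantages, and deepen international cooperation, with the aim of promoting the high-quality development of China's artificial intelligence chips.
Keywords: artificial intelligence chip; graphics processing unit; compute unified device architecture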
A GPU-Based Parallel Algorithm for 2D Large Deformation Contact Problems Using the Finite Particle Method (cited by: 1)
17
Authors: Wei Wang, Yanfeng Zheng, Jingzhe Tang, Chao Yang, Yaozhi Luo. 《Computer Modeling in Engineering & Sciences》 [SCIE, EI], 2021, No. 11, pp. 595-626 (32 pages)
Large deformation contact problems generally involve highly nonlinear behaviors, which are very time-consuming to compute and may lead to convergence issues. The finite particle method (FPM) effectively separates pure deformation from total motion in large deformation problems. In addition, the decoupled procedures of the FPM make it suitable for parallel computing, which may provide a way to address the long computation times. In this study, a graphics processing unit (GPU)-based parallel algorithm is proposed for two-dimensional large deformation contact problems. The fundamentals of the FPM for planar solids are first briefly introduced, including the equations of motion of particles and the internal forces of quadrilateral elements. Subsequently, a linked-list data structure suitable for parallel processing is built, and parallel global and local search algorithms are presented for contact detection. The contact forces are then derived and directly exerted on particles. The proposed method is implemented with the main solution procedures executed in parallel on a GPU. Two verification problems comprising large deformation frictional contacts are presented, and the accuracy of the proposed algorithm is validated. Furthermore, the algorithm's performance is investigated via a large-scale contact problem, and the maximum speedups of total computational time and contact calculation reach 28.5 and 77.4, respectively, relative to the commercial finite element software Abaqus/Explicit running on a single-core central processing unit (CPU). The contact calculation accounts for only 18% of the total calculation time with the FPM, much less than the 50% with Abaqus/Explicit, demonstrating the efficiency of the proposed method.
Keywords: Finite particle method; graphics processing unit (GPU); parallel computing; contact algorithm; LARGE
Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units (cited by: 6)
18
Authors: XIONG QinGang, LI Bo, XU Ji, FANG XiaoJian, WANG XiaoWei, WANG LiMin, HE XianFeng, GE Wei. 《Chinese Science Bulletin》 [SCIE, EI, CAS], 2012, No. 7, pp. 707-715 (9 pages)
Many-core processors, such as graphic processing units (GPUs), are promising platforms for intrinsically parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM on multiple GPUs has not been studied extensively and systematically. In this article, we carry out LBM simulation on a GPU cluster with many nodes, each having multiple Fermi GPUs. Asynchronous execution with CUDA stream functions, OpenMP, and non-blocking MPI communication are incorporated to improve efficiency. The algorithm is tested on two-dimensional Couette flow, and the results are in good agreement with the analytical solution. For both the one- and two-dimensional decomposition of space, the algorithm performs well because most of the communication time is hidden. Direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million solid particles and one billion gas lattice cells demonstrates the potential of this algorithm in large-scale engineering applications. The algorithm can be directly extended to three-dimensional decomposition of space and to other modeling methods, including explicit grid-based methods.
Keywords: lattice Boltzmann method; graphics processing unit; parallel algorithm; cluster; Couette flow; LBM simulation; OpenMP; direct numerical simulation
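Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit (cited by: 2)
19
Authors: Ke-shi GE, Hua-you SU, Dong-sheng LI, Xi-cheng LU. 《Frontiers of Information Technology & Electronic Engineering》 [SCIE, EI, CSCD], 2017, No. 7, pp. 915-927 (13 pages)
The density peak (DP) clustering method is widely used in scientific research because it is novel and effective. However, when determining cluster centers, DP operates on each pair of data points multiple times, which leads to high computational complexity. In this paper, we propose an efficient parallel density peak algorithm based on the GPU (graphics processing unit). We analyze the principle of the density peak clustering algorithm to study its computational bottlenecks and evaluate its potential for parallelization. Based on this analysis, we propose CUDA-DP (compute unified device architecture-DP), an efficient parallel density peak clustering algorithm for the GPU architecture, and implement it with CUDA. Specifically, we use shared memory to reduce the number of global memory accesses. Furthermore, to exploit the GPU's coalesced access mechanism, we restructure the data layout of the CUDA-DP program from AOS (array of structures) to SOA (structure of arrays). In addition, we introduce a binary search method and a sampling method, respectively, to avoid the computational overhead of sorting the distance matrix. Experimental results show that CUDA-DP achieves a speedup of more than 45 times over a CPU-based density peak implementation.
Keywords: GPU; density peak; clustering; parallel computing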
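The pairwise cost that this abstract attributes to density peak clustering can be seen in a brute-force sketch of its two per-point quantities. This is a serial Python illustration under stated assumptions, not CUDA-DP: it uses 2D points, a hard cutoff distance `d_c` for the local density, and hypothetical names (`density_peaks`, `rho`, `delta`). Points with both a large `rho` and a large `delta` are the candidates for cluster centers.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) lists for the density peaks method, by brute force."""
    n = len(points)
    dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
    # rho_i: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        # delta_i: distance to the nearest point of strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest point has no such neighbour; use its largest distance.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta
```

Both loops visit every pair of points, so the work grows quadratically with the number of points; each point's `rho` and `delta` can nevertheless be computed independently, which is what makes the method a candidate for GPU threads.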
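An Improved Crank-Nicolson Algorithm for the B-S Model Based on GPU
20
Authors: 王文浩, 邬春学. 《上海理工大学学报》 [CAS, PKU Core], 2013, No. 2, pp. 147-151, 156 (6 pages)
Through theoretical analysis and mathematical treatment of the Black-Scholes model and the characteristics of its formula, an optimized Crank-Nicolson algorithm is presented that improves the efficiency of practical option trading. Using the GPU as the computing platform together with CUDA architecture technology, the effectiveness and applicability of the improved algorithm are verified. A side-by-side test on a CPU platform verifies the advantages of the GPU runtime environment. Experiments show that the improved algorithm delivers a marked improvement when run on the GPU platform, with gains in both computational accuracy and efficiency.
Keywords: financial option computation; B-S model; improved C-N algorithm; GPU; CUDA architecture; HPC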
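For readers unfamiliar with the baseline scheme, the plain Crank-Nicolson method can be sketched as follows. This is not the improved algorithm of the paper, whose details the abstract does not give. The sketch assumes the heat equation u_t = u_xx with zero boundary values, the form to which the Black-Scholes equation can be reduced by a standard change of variables, and the names `thomas` and `crank_nicolson_heat` are hypothetical.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system (sub-diagonal a, diagonal b, super-diagonal c)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def crank_nicolson_heat(u0, dx, dt, steps):
    """Advance u_t = u_xx with u = 0 at both ends; u0 includes the end points."""
    r = dt / (2.0 * dx * dx)
    u = list(u0)
    m = len(u) - 2  # number of interior unknowns
    a, b, c = [-r] * m, [1.0 + 2.0 * r] * m, [-r] * m
    for _ in range(steps):
        # Right-hand side: the explicit half of the time average.
        d = [(1.0 - 2.0 * r) * u[i] + r * (u[i - 1] + u[i + 1])
             for i in range(1, m + 1)]
        # Implicit half: one tridiagonal solve per time step.
        u = [0.0] + thomas(a, b, c, d) + [0.0]
    return u
```

The tridiagonal solve is inherently sequential along the grid, which is one reason a GPU version needs a reworked algorithm rather than a direct port.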