17 articles found
Compute Unified Device Architecture Implementation of Euler/Navier-Stokes Solver on Graphics Processing Unit Desktop Platform for 2-D Compressible Flows
1
Authors: Zhang Jiale, Chen Hongquan 《Transactions of Nanjing University of Aeronautics and Astronautics》 EI CSCD 2016, Issue 5, pp. 536-545 (10 pages)
A personal desktop platform with teraflops peak performance and thousands of cores can be realized at the price of a conventional workstation by using programmable graphics processing units (GPUs). A GPU-based parallel Euler/Navier-Stokes solver is developed for 2-D compressible flows using NVIDIA's Compute Unified Device Architecture (CUDA) programming model in the CUDA Fortran programming language. Techniques for implementing CUDA kernels, the double-layered thread hierarchy and the varied memory hierarchy are presented to form the GPU-based algorithm for the Euler/Navier-Stokes equations. The resulting parallel solver is validated by a set of typical test flow cases. The numerical results show that a speedup of dozens of times relative to a serial CPU implementation can be achieved using a single GPU desktop platform, which demonstrates that a GPU desktop can serve as a cost-effective parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
Keywords: graphics processing unit (GPU); GPU parallel computing; compute unified device architecture (CUDA) Fortran; finite volume method (FVM); acceleration
Multi-relaxation-time lattice Boltzmann simulations of lid driven flows using graphics processing unit
2
Authors: Chenggong LI, J.P.Y. MAA 《Applied Mathematics and Mechanics(English Edition)》 SCIE EI CSCD 2017, Issue 5, pp. 707-722 (16 pages)
Large eddy simulation (LES) using the Smagorinsky eddy viscosity model is added to the two-dimensional nine-velocity (D2Q9) lattice Boltzmann equation (LBE) with multi-relaxation-time (MRT) to simulate incompressible turbulent cavity flows with Reynolds numbers up to 1 × 10^7. To improve the computational efficiency of the LBM in numerical simulations of turbulent flows, the massively parallel computing power of a graphic processing unit (GPU) with the compute unified device architecture (CUDA) is introduced into the MRT-LBE-LES model. The model performs well compared with results from other studies, with a 76-fold increase in computational efficiency. It appears that the higher the Reynolds number is, the smaller the Smagorinsky constant should be if the lattice number is fixed. Also, for a selected high Reynolds number and a selected proper Smagorinsky constant, there is a minimum requirement for the lattice number so that the Smagorinsky eddy viscosity will not be excessively large.
Keywords: large eddy simulation (LES); multi-relaxation-time (MRT); lattice Boltzmann equation (LBE); two-dimensional nine velocity components (D2Q9); Smagorinsky model; graphic processing unit (GPU); compute unified device architecture (CUDA)
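The dependence on the Smagorinsky constant mentioned in this abstract can be illustrated with the textbook form of the model, nu_t = (C_s * delta)^2 * |S|. The sketch below is only an illustration under that assumption; the function name and arguments are hypothetical and are not taken from the paper, which couples the model to an MRT lattice Boltzmann scheme.

```python
import math

def smagorinsky_viscosity(c_s, delta, dudx, dudy, dvdx, dvdy):
    """Textbook Smagorinsky eddy viscosity for a 2-D velocity gradient.

    c_s   : Smagorinsky constant (hypothetical value chosen by the user)
    delta : filter width, e.g. one lattice spacing
    """
    s_xx = dudx
    s_yy = dvdy
    s_xy = 0.5 * (dudy + dvdx)
    # |S| = sqrt(2 * S_ij * S_ij), summed over all four tensor entries
    strain_mag = math.sqrt(2.0 * (s_xx ** 2 + s_yy ** 2 + 2.0 * s_xy ** 2))
    return (c_s * delta) ** 2 * strain_mag

# Eddy viscosity grows with the square of C_s for the same strain rate.
nu_small = smagorinsky_viscosity(0.1, 1.0, 0.0, 1.0, 0.0, 0.0)
nu_large = smagorinsky_viscosity(0.2, 1.0, 0.0, 1.0, 0.0, 0.0)
```

Because the eddy viscosity scales with the square of both the constant and the filter width, a fixed lattice leaves the constant as the main way to keep the viscosity from becoming excessively large.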
GPU-based leaves contour generation algorithm
3
Authors: 张景峤, 王廷婷 《Journal of Shanghai University(English Edition)》 CAS 2011, Issue 5, pp. 375-380 (6 pages)
Traditional contour generation algorithms are usually implemented and optimized for general-purpose processors. When processing high-resolution images, their efficiency is often low. A new graphics processing unit (GPU)-based algorithm is proposed to obtain clear and complete contours of leaves. First, the classic Sobel edge detection operator is implemented on the GPU. Then a simple and effective method is designed to remove fake edges, and a heuristic algorithm is used to repair broken edges. Experiments show that the results of the algorithm are natural and realistic in terms of morphology and can serve as good material for virtual plants.
Keywords: graphics processing unit (GPU); compute unified device architecture (CUDA); edge detection; contour generation
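As a rough illustration of the first step in this abstract, the sketch below applies the classic 3×3 Sobel kernels to a small image stored as nested lists. It is a serial, pure-Python stand-in rather than the authors' GPU code; the function name is hypothetical. Each output pixel depends only on its own 3×3 neighbourhood, which is why the operator maps naturally to one GPU thread per pixel.

```python
def sobel_magnitude(img):
    """Return the Sobel gradient magnitude of a 2-D list; border pixels stay 0."""
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal gradient kernel
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical gradient kernel
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += gx_k[dy][dx] * p
                    gy += gy_k[dy][dx] * p
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A vertical step edge: left half dark, right half bright.
step = [[0, 0, 1, 1] for _ in range(4)]
edges = sobel_magnitude(step)
```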
Hybrid domain multipactor prediction algorithm and its CUDA parallel implementation
4
Authors: WU Peiyu, XIE Yongjun, NIU Liqiang, JIANG Haolin 《Journal of Systems Engineering and Electronics》 SCIE EI CSCD 2020, Issue 6, pp. 1097-1104 (8 pages)
Based on the finite element method (FEM) in the frequency domain and the particle-in-cell approach in the time domain, a hybrid domain multipactor threshold prediction algorithm is proposed in this paper. The proposed algorithm combines the advantages of frequency domain and time domain algorithms, offering high computational accuracy together with considerable computational efficiency. In addition, the compute unified device architecture (CUDA) acceleration technique can also be employed to further enhance its simulation efficiency. Numerical examples are carried out to demonstrate the effectiveness of the proposed algorithm. The results indicate that the multipactor threshold can be accurately predicted and the computational efficiency can be improved.
Keywords: compute unified device architecture (CUDA); finite element method (FEM); hybrid domain; multipactor threshold prediction; particle-in-cell (PIC)
Graphic Processing Unit-Accelerated Neural Network Model for Biological Species Recognition
5
Authors: 温程璐, 潘伟, 陈晓熹, 祝青园 《Journal of Donghua University(English Edition)》 EI CAS 2012, Issue 1, pp. 5-8 (4 pages)
A graphic processing unit (GPU)-accelerated biological species recognition method using a partially connected neural evolutionary network model is introduced in this paper. The partially connected neural evolutionary network adopted in the paper can overcome the disadvantage of traditional neural networks with small inputs. The whole image is used as the input of the neural network, so the maximum number of features can be kept for recognition. To speed up the recognition process, a fast implementation of the partially connected neural network was developed on an NVIDIA Tesla C1060 using the NVIDIA compute unified device architecture (CUDA) framework. Image sets of eight biological species were obtained to test the GPU implementation and its serial CPU counterpart. The experimental results showed that the GPU implementation works effectively in terms of both recognition rate and speed, and achieved a speedup of 343 over the CPU implementation. Compared with feature-based recognition methods on the same recognition task, the method also achieved an acceptable correct rate of 84.6% when tested on eight biological species.
Keywords: graphic processing unit (GPU); compute unified device architecture (CUDA); neural network; species recognition
An enhanced GPU reduction at the warp-level
6
Authors: Hou Neng, He Fazhi, Zhou Yi 《Computer Aided Drafting,Design and Manufacturing》 2016, Issue 2, pp. 43-52 (10 pages)
In recent years, graphical processing unit (GPU)-accelerated intelligent algorithms have been widely utilized for solving combinatorial optimization problems, which are NP-hard. These intelligent algorithms involve a common operation, namely reduction, in which the most suitable candidate solution in the neighborhood is selected. Because reduction is one of the main procedures, it is necessary to optimize it on the GPU. In this paper, we propose an enhanced warp-based reduction on the GPU. Compared with existing block-based reduction methods, our method efficiently exploits the potential of implementation at the warp level, which better matches the characteristics of current GPU architectures. First, vectorized access is used to improve global memory access performance. Second, at the level of thread block reduction, an enhanced warp-based reduction on shared memory is presented to form partial results. Third, the number of thread blocks is obtained by maximizing the thread block size and the maximum number of threads per streaming multiprocessor on the GPU. Finally, the proposed method is evaluated on three generations of NVIDIA GPUs and shows better performance than previous methods.
Keywords: reduction; graphical processing unit; compute unified device architecture; warp-level reduction
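The general idea of reducing within a warp first and then combining per-warp partial results can be sketched as a serial simulation. This is not the enhanced method of the paper; it only mimics the common shuffle-down tree pattern over 32 lanes, with a minimum as the reduction operator (an assumed choice, standing in for "select the best candidate"). All names are hypothetical.

```python
WARP_SIZE = 32

def warp_reduce_min(lanes):
    """Simulate a shuffle-down tree reduction over one 32-lane warp."""
    if len(lanes) != WARP_SIZE:
        raise ValueError("a warp holds exactly 32 values")
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        prev = list(vals)  # every lane reads values from the previous step
        for lane in range(WARP_SIZE):
            partner = lane + offset
            if partner < WARP_SIZE:
                vals[lane] = min(prev[lane], prev[partner])
        offset //= 2
    return vals[0]  # lane 0 ends up holding the warp-wide minimum

def block_reduce_min(data):
    """Reduce warp-sized chunks, then reduce the per-warp partial results."""
    if not data:
        raise ValueError("nothing to reduce")
    partials = list(data)
    while True:
        # Pad with +inf so that every chunk fills a whole warp.
        partials += [float("inf")] * (-len(partials) % WARP_SIZE)
        partials = [warp_reduce_min(partials[i:i + WARP_SIZE])
                    for i in range(0, len(partials), WARP_SIZE)]
        if len(partials) == 1:
            return partials[0]

best = block_reduce_min(list(range(100, 0, -1)))
```

Five steps (offsets 16, 8, 4, 2, 1) suffice for 32 lanes, and on real hardware no block-wide barrier is needed inside a warp, which is the usual motivation for warp-level designs.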
Flexible devices: from materials, architectures to applications (Cited by: 8)
7
Authors: Mingzhi Zou, Yue Ma, Xin Yuan, Yi Hu, Jie Liu, Zhong Jin 《Journal of Semiconductors》 EI CAS CSCD 2018, Issue 1, pp. 135-152 (18 pages)
Flexible devices, such as flexible electronic devices and flexible energy storage devices, have attracted a significant amount of attention in recent years for their potential applications in modern life. The development of flexible devices is moving forward rapidly, as innovation in methods and manufacturing processes has greatly encouraged research on flexible devices. This review focuses on advanced materials, architecture designs and the many applications of flexible devices, and discusses the current problems and challenges facing flexible devices. We summarize the discovery of novel materials and the design of new architectures for improving the performance of flexible devices. Finally, we introduce the applications of flexible devices as key components in real life.
Keywords: flexible devices; flexible architectures; nanomaterials; stretchability
High-precision parallel computing model of solute transport based on GPU acceleration
8
Authors: Shang-hong Zhang, Rong-qi Zhang, Wen-da Li, Xi-yan Yang, Yang Zhou 《Journal of Hydrodynamics》 SCIE EI CSCD 2024, Issue 1, pp. 202-212 (11 pages)
Scenario simulation analysis of water environmental emergencies is very important for risk prevention and control and for emergency response. To quickly and accurately simulate the transport and diffusion of high-intensity pollutants during sudden water pollution events, this study proposes a high-precision pollution transport and diffusion model for unstructured grids based on the Compute Unified Device Architecture (CUDA). A finite volume method with a total variation diminishing limiter using the r-factor proposed by Kong is used to reduce numerical diffusion and oscillation errors when simulating pollutants with sharp concentration gradients, and graphics processing unit acceleration is used to improve computational efficiency. The advection-diffusion process of the model is verified numerically using two benchmark cases, and the efficiency of the model is evaluated using an engineering example. The results demonstrate that the model performs well in simulating material transport in the presence of sharp concentration gradients. Additionally, it has high computational efficiency: the acceleration ratio is 46 relative to the single-threaded original model. The efficiency of the accelerated model meets the requirements of engineering applications, and rapid early warning and assessment of water pollution accidents is achieved.
Keywords: pollution transport and diffusion model; parallel computing; Compute Unified Device Architecture (CUDA); pollution event
MoO_(x) and V_(2)O_(x) as hole and electron transport layers through functionalized intercalation in normal and inverted organic optoelectronic devices (Cited by: 2)
9
Authors: Xinchen Li, Fengxian Xie, Shaoqing Zhang, Jianhui Hou, Wallace CH Choy 《Light(Science & Applications)》 SCIE EI CAS CSCD 2015, Issue 1, pp. 415-421 (7 pages)
To achieve fabrication and cost competitiveness in organic optoelectronic devices, including organic solar cells (OSCs) and organic light-emitting diodes (OLEDs), it is desirable to have one type of material that can simultaneously function as both the electron and hole transport layers (ETLs and HTLs) of the organic devices in all device architectures (i.e., normal and inverted architectures). We address this issue by proposing and demonstrating Cs-intercalated metal oxides (with various Cs mole ratios) as both the ETL and HTL of an organic optoelectronic device with normal and inverted device architectures. Our results demonstrate that the new approach works well for the widely used transition metal oxides molybdenum oxide (MoO_(x)) and vanadium oxide (V_(2)O_(x)). Moreover, the Cs-intercalated metal-oxide-based ETL and HTL can be easily formed by a room-temperature, water-free, solution-based process. These conditions favor practical applications of OSCs and OLEDs. Notably, analyses with a Kelvin probe system show that Cs-intercalated metal oxides with transition metal (Mo or V)/Cs mole ratios ranging from 1:0 to 1:0.75 offer significant and continuous work function tuning of up to 1.31 eV, allowing them to function as both an ETL and an HTL. Consequently, our method of intercalated metal oxides can contribute to emerging large-scale and low-cost organic optoelectronic devices.
Keywords: metal oxides; carrier transport layers; normal and inverted device architectures; organic light-emitting diodes; organic solar cells; room-temperature solution process
HXPY: A High-Performance Data Processing Package for Financial Time-Series Data
10
Authors: 郭家栋, 彭靖姝, 苑航, 倪明选 《Journal of Computer Science & Technology》 SCIE EI CSCD 2023, Issue 1, pp. 3-24 (22 pages)
A tremendous amount of data is generated by global financial markets every day, and such time-series data needs to be analyzed in real time to explore its potential value. In recent years, we have witnessed the successful adoption of machine learning models on financial data, where the importance of accuracy and timeliness demands highly effective computing frameworks. However, traditional financial time-series data processing frameworks have shown performance degradation and adaptation issues, such as outlier handling for stock suspensions in Pandas and TA-Lib. In this paper, we propose HXPY, a high-performance data processing package with a C++/Python interface for financial time-series data. HXPY supports miscellaneous acceleration techniques such as streaming algorithms, vectorized instruction sets, and memory optimization, together with various functions such as time window functions, group operations, down-sampling operations, cross-section operations, row-wise or column-wise operations, shape transformations, and alignment functions. The results of benchmark and incremental analysis demonstrate the superior performance of HXPY compared with its counterparts. For data sizes from MiBs to GiBs, HXPY outperforms other in-memory dataframe computing rivals by up to hundreds of times.
Keywords: dataframe; time-series data; SIMD (single instruction multiple data); CUDA (Compute Unified Device Architecture)
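CUDA-based parallel JPCG algorithm for solving the equations of three-dimensional DDA (Cited by: 1)
11
Authors: 王占学, 杨军, 倪克松, 甯尤军 《岩石力学与工程学报》 (Chinese Journal of Rock Mechanics and Engineering) EI CAS CSCD PKU Core 2020, Issue 6, pp. 1231-1241 (11 pages)
The discontinuous deformation analysis (DDA) method has been widely applied in geotechnical engineering. Unlike two-dimensional DDA, three-dimensional DDA is better able to analyze practical problems of deformation and stability in jointed rock masses. The more complex contacts between blocks in 3D DDA, the large increase in the number of unknowns, and the management of data and memory in the program place higher demands on the stability and efficiency of solving the global equilibrium equations. With the successive over-relaxation (SOR) solver used in the original DDA program, an unsuitable choice of over-relaxation factor can prevent the solution of the equations from converging. Based on the GPU and the compute unified device architecture (CUDA) parallel computing architecture, a parallel Jacobi-preconditioned conjugate gradient (JPCG) solver for the global equilibrium equations of 3D DDA is implemented, and numerical examples demonstrate the acceleration obtained by combining the JPCG algorithm with GPU technology. Compared with the original serial SOR algorithm, the approach not only avoids the influence of the over-relaxation factor on convergence but also improves solution efficiency, creating favorable conditions for using 3D DDA to solve practical rock mechanics and engineering problems.
Keywords: rock mechanics; three-dimensional discontinuous deformation analysis method; compute unified device architecture (CUDA); Jacobi-preconditioned conjugate gradient method; successive over-relaxation algorithm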
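The Jacobi-preconditioned conjugate gradient method named in this abstract can be sketched in a few lines for a small dense symmetric positive definite system. This is a serial textbook version, not the paper's CUDA solver; the function name, tolerance and dense list-of-lists storage are assumptions made for illustration. Unlike SOR, it has no relaxation factor to tune.

```python
def jpcg(a, b, tol=1e-10, max_iter=1000):
    """Solve a x = b for symmetric positive definite a (list of lists)."""
    n = len(b)

    def matvec(v):
        return [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / a[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                   # residual b - a x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

solution = jpcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

The matrix-vector product, the dot products and the vector updates are the parts that parallelize well, and the preconditioner is just an element-wise scaling.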
Hybrid Parallel Bundle Adjustment for 3D Scene Reconstruction with Massive Points (Cited by: 4)
12
Authors: 刘鑫, 高伟, 胡占义 《Journal of Computer Science & Technology》 SCIE EI CSCD 2012, Issue 6, pp. 1269-1280 (12 pages)
Bundle adjustment (BA) is a crucial but time-consuming step in 3D reconstruction. In this paper, we tackle a special class of BA problems in which the reconstructed 3D points are much more numerous than the camera parameters, called Massive-Points BA (MPBA) problems. This is often the case when high-resolution images are used. We present the design and implementation of a new bundle adjustment algorithm for efficiently solving MPBA problems. The use of hardware parallelism, on multi-core CPUs as well as GPUs, is explored. Through careful memory-usage design, the graphics memory limitation is effectively alleviated. Several modern acceleration strategies for bundle adjustment, such as mixed-precision arithmetic, embedded point iteration, and preconditioned conjugate gradients, are explored and compared. Using several high-resolution image datasets, we generate a variety of MPBA problems, with which the performance of five bundle adjustment algorithms is evaluated. The experimental results show that our algorithm is up to 40 times faster than classical Sparse Bundle Adjustment while maintaining comparable precision.
Keywords: sparse bundle adjustment; GPU; compute unified device architecture; structure from motion
Challenges of 22 nm and beyond CMOS technology (Cited by: 8)
13
Authors: HUANG Ru, WU HanMing, KANG JinFeng, XIAO DeYuan, SHI XueLong, AN Xia, TIAN Yu, WANG RunSheng, ZHANG LiangLiang, ZHANG Xing, WANG YangYuan 《Science in China(Series F)》 2009, Issue 9, pp. 1491-1533 (43 pages)
It is predicted that CMOS technology will probably enter the 22 nm node around 2012. Scaling CMOS logic technology from the 32 nm to the 22 nm node faces more critical issues and requires significant changes to the technology, as well as the integration of advanced processes. This paper reviews the key processing technologies that can potentially be integrated into the 22 nm and later technology nodes, including double patterning technology with high-NA water immersion lithography and EUV lithography, new device architectures, high-K/metal gate (HK/MG) stack and integration technology, mobility enhancement technologies, source/drain engineering, and advanced copper interconnect technology with an ultra-low-k process.
Keywords: CMOS technology; 22 nm technology node; device architectures; metal gate/high-K dielectrics; ultra-low-K dielectrics
Fast OBJ file importing and parsing in CUDA (Cited by: 2)
14
Authors: Aidan L. Possemiers, Ickjai Lee 《Computational Visual Media》 2015, Issue 3, pp. 229-238 (10 pages)
Alias-Wavefront OBJ meshes are a common text file type for transferring 3D mesh data between applications made by different vendors. However, as meshes become more complex and denser, the files become larger and slower to import. This paper explores the use of GPUs to accelerate the importing and parsing of OBJ files by studying file read time, runtime, and load resistance. We propose a new method of reading and parsing that circumvents GPU architecture limitations and improves performance, with the new GPU method outperforming CPU methods by a 6× to 8× speedup. When running on a heavily loaded system, the new method received only an 80% performance hit, compared to the 160% that the CPU methods received. The loaded GPU speedup compared to unloaded CPU methods was 3.5×, and, when compared to loaded CPU methods, 8×. These results demonstrate that the time is right for further research into the use of data-parallel GPU acceleration beyond computer graphics and high-performance computing.
Keywords: parsing; OBJ; vertex buffer object (VBO); general-purpose programming on the graphics processing unit (GPGPU); compute unified device architecture (CUDA)
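For readers unfamiliar with the OBJ text format discussed in this abstract, a minimal serial parser is sketched below. It handles only `v` (vertex position) and `f` (face) records and is an illustration of the file layout, not the paper's GPU method; the function name is hypothetical.

```python
def parse_obj(text):
    """Parse vertex positions and face vertex indices from OBJ text."""
    vertices, faces = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if parts[0] == "v":
            vertices.append(tuple(float(c) for c in parts[1:4]))
        elif parts[0] == "f":
            # A face token is v, v/vt, v//vn or v/vt/vn; keep the vertex index.
            idx = [int(tok.split("/")[0]) for tok in parts[1:]]
            # OBJ indices are 1-based; negative ones count back from the end.
            faces.append(tuple(i - 1 if i > 0 else len(vertices) + i
                               for i in idx))
    return vertices, faces

sample = """# one triangle, referenced twice
v 0 0 0
v 1 0 0
v 0 1 0
f 1/1/1 2/2/1 3/3/1
f -3 -2 -1
"""
verts, tris = parse_obj(sample)
```

Records have variable length and face indices can refer back to earlier lines, so a file cannot simply be cut into equal-sized chunks and parsed independently, which is part of what makes a data-parallel parser non-trivial.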
Comparison of Parallelization Strategies for Min-Sum Decoding of Irregular LDPC Codes (Cited by: 1)
15
Authors: Hua Xu, Wei Wan, Wei Wang, Jun Wang, Jiadong Yang, Yun Wen 《Tsinghua Science and Technology》 SCIE EI CAS 2013, Issue 6, pp. 577-587 (11 pages)
Low-Density Parity-Check (LDPC) codes are powerful error correcting codes. LDPC decoders have been implemented as efficient error correction codes on dedicated VLSI hardware architectures in recent years. This paper describes two strategies to parallelize min-sum decoding of irregular LDPC codes. The first implements min-sum LDPC decoders on multicore platforms using OpenMP, while the other uses the Compute Unified Device Architecture (CUDA) to parallelize LDPC decoding on Graphics Processing Units (GPUs). Empirical studies on data of various scales show that the performance of these decoding processes is improved by these parallel strategies and that the GPUs provide a more efficient and faster decoder implementation.
Keywords: Low-Density Parity-Check (LDPC) codes; multicore; OpenMP; Graphic Processor Unit (GPU); Compute Unified Device Architecture (CUDA)
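The core of min-sum decoding is the check-node update, sketched below in its plain textbook form: the message sent back to each variable node is the product of the signs and the minimum magnitude of the other incoming log-likelihood ratios. This is a generic illustration, not code from the paper; the function name is hypothetical and at least two incoming messages are assumed.

```python
def check_node_update(incoming):
    """Min-sum update for one check node.

    incoming: LLR messages from the variable nodes attached to this check
    (the list length is the check-node degree, which varies in irregular codes).
    """
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]   # exclude the target edge
        sign = 1.0
        for m in others:
            if m < 0:
                sign = -sign
        outgoing.append(sign * min(abs(m) for m in others))
    return outgoing

messages = check_node_update([2.0, -1.0, 3.0])
```

Every check node can be updated independently of the others within an iteration, which is the parallelism that both a multicore and a GPU decoder can exploit; varying node degrees make the per-node work uneven.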
High-performance solutions of geographically weighted regression in R (Cited by: 1)
16
Authors: Binbin Lu, Yigong Hu, Daisuke Murakami, Chris Brunsdon, Alexis Comber, Martin Charlton, Paul Harris 《Geo-Spatial Information Science》 SCIE EI CSCD 2022, Issue 4, pp. 536-549 (14 pages)
As an established spatial analytical tool, Geographically Weighted Regression (GWR) has been applied across a variety of disciplines. However, its usage can be challenging for large datasets, which are increasingly prevalent in today's digital world. In this study, we propose two high-performance R solutions for GWR via Multi-core Parallel (MP) and Compute Unified Device Architecture (CUDA) techniques, respectively GWR-MP and GWR-CUDA. We compared GWR-MP and GWR-CUDA with three existing solutions available in Geographically Weighted Models (GWmodel), Multi-scale GWR (MGWR) and Fast GWR (FastGWR). Results showed that all five solutions perform differently across varying sample sizes, with no single solution a clear winner in terms of computational efficiency. Specifically, the solutions given in GWmodel and MGWR provided acceptable computational costs for GWR studies with a relatively small sample size. For a large sample size, GWR-MP and FastGWR provided coherent solutions on a Personal Computer (PC) with a common multi-core configuration, and GWR-MP provided more efficient computing capacity for each core or thread than FastGWR. For cases when the sample size was very large, and for these cases only, GWR-CUDA provided the most efficient solution, although its I/O cost with small samples should be noted. In summary, GWR-MP and GWR-CUDA provided complementary high-performance R solutions to existing ones, and for certain data-rich GWR studies they should be preferred.
Keywords: non-stationarity; big data; parallel computing; Compute Unified Device Architecture (CUDA); Geographically Weighted models (GWmodel)
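What makes GWR expensive, and also easy to parallelize, is that a separate weighted regression is fitted at every location. The sketch below shows one such local fit with a Gaussian distance kernel and a single predictor. It is a simplified illustration in Python rather than R, with one-dimensional coordinates and hypothetical names; it is not the GWR-MP or GWR-CUDA implementation.

```python
import math

def gwr_local_fit(coords, x, y, target, bandwidth):
    """Weighted least squares fit of y = b0 + b1 * x around one target location.

    coords    : 1-D positions of the observations (real GWR uses 2-D distances)
    bandwidth : Gaussian kernel bandwidth, in the same units as coords
    Returns (intercept, slope) for the target location.
    """
    w = [math.exp(-0.5 * ((c - target) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

positions = [0.0, 1.0, 2.0, 3.0, 4.0]
predictor = [1.0, 3.0, 2.0, 5.0, 4.0]
response = [2.0 * v + 1.0 for v in predictor]  # exact global relationship
intercept, slope = gwr_local_fit(positions, predictor, response, 2.0, 1.5)
```

Fits at different target locations share the data but not their results, so they can be distributed across CPU cores or GPU threads without communication between them.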
Graphic Processing Unit Based Phase Retrieval and CT Reconstruction for Differential X-Ray Phase Contrast Imaging
17
Authors: 陈晓庆, 王宇杰, 孙建奇 《Journal of Shanghai Jiaotong University(Science)》 EI 2014, Issue 5, pp. 550-554 (5 pages)
Compared with conventional X-ray absorption imaging, X-ray phase-contrast imaging shows higher contrast on samples with low attenuation coefficients, such as blood vessels and soft tissues. Among the modalities of phase-contrast imaging, grating-based phase-contrast imaging has been widely accepted owing to its wide range of sample selection and because it does not require a coherent source. However, the downside is the substantially larger amount of data generated by the phase-stepping method, which slows down the reconstruction process. The graphic processing unit (GPU) has the advantage of allowing parallel computing, which is very useful for processing large quantities of data. In this paper, a compute unified device architecture (CUDA) C program running on the GPU is introduced to accelerate the phase retrieval and filtered back projection (FBP) algorithms for grating-based tomography. Depending on the size of the data, the CUDA C program shows different amounts of speedup over the standard C program on the same Visual Studio 2010 platform, and the speedup ratio increases as the size of the data increases.
Keywords: grating-based phase contrast imaging; parallel computing; graphic processing unit (GPU); compute unified device architecture (CUDA); filtered back projection (FBP)