3 articles found
A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs (cited by: 1)
1
Authors: Yang Yang, Hui-Min Cui, +1 more author, Xiao-Bing Feng, Jing-Ling Xue. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2012, No. 1, pp. 57-74 (18 pages)
In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning multiple time steps in stencil computations, so that circular queues can be implemented effectively with both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93X over methods that use circular queues implemented with shared memory only.
Keywords: stencil computation, circular queue, GPU, occupancy, register
HW/SW Co-optimization for Stencil Computation: Beginning with a Customizable Core
2
Authors: Yanhua Li, Youhui Zhang, Weiming Zheng. Tsinghua Science and Technology (SCIE, EI, CAS, CSCD), 2016, No. 5, pp. 570-580 (11 pages)
Energy efficiency is one of the most important issues for High Performance Computing (HPC) today. A heterogeneous HPC platform with some energy-efficient customizable cores (as application-specific accelerators) is believed to be one of the promising solutions to meet ever-increasing computing needs and to overcome power density limitations. In this paper, we focus on using customizable processor cores to optimize typical stencil computations, the kernel of many high-performance applications. We develop a series of effective software/hardware co-optimization strategies to exploit instruction-level and memory-computation parallelism, as well as to decrease energy consumption. These optimizations include loop tiling, prefetching, cache customization, Single Instruction Multiple Data (SIMD), and Direct Memory Access (DMA), as well as the necessary ISA extensions. Detailed tests of power efficiency are given to evaluate the effect of all these optimizations comprehensively. The results are impressive: the combination of these optimizations improved application performance by 341% while decreasing energy consumption by 35%; a preliminary comparison with x86, GPU, and FPGA platforms also showed that the design could achieve an order of magnitude higher performance efficiency. We believe this work can help in understanding the sources of inefficiency in general-purpose chips and can serve as a starting point for customizing an energy-efficient CMP for further improvement.
Keywords: energy efficiency, customizable processor, stencil computation, software and hardware co-optimization
A case study of 3D RTM-TTI algorithm on multicore and many-core platforms
3
Authors: 张秀霞, Tan Guangming, +1 more author, Chen Mingyu, Yao Erlin. High Technology Letters (EI, CAS), 2017, No. 2, pp. 185-190 (6 pages)
3D reverse time migration in tilted transversely isotropic media (3D RTM-TTI) is the most precise model for complex seismic imaging. However, the vast computing time of 3D RTM-TTI prevents it from being widely used. This is addressed by providing parallel solutions for 3D RTM-TTI on multicore and many-core platforms. After data parallelism and memory optimization, the hot-spot function of 3D RTM-TTI gains a 35.99X speedup on two Intel Xeon CPUs, an 89.75X speedup on one Intel Xeon Phi, and an 89.92X speedup on one NVIDIA K20 GPU, compared with the serial CPU baseline. This study makes RTM-TTI practical in industry. Since the computation pattern in RTM is a stencil, the approaches also benefit a wide range of stencil-based applications.
Keywords: 3D RTM-TTI, Intel Xeon Phi, NVIDIA K20 GPU, stencil computing, manycore, multicore, seismic imaging