Computation reuse is known as an effective optimization technique. However, due to the complexity of modern GPU architectures, there is yet not enough understanding regarding the intriguing implications of the interpl...Computation reuse is known as an effective optimization technique. However, due to the complexity of modern GPU architectures, there is yet not enough understanding regarding the intriguing implications of the interplay of compu- ration reuse and hardware specifics on application performance. In this paper, we propose an automatic code generator for a class of stencil codes with inherent computation reuse on CPUs. For such applications, the proper reuse of intermediate results, combined with careful register and on-chip local memory usage, has profound implications on performance. Current state of the art does not address this problem in depth, partially due to the lack of a good program representation that can expose all potential computation reuse. In this paper, we leverage the computation overlap graph (COG), a simple representation of data dependence and data reuse with "element view", to expose potential reuse opportunities. Using COG, we propose a portable code generation and tuning framework for GPUs. Compared with current state-of-the-art code generators, our experimental results show up to 56.7% performance improvement on modern GPUs such as NVIDIA C2050.展开更多
基金This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2012AA010902, and the National Natural Science Foundation of China under Grant No. 61303059.
文摘Computation reuse is known as an effective optimization technique. However, due to the complexity of modern GPU architectures, there is yet not enough understanding regarding the intriguing implications of the interplay of compu- ration reuse and hardware specifics on application performance. In this paper, we propose an automatic code generator for a class of stencil codes with inherent computation reuse on CPUs. For such applications, the proper reuse of intermediate results, combined with careful register and on-chip local memory usage, has profound implications on performance. Current state of the art does not address this problem in depth, partially due to the lack of a good program representation that can expose all potential computation reuse. In this paper, we leverage the computation overlap graph (COG), a simple representation of data dependence and data reuse with "element view", to expose potential reuse opportunities. Using COG, we propose a portable code generation and tuning framework for GPUs. Compared with current state-of-the-art code generators, our experimental results show up to 56.7% performance improvement on modern GPUs such as NVIDIA C2050.