Recent years have witnessed a processor development trend that integrates the central processing unit (CPU) and the graphics processing unit (GPU) into a single chip. The integration helps to save some of the host-device data copying that a discrete GPU usually requires, but also introduces deep resource sharing and possible interference between CPU and GPU. This work investigates the performance implications of independently co-running CPU and GPU programs on these platforms. First, we perform a comprehensive measurement that covers a wide variety of factors, including processor architectures, operating systems, benchmarks, timing mechanisms, inputs, and power management schemes. These measurements reveal a number of surprising observations. We analyze these observations and produce a list of novel insights, including the important roles of operating system (OS) context switching and power management in determining program performance, and the subtle effect of CPU-GPU data copying. Finally, we confirm those insights through case studies, and point out some promising directions to mitigate anomalous performance degradation on integrated heterogeneous processors.
The emerging integrated CPU-GPU architectures make it practical for short computational kernels to use GPU acceleration. Evidence has shown that, on such systems, the GPU control responsiveness (how soon the host program finds out about the completion of a GPU kernel) is essential for the overall performance. This study identifies the GPU responsiveness dilemma: host busy polling responds quickly, but at the expense of high energy consumption and interference with co-running CPU programs; interrupt-based notification minimizes energy and CPU interference costs, but suffers from substantial response delay. We present a program-level solution that wakes up the host program in anticipation of GPU kernel completion. We systematically explore the design space of an anticipatory wakeup scheme through a timer-delayed wakeup or kernel splitting-based pre-completion notification. Experiments show that our proposed technique can achieve the best of both worlds, high responsiveness with low power and CPU costs, for a wide range of GPU workloads.
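The trade-off among the three waiting strategies named in this abstract can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the `WaitCost` type, and every parameter (`kernel_us`, `predicted_us`, `margin_us`, `wake_latency_us`) are hypothetical, and the model assumes that a sleeping host needs a fixed `wake_latency_us` to resume after either an interrupt or a timer expiry.

```python
from dataclasses import dataclass


@dataclass
class WaitCost:
    response_delay_us: float  # time from kernel completion until the host notices
    cpu_busy_us: float        # time the host core spends spinning


def busy_poll(kernel_us: float) -> WaitCost:
    # The host spins for the whole kernel: instant response, maximal CPU use.
    return WaitCost(response_delay_us=0.0, cpu_busy_us=kernel_us)


def interrupt_wait(kernel_us: float, wake_latency_us: float) -> WaitCost:
    # The host sleeps until notified: no spinning, but it pays the wakeup latency.
    return WaitCost(response_delay_us=wake_latency_us, cpu_busy_us=0.0)


def timer_delayed_wakeup(kernel_us: float, predicted_us: float,
                         margin_us: float, wake_latency_us: float) -> WaitCost:
    # Assumed scheme: arm a timer to fire margin_us before the predicted
    # completion time, then poll once the host is running again.
    timer_us = max(predicted_us - margin_us, 0.0)
    awake_at_us = timer_us + wake_latency_us
    if awake_at_us <= kernel_us:
        # Woke early: spin only for the remaining time, then respond at once.
        return WaitCost(response_delay_us=0.0, cpu_busy_us=kernel_us - awake_at_us)
    # Woke late (kernel finished sooner than predicted): the first poll sees it.
    return WaitCost(response_delay_us=awake_at_us - kernel_us, cpu_busy_us=0.0)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the paper.
    print(busy_poll(1000.0))
    print(interrupt_wait(1000.0, wake_latency_us=50.0))
    print(timer_delayed_wakeup(1000.0, predicted_us=1000.0,
                               margin_us=100.0, wake_latency_us=50.0))
```

In this model the anticipatory scheme matches busy polling on response delay while spinning only for the slack between wakeup and completion. The benefit depends on how well `predicted_us` tracks the real kernel time: an over-prediction turns directly into response delay.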
Funding: We thank the anonymous referees for their constructive comments. This material was based upon work supported by a DOE Early Career Award, the National Science Foundation (NSF) (1455404 and 1525609), and an NSF CAREER Award. This work is also supported in part by the NSF (CNS-1217372, CNS-1239423, CCF-1255729, CNS-1319353, and CNS-1319417) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61272143, 61272144, and 61472431). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF, or NSFC.
Funding: We thank the anonymous referees for their constructive comments. This material is based upon work supported by a DOE Early Career Award (DE-SC0013700) and the National Science Foundation (NSF) (1455404, 1455733 (CAREER), 1525609, 1464216, and 1618912). This work is also supported in part by the National Natural Science Foundation of China (NSFC) (Grant Nos. 61272143, 61272144, and 61472431) and the National Science and Technology Major Project (NSTMP) (2017ZX01028-101). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF, NSFC, or NSTMP.