In analytical queries, a number of important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when the data size grows to hundreds of gigabytes, a single GPU card becomes insufficient due to the limited capacity of global memory and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases cost. We utilize unified memory (UM), provided by NVIDIA CUDA (Compute Unified Device Architecture), to make it possible to accelerate large-scale queries on a single GPU, but we observe that the transfer performance between host and UM, which occurs before kernel execution, is often significantly lower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread, or a static number of threads, for the data copy, leading to an inefficient transfer that heavily slows down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x the performance of the baseline system on average and 2.09x at maximum.
Funding: supported by the National Natural Science Foundation of China (Grant Nos. 61732014 and 62141214) and the National Key Research and Development Program of China (2018YFB1003400).