Reducing the Memory Cost of Training Convolutional Neural Networks by CPU Offloading

Abstract: In recent years, Convolutional Neural Networks (CNNs) have enabled unprecedented progress on a wide range of computer vision tasks. However, training large CNNs is a resource-intensive task that requires specialized Graphics Processing Units (GPUs) and highly optimized implementations to get optimal performance from the hardware. GPU memory is a major bottleneck of the CNN training procedure, limiting the size of both inputs and model architectures. In this paper, we propose to alleviate this memory bottleneck by leveraging an under-utilized resource of modern systems: the device-to-host bandwidth. Our method, termed CPU offloading, works by transferring hidden activations to the CPU as soon as they are computed, in order to free GPU memory for upstream layer computations during the forward pass. These activations are then transferred back to the GPU as they are needed by the gradient computations of the backward pass. The key challenge of our method is to overlap data transfers and computations efficiently, so as to minimize the wall-time overhead induced by the additional data transfers. On a typical workstation with an Nvidia Titan X GPU, we show that our method compares favorably to gradient checkpointing: it reduces the memory consumption of training a VGG19 model by 35% with a minimal additional wall-time overhead of 21%. Further experiments detail the impact of the different optimization tricks we propose. Our method is orthogonal to other memory reduction techniques such as quantization and sparsification, so it can easily be combined with them for further optimization.
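The offloading schedule described in the abstract can be illustrated with a small accounting sketch. This is not the authors' implementation: the function name `peak_gpu_memory`, the activation sizes, and the simplifications (only activations are counted, weights and gradients are ignored, transfers are assumed to be instantaneous and perfectly overlapped with computation) are all assumptions made for illustration.

```python
def peak_gpu_memory(activation_sizes, offload):
    """Peak simulated GPU memory (arbitrary units) for one training step.

    Toy model (an assumption, not the paper's code): layer i produces
    activation i, and activation i is needed again when the backward
    pass reaches layer i. Weights and gradients are not counted.
    """
    on_gpu = {}   # activation index -> size, resident in GPU memory
    on_host = {}  # activation index -> size, offloaded to CPU memory
    peak = 0

    def track():
        nonlocal peak
        peak = max(peak, sum(on_gpu.values()))

    # Forward pass: compute each activation on the GPU.
    for i, size in enumerate(activation_sizes):
        on_gpu[i] = size
        track()
        if offload and i > 0:
            # The input of layer i is no longer needed in the forward
            # pass, so it is moved to the host to free GPU memory.
            on_host[i - 1] = on_gpu.pop(i - 1)

    # Backward pass: visit layers in reverse order.
    for i in reversed(range(len(activation_sizes))):
        if i in on_host:
            on_gpu[i] = on_host.pop(i)  # transfer back to the GPU
        track()
        del on_gpu[i]  # gradient computed, activation can be freed

    return peak


sizes = [4, 4, 2, 2, 1]  # hypothetical activation sizes per layer
print(peak_gpu_memory(sizes, offload=False))  # 13
print(peak_gpu_memory(sizes, offload=True))   # 8
```

Without offloading, every activation stays resident until the backward pass, so the peak is the sum of all sizes. With offloading, at most two consecutive activations are resident at once in this toy model. Real savings are smaller, since a real schedule must also account for weights, gradients, and transfers that cannot be fully hidden behind computation.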

Authors: Tristan Hascoet, Weihao Zhuang, Quentin Febvre, Yasuo Ariki, Tetsuya Takiguchi

Affiliations: Kobe University; Sicara

Source: Journal of Software Engineering and Applications, 2019, Issue 8, pp. 307-320 (14 pages)

Keywords: Deep Learning; CNN; Optimization

Classification code: R73 [Medicine and Health: Oncology]
