DOI: 10.1145/3178487.3178528
Poster

Layrub: layer-centric GPU memory reuse and data migration in extreme-scale deep learning systems

Published: 10 February 2018

Abstract

The growing accuracy and robustness of Deep Neural Network (DNN) models are accompanied by growing model capacity (going deeper or wider). However, the high memory requirements of these models make it difficult to execute the training process on a single GPU. To address this, we first identify the memory usage characteristics of deep and wide convolutional networks, and demonstrate opportunities for memory reuse at both the intra-layer and inter-layer levels. We then present Layrub, a runtime data placement strategy that orchestrates the execution of the training process. It achieves layer-centric reuse to reduce memory consumption, enabling extreme-scale deep learning models that otherwise cannot be trained on a single GPU.
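
The sketch below is a minimal, hypothetical illustration of the two ideas the abstract names, inter-layer buffer reuse and migration of activations off the GPU, and is not the authors' Layrub implementation. BufferPool and forward_pass are invented names, and NumPy arrays stand in for GPU buffers.

    # Hypothetical sketch (not the authors' Layrub code): a layer's activation
    # buffer is released as soon as the next layer has consumed it, so a later
    # layer can reuse it; a host-side copy stands in for data migration.
    import numpy as np

    class BufferPool:
        """Hands out equally-shaped scratch buffers instead of allocating one per layer."""
        def __init__(self):
            self.free = {}  # shape -> list of released buffers available for reuse

        def acquire(self, shape):
            spares = self.free.get(shape, [])
            return spares.pop() if spares else np.empty(shape, dtype=np.float32)

        def release(self, buf):
            self.free.setdefault(buf.shape, []).append(buf)

    def forward_pass(layer_shapes, pool):
        """Forward pass that migrates each consumed activation to host memory
        and returns its device buffer to the pool for reuse by a later layer."""
        host_copies, prev = [], None
        for shape in layer_shapes:
            act = pool.acquire(shape)            # "device" activation buffer (reused if possible)
            act.fill(1.0)                        # stand-in for the layer's forward computation
            if prev is not None:
                host_copies.append(prev.copy())  # migrate the old activation to host memory
                pool.release(prev)               # its device buffer can now serve a later layer
            prev = act
        return host_copies                       # restored in reverse order during backward

    pool = BufferPool()
    forward_pass([(32, 64, 56, 56)] * 4, pool)   # four layers, at most two device buffers live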

      Published In

      PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
      February 2018, 442 pages
      ISBN: 9781450349826
      DOI: 10.1145/3178487

      Also published in ACM SIGPLAN Notices, Volume 53, Issue 1 (PPoPP '18), January 2018, 426 pages
      ISSN: 0362-1340, EISSN: 1558-1160
      DOI: 10.1145/3200691

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. DNN
      2. GPU
      3. data placement
      4. memory efficiency

      Qualifiers

      • Poster

      Funding Sources

      • National Natural Science Foundation of China

      Conference

      PPoPP '18

      Acceptance Rates

      Overall Acceptance Rate 230 of 1,014 submissions, 23%
