DOI: 10.1145/3297858.3304072

Astra: Exploiting Predictability to Optimize Deep Learning

Published: 04 April 2019

Abstract

We present Astra, a compilation and execution framework that optimizes execution of a deep learning training job. Instead of treating the computation as a generic data flow graph, Astra exploits domain knowledge about deep learning to adopt a custom approach to compiler optimization. The key insight in Astra is to exploit the unique repetitiveness and predictability of a deep learning job, to perform online exploration of the optimization state space in a work-conserving manner while making progress on the training job. This dynamic state space exploration in Astra uses lightweight profiling and indexing of profile data, coupled with several techniques to prune the exploration state space. Effectively, the execution layer custom-wires the infrastructure end-to-end for each job and hardware, while keeping the compiler simple and maintainable. We have implemented Astra in two popular deep learning frameworks, PyTorch and Tensorflow. On state-of-the-art deep learning models, we show that Astra improves end-to-end performance of deep learning training by up to 3x, while approaching the performance of hand-optimized implementations such as cuDNN where available. Astra also significantly outperforms static compilation frameworks such as Tensorflow XLA both in performance and robustness.
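
The core mechanism described above, work-conserving online exploration driven by lightweight profiling and pruning, can be illustrated with a small sketch. The Python code below is not Astra's implementation; every name in it (candidate_variants, train_with_online_exploration, min_samples, prune_margin) is hypothetical and chosen only for illustration. The idea it shows: each training step executes exactly one surviving candidate implementation of an operator, so the job always makes progress, while per-candidate timings are recorded in a simple profile index and clearly slower candidates are pruned until a winner remains.

import random
import time

def candidate_variants():
    # Stand-ins for alternative implementations of the same operator
    # (e.g. different kernel schedules or fusion choices). Purely illustrative.
    return {
        "loop":  lambda xs: sum(v * v for v in xs),
        "map":   lambda xs: sum(map(lambda v: v * v, xs)),
        "index": lambda xs: sum(xs[i] * xs[i] for i in range(len(xs))),
    }

def train_with_online_exploration(num_steps=300, min_samples=5, prune_margin=1.2):
    variants = candidate_variants()
    profile = {name: [] for name in variants}        # timing index, one list per candidate
    batch = [random.random() for _ in range(10_000)] # dummy mini-batch

    def median(samples):
        return sorted(samples)[len(samples) // 2]

    for step in range(num_steps):
        # Work-conserving: every step performs the real training work; only the
        # choice of which surviving implementation carries it out varies.
        names = list(variants)
        name = names[step % len(names)]
        start = time.perf_counter()
        _loss = variants[name](batch)                # the step's actual computation
        profile[name].append(time.perf_counter() - start)

        # Prune candidates that are clearly slower than the current best median.
        if len(variants) > 1 and all(len(profile[n]) >= min_samples for n in variants):
            best_time = min(median(profile[n]) for n in variants)
            variants = {n: f for n, f in variants.items()
                        if median(profile[n]) <= prune_margin * best_time}

    return next(iter(variants))  # the selected (or first remaining) candidate

if __name__ == "__main__":
    print("selected variant:", train_with_online_exploration())

In a real system of the kind the abstract describes, the candidates would be alternative schedules produced by the compiler rather than Python lambdas, and the profile data would presumably be indexed by operator, tensor shape, and hardware so that the selection is custom to each job and device.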

Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN:9781450362405
DOI:10.1145/3297858
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptation
  2. deep learning
  3. domain-specific compiler

Qualifiers

  • Research-article

Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 63
  • Downloads (last 6 weeks): 8
Reflects downloads up to 14 Feb 2025

Cited By
  • (2024) DeepTM: Efficient Tensor Management in Heterogeneous Memory for DNN Training. IEEE Transactions on Parallel and Distributed Systems, 35(11):1920-1935. DOI: 10.1109/TPDS.2024.3431910. Online publication date: Nov-2024.
  • (2024) MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1109/SC41406.2024.00040. Online publication date: 17-Nov-2024.
  • (2024) How Useful is Communication Scheduling for Distributed Training? 2024 International Scientific and Technical Conference Modern Computer Network Technologies (MoNeTeC), pages 1-13. DOI: 10.1109/MoNeTec60984.2024.10768125. Online publication date: 29-Oct-2024.
  • (2024) SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer. The Journal of Supercomputing. DOI: 10.1007/s11227-024-05890-8. Online publication date: 11-Mar-2024.
  • (2023) BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach. Proceedings of the ACM on Management of Data, 1(3):1-29. DOI: 10.1145/3617327. Online publication date: 13-Nov-2023.
  • (2023) Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling. Proceedings of the 29th Symposium on Operating Systems Principles, pages 642-657. DOI: 10.1145/3600006.3613175. Online publication date: 23-Oct-2023.
  • (2023) EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-14. DOI: 10.1145/3581784.3607054. Online publication date: 12-Nov-2023.
  • (2023) ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 266-280. DOI: 10.1145/3575693.3575721. Online publication date: 27-Jan-2023.
  • (2023) Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1113-1126. DOI: 10.1109/HPCA56546.2023.10071018. Online publication date: Feb-2023.
  • (2022) Collage. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 517-529. DOI: 10.1145/3559009.3569651. Online publication date: 8-Oct-2022.