DOI: 10.1145/3297858.3304072

Astra: Exploiting Predictability to Optimize Deep Learning

Published: 04 April 2019


We present Astra, a compilation and execution framework that optimizes execution of a deep learning training job. Instead of treating the computation as a generic data flow graph, Astra exploits domain knowledge about deep learning to adopt a custom approach to compiler optimization. The key insight in Astra is to exploit the unique repetitiveness and predictability of a deep learning job to perform online exploration of the optimization state space in a work-conserving manner while making progress on the training job. This dynamic state space exploration in Astra uses lightweight profiling and indexing of profile data, coupled with several techniques to prune the exploration state space. Effectively, the execution layer custom-wires the infrastructure end-to-end for each job and hardware, while keeping the compiler simple and maintainable. We have implemented Astra in two popular deep learning frameworks, PyTorch and TensorFlow. On state-of-the-art deep learning models, we show that Astra improves end-to-end performance of deep learning training by up to 3x, while approaching the performance of hand-optimized implementations such as cuDNN where available. Astra also significantly outperforms static compilation frameworks such as TensorFlow XLA in both performance and robustness.
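The idea of work-conserving online exploration can be illustrated with a minimal sketch. This is not Astra's implementation: the class `OnlineExplorer`, its methods, and the configuration labels are hypothetical names chosen for illustration, and the pruning and profile-indexing techniques mentioned in the abstract are omitted. The sketch assumes only that each mini-batch iteration is repetitive enough for one timed run per candidate to be representative.

```python
import time
from typing import Callable, Dict, Hashable, List, Optional


class OnlineExplorer:
    """Hypothetical helper: profiles one candidate configuration per
    training iteration, then reuses the fastest one for the rest of the job."""

    def __init__(self, candidates: List[Hashable],
                 clock: Callable[[], float] = time.perf_counter):
        if not candidates:
            raise ValueError("need at least one candidate configuration")
        self.candidates = list(candidates)
        self.clock = clock
        self.profile: Dict[Hashable, float] = {}  # configuration -> measured time
        self.best: Optional[Hashable] = None

    def next_config(self) -> Hashable:
        # Exploration: return the first candidate that has not been profiled.
        for config in self.candidates:
            if config not in self.profile:
                return config
        # Exploitation: every candidate is profiled, so reuse the fastest.
        return self.best

    def run_iteration(self, step: Callable[[Hashable], None]) -> Hashable:
        config = self.next_config()
        start = self.clock()
        step(config)  # real training work: the mini-batch is not wasted
        elapsed = self.clock() - start
        if config not in self.profile:
            self.profile[config] = elapsed
            if self.best is None or elapsed < self.profile[self.best]:
                self.best = config
        return config
```

In use, a training loop would call `run_iteration` once per mini-batch, passing a `step` function that executes the forward and backward pass under the given configuration. Every call advances training, so exploration costs only the difference between a slow candidate and the eventual best one.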






Information & Contributors


Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.




Association for Computing Machinery

New York, NY, United States


Author Tags

  1. adaptation
  2. deep learning
  3. domain-specific compiler


  • Research-article



Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%


Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 63
  • Downloads (Last 6 weeks): 8

Reflects downloads up to 14 Feb 2025



Cited By

  • (2024) DeepTM: Efficient Tensor Management in Heterogeneous Memory for DNN Training. IEEE Transactions on Parallel and Distributed Systems 35:11 (1920-1935). DOI: 10.1109/TPDS.2024.3431910. Online publication date: Nov-2024
  • (2024) MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). DOI: 10.1109/SC41406.2024.00040. Online publication date: 17-Nov-2024
  • (2024) How Useful is Communication Scheduling for Distributed Training? 2024 International Scientific and Technical Conference Modern Computer Network Technologies (MoNeTeC) (1-13). DOI: 10.1109/MoNeTec60984.2024.10768125. Online publication date: 29-Oct-2024
  • (2024) SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer. The Journal of Supercomputing. DOI: 10.1007/s11227-024-05890-8. Online publication date: 11-Mar-2024
  • (2023) BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach. Proceedings of the ACM on Management of Data 1:3 (1-29). DOI: 10.1145/3617327. Online publication date: 13-Nov-2023
  • (2023) Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling. Proceedings of the 29th Symposium on Operating Systems Principles (642-657). DOI: 10.1145/3600006.3613175. Online publication date: 23-Oct-2023
  • (2023) EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.1145/3581784.3607054. Online publication date: 12-Nov-2023
  • (2023) ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (266-280). DOI: 10.1145/3575693.3575721. Online publication date: 27-Jan-2023
  • (2023) Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (1113-1126). DOI: 10.1109/HPCA56546.2023.10071018. Online publication date: Feb-2023
  • (2022) Collage. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (517-529). DOI: 10.1145/3559009.3569651. Online publication date: 8-Oct-2022
