DOI: 10.1145/3339186.3339202
research-article

Performance Optimizations and Analysis of Distributed Deep Learning with Approximated Second-Order Optimization Method

Published: 05 August 2019

Abstract

Faster training of deep neural networks is desired to speed up the research and development cycle in deep learning. Distributed deep learning and second-order optimization methods are two different techniques for accelerating the training of deep neural networks. Previous work showed that combining the two through an approximated second-order optimization method called K-FAC lets them mitigate each other's drawbacks. However, that work did not discuss performance in detail, which is critical for practical use. In this work, we propose several performance optimization techniques that reduce the overheads of K-FAC and accelerate the overall training. With all performance optimizations applied, we speed up training by 1.64 times per iteration compared to a baseline. In addition to the performance optimizations, we construct a simple performance model that predicts training performance and helps users determine whether distributed K-FAC is appropriate for their training in terms of wall-clock time.
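
The core computation behind K-FAC is a Kronecker-factored approximation of each layer's Fisher information block, built from the layer's input activations and output gradients. As a rough illustration only (not the implementation evaluated in the paper), the following NumPy sketch shows the preconditioning step for one fully connected layer; the batch size, layer sizes, damping constant, and variable names are assumptions made for this example.

    # Illustrative K-FAC-style preconditioning for a single fully connected layer.
    # Minimal sketch, not the paper's implementation; all sizes and constants
    # below are assumptions made for this example.
    import numpy as np

    rng = np.random.default_rng(0)
    batch, d_in, d_out = 32, 64, 16

    a = rng.standard_normal((batch, d_in))    # layer input activations
    g = rng.standard_normal((batch, d_out))   # gradients w.r.t. layer outputs
    grad_W = g.T @ a / batch                  # ordinary mini-batch gradient of W

    # Kronecker factors of this layer's Fisher block: A ~ E[a a^T], G ~ E[g g^T].
    A = a.T @ a / batch
    G = g.T @ g / batch

    # Tikhonov damping keeps the (small) factor matrices well conditioned.
    damping = 1e-3
    A_inv = np.linalg.inv(A + damping * np.eye(d_in))
    G_inv = np.linalg.inv(G + damping * np.eye(d_out))

    # Preconditioned gradient: G^{-1} grad_W A^{-1}, which equals
    # (A kron G)^{-1} vec(grad_W) under the usual vec/Kronecker identity,
    # so the full Fisher block is never formed explicitly.
    precond_grad = G_inv @ grad_W @ A_inv

    lr = 0.1
    W = rng.standard_normal((d_out, d_in))
    W -= lr * precond_grad                    # natural-gradient-like update
    print("preconditioned gradient norm:", np.linalg.norm(precond_grad))

Because the factors A and G are small relative to the full Fisher block, inverting them is tractable; in a distributed setting, the overheads a K-FAC implementation must manage typically come from constructing, inverting, and communicating these per-layer factors across workers, which is the kind of cost the performance optimizations in this paper target.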


Cited By

  • (2021) Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pp. 550-560. DOI: 10.1109/ICDCS51616.2021.00059. Online publication date: Jul-2021.
  • (2020) Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2145-2153. DOI: 10.1145/3394486.3403265. Online publication date: 23-Aug-2020.
  • (2020) Scalable and Practical Natural Gradient for Large-Scale Deep Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2020.3004354. Online publication date: 2020.


          Published In

          ICPP Workshops '19: Workshop Proceedings of the 48th International Conference on Parallel Processing
          August 2019
          241 pages
          ISBN:9781450371964
          DOI:10.1145/3339186

          In-Cooperation

• University of Tsukuba

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 05 August 2019


          Author Tags

          1. deep learning
          2. neural networks
          3. second-order optimization

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

ICPP 2019: Workshops
          August 5 - 8, 2019
          Kyoto, Japan

          Acceptance Rates

          Overall Acceptance Rate 91 of 313 submissions, 29%

