DOI: 10.1145/3339186.3339202
research-article

Performance Optimizations and Analysis of Distributed Deep Learning with Approximated Second-Order Optimization Method

Published: 05 August 2019

Abstract

Faster training of deep neural networks is desired to speed up the research and development cycle in deep learning. Distributed deep learning and second-order optimization methods are two different techniques for accelerating the training of deep neural networks. Previous work showed that combining the two through an approximated second-order optimization method called K-FAC lets them mitigate each other's drawbacks. However, that work did not discuss performance in detail, which is critical for practical use. In this work, we propose several performance optimization techniques that reduce the overheads of K-FAC and accelerate the overall training. With all performance optimizations applied, we speed up training by 1.64 times per iteration compared to a baseline. In addition to the performance optimizations, we construct a simple performance model that predicts training performance and helps users determine whether distributed K-FAC is appropriate for their training in terms of wall-clock time.
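
The core computation behind K-FAC is a Kronecker-factored approximation of each layer's Fisher information block, built from the layer's input activations and output gradients. As a rough illustration only (not the implementation evaluated in the paper), the following NumPy sketch shows the preconditioning step for one fully connected layer; the batch size, layer sizes, damping constant, and variable names are assumptions made for this example.

    # Illustrative K-FAC-style preconditioning for a single fully connected layer.
    # Minimal sketch, not the paper's implementation; all sizes and constants
    # below are assumptions made for this example.
    import numpy as np

    rng = np.random.default_rng(0)
    batch, d_in, d_out = 32, 64, 16

    a = rng.standard_normal((batch, d_in))    # layer input activations
    g = rng.standard_normal((batch, d_out))   # gradients w.r.t. layer outputs
    grad_W = g.T @ a / batch                  # ordinary mini-batch gradient of W

    # Kronecker factors of this layer's Fisher block: A ~ E[a a^T], G ~ E[g g^T].
    A = a.T @ a / batch
    G = g.T @ g / batch

    # Tikhonov damping keeps the (small) factor matrices well conditioned.
    damping = 1e-3
    A_inv = np.linalg.inv(A + damping * np.eye(d_in))
    G_inv = np.linalg.inv(G + damping * np.eye(d_out))

    # Preconditioned gradient: G^{-1} grad_W A^{-1}, which equals
    # (A kron G)^{-1} vec(grad_W) under the usual vec/Kronecker identity,
    # so the full Fisher block is never formed explicitly.
    precond_grad = G_inv @ grad_W @ A_inv

    lr = 0.1
    W = rng.standard_normal((d_out, d_in))
    W -= lr * precond_grad                    # natural-gradient-like update
    print("preconditioned gradient norm:", np.linalg.norm(precond_grad))

Because the factors A and G are small relative to the full Fisher block, inverting them is tractable; in a distributed setting, the overheads a K-FAC implementation must manage typically come from constructing, inverting, and communicating these per-layer factors across workers, which is the kind of cost the performance optimizations in this paper target.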


Cited By

  • (2021) Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks. 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), pp. 550-560. DOI: 10.1109/ICDCS51616.2021.00059. Online publication date: Jul-2021.
  • (2020) Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2145-2153. DOI: 10.1145/3394486.3403265. Online publication date: 23-Aug-2020.
  • (2020) Scalable and Practical Natural Gradient for Large-Scale Deep Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2020.3004354. Online publication date: 2020.


          Published In

          ICPP Workshops '19: Workshop Proceedings of the 48th International Conference on Parallel Processing
          August 2019
          241 pages
          ISBN:9781450371964
          DOI:10.1145/3339186

          In-Cooperation

• University of Tsukuba

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 05 August 2019


          Author Tags

          1. deep learning
          2. neural networks
          3. second-order optimization

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

ICPP 2019: Workshops
          August 5 - 8, 2019
          Kyoto, Japan

          Acceptance Rates

          Overall Acceptance Rate 91 of 313 submissions, 29%

