ABSTRACT
Distributed deep learning with large mini-batches is a key strategy for training deep neural networks as fast as possible, but it poses a serious challenge: on large clusters it is difficult to achieve high scaling efficiency without compromising accuracy. In particular, enlarging the mini-batch reduces the number of model-update iterations over the whole training run, so a technique is needed that converges to high validation accuracy within a small number of iterations. In this paper, we introduce a novel technique, Final Polishing, which adjusts the means and variances used in batch normalization to mitigate the normalization gap between the validation dataset and the augmented training dataset. By applying this technique, we achieved a top-1 validation accuracy of 75.08% with a mini-batch size of 81,920 on 2,048 GPUs, completing the training of ResNet-50 in 74.7 seconds. In addition, targeting a top-1 validation accuracy of 75.9% or higher, we further tuned the number of GPUs and the DNN hyperparameters together with Final Polishing, achieving a top-1 validation accuracy of 75.97% with a mini-batch size of 86,016 on 3,072 GPUs and completing the training of ResNet-50 in 62.1 seconds.
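Since the abstract only sketches Final Polishing, a minimal PyTorch sketch of the underlying idea may help: after training, the batch-normalization running means and variances are re-estimated on images preprocessed the same way as the validation set, so the statistics used at inference no longer reflect heavy training-time augmentation. Everything below (the `final_polishing` helper, the data path, the batch count) is an illustrative assumption, not the authors' exact recipe.

```python
import torch
from torchvision import datasets, models, transforms

def final_polishing(model, loader, num_batches=50, device="cuda"):
    """Re-estimate BatchNorm running statistics on validation-style inputs."""
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)
    model.train()  # BN layers only update running stats in training mode
    for m in model.modules():
        if isinstance(m, bn_types):
            m.reset_running_stats()  # discard augmentation-biased statistics
            m.momentum = None        # None => cumulative moving average
    with torch.no_grad():            # forward passes only; no weight updates
        for i, (images, _) in enumerate(loader):
            if i >= num_batches:
                break
            model(images.to(device))
    model.eval()
    return model

# Illustrative usage: polish an already-trained ResNet-50 with the same
# preprocessing the validation set sees (resize + center crop, no heavy
# augmentation). The data path and batch size are placeholder assumptions.
val_style = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_data = datasets.ImageFolder("path/to/imagenet/train", transform=val_style)
loader = torch.utils.data.DataLoader(train_data, batch_size=256, shuffle=True)
model = models.resnet50().to("cuda")  # weights assumed already trained
model = final_polishing(model, loader)
```

Setting `momentum = None` makes each BN layer accumulate a cumulative moving average over the polishing batches rather than an exponential one, which gives a stable estimate from a small number of forward passes; whether the paper uses this exact estimator is an assumption here.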