DOI: 10.1145/3369583.3392687
Short Paper

An Efficient Technique for Large Mini-batch Challenge of DNNs Training on Large Scale Cluster

Published: 23 June 2020

ABSTRACT

Distributed deep learning with large mini-batches is a key strategy for training deep neural networks as fast as possible, but it poses a significant challenge: it is difficult to achieve high scaling efficiency on large clusters without compromising accuracy. The core difficulty is that enlarging the mini-batch reduces the number of model-update iterations over the whole training run, so a technique is needed that lets validation accuracy converge within a small number of iterations. In this paper, we introduce a novel technique, Final Polishing. It adjusts the means and variances used in batch normalization, mitigating the difference in normalization statistics between the validation dataset and the augmented training dataset. By applying this technique, we achieved a top-1 validation accuracy of 75.08% with a mini-batch size of 81,920 on 2,048 GPUs, completing ResNet-50 training in 74.7 seconds. In addition, targeting a top-1 validation accuracy of 75.9% or more, we performed further parameter tuning: adjusting the number of GPUs and the DNN hyperparameters together with Final Polishing, we achieved a top-1 validation accuracy of 75.97% with a mini-batch size of 86,016 on 3,072 GPUs, completing ResNet-50 training in 62.1 seconds.
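To make the idea concrete, the sketch below shows one plausible reading of the "polishing" step: after training, the learned weights are frozen and the batch-normalization running means and variances are re-estimated with forward passes over inputs that use validation-style (non-augmented) preprocessing. This is an illustrative sketch in PyTorch, not the MXNet pipeline used in the paper; the function name, the loader, and the number of polishing batches are assumptions rather than the authors' actual implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def final_polish_bn_stats(model, loader, num_batches=200, device="cuda"):
    """Re-estimate BatchNorm running mean/variance on validation-style
    (non-augmented) inputs, leaving all learned weights untouched.

    Sketch only: the paper's exact procedure and hyperparameters
    are not reproduced here.
    """
    # Reset BN running statistics and switch to a cumulative moving
    # average so the re-estimated statistics replace the old ones.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.reset_running_stats()
            m.momentum = None  # None = cumulative moving average in PyTorch

    model.train()  # BN updates running stats only in train mode
    for i, (images, _) in enumerate(loader):
        if i >= num_batches:
            break
        model(images.to(device, non_blocking=True))  # forward pass only; no backprop

    model.eval()
    return model
```

In this sketch, `loader` would feed training images through the same resize and center-crop transforms as the validation set; after polishing, the model is evaluated as usual with the refreshed batch-normalization statistics.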

Supplemental Material

3369583.3392687.mp4 (MP4, 22.6 MB)


Published in

HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
June 2020, 246 pages
ISBN: 9781450370523
DOI: 10.1145/3369583

Copyright © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States

Published: 23 June 2020


Qualifiers: short-paper

Acceptance Rates: Overall acceptance rate 166 of 966 submissions, 17%
