ABSTRACT
Distributed deep learning with large mini-batches is a key strategy for training deep neural networks as fast as possible, but it poses a serious challenge: on large clusters it is difficult to achieve high scaling efficiency without compromising accuracy. In particular, enlarging the mini-batch reduces the number of model-update iterations over the whole training run, so a technique is needed that converges to high validation accuracy within a small number of iterations. In this paper, we introduce a novel technique, Final Polishing, which adjusts the means and variances used in batch normalization to mitigate the normalization gap between the validation dataset and the augmented training dataset. By applying this technique, we achieved a top-1 validation accuracy of 75.08% with a mini-batch size of 81,920 on 2,048 GPUs, completing the training of ResNet-50 in 74.7 seconds. In addition, targeting a top-1 validation accuracy of 75.9% or higher, we further tuned the number of GPUs and the DNN hyperparameters together with Final Polishing, achieving a top-1 validation accuracy of 75.97% with a mini-batch size of 86,016 on 3,072 GPUs and completing the training of ResNet-50 in 62.1 seconds.
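Since the abstract only sketches Final Polishing, a minimal PyTorch sketch of the underlying idea may help: after training, the batch-normalization running means and variances are re-estimated on images preprocessed the same way as the validation set, so the statistics used at inference no longer reflect heavy training-time augmentation. Everything below (the `final_polishing` helper, the data path, the batch count) is an illustrative assumption, not the authors' exact recipe.

```python
import torch
from torchvision import datasets, models, transforms

def final_polishing(model, loader, num_batches=50, device="cuda"):
    """Re-estimate BatchNorm running statistics on validation-style inputs."""
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)
    model.train()  # BN layers only update running stats in training mode
    for m in model.modules():
        if isinstance(m, bn_types):
            m.reset_running_stats()  # discard augmentation-biased statistics
            m.momentum = None        # None => cumulative moving average
    with torch.no_grad():            # forward passes only; no weight updates
        for i, (images, _) in enumerate(loader):
            if i >= num_batches:
                break
            model(images.to(device))
    model.eval()
    return model

# Illustrative usage: polish an already-trained ResNet-50 with the same
# preprocessing the validation set sees (resize + center crop, no heavy
# augmentation). The data path and batch size are placeholder assumptions.
val_style = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_data = datasets.ImageFolder("path/to/imagenet/train", transform=val_style)
loader = torch.utils.data.DataLoader(train_data, batch_size=256, shuffle=True)
model = models.resnet50().to("cuda")  # weights assumed already trained
model = final_polishing(model, loader)
```

Setting `momentum = None` makes each BN layer accumulate a cumulative moving average over the polishing batches rather than an exponential one, which gives a stable estimate from a small number of forward passes; whether the paper uses this exact estimator is an assumption here.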