
Rise the Momentum: A Method for Reducing the Training Error on Multiple GPUs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11945)

Abstract

Training deep neural networks has received increasing attention in recent years and is typically performed with Stochastic Gradient Descent (SGD) or its variants. Distributed training increases training speed significantly, but at the same time causes a loss of precision. Increasing the batch size improves parallelism in distributed training; however, if the batch size is too large, it makes the training process harder and introduces more training error. In this paper, we keep the total batch size constant and lower the batch size on each GPU by increasing the number of GPUs used for distributed training. We train ResNet-50 [4] on the CIFAR-10 dataset with different optimizers, including SGD, Adam, and NAG. The experimental results show that a large batch size speeds up convergence to some degree, but if the batch size per GPU is too small, the training process fails to converge: a large number of GPUs, which implies a small batch size on each GPU, degrades training performance. We tried several ways to reduce the training error on multiple GPUs. According to our results, increasing the momentum is a well-behaved method for improving training performance when a constant, large total batch size is spread across many GPUs.
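
As a rough illustration of the idea described above, the sketch below (plain Python/NumPy, not the authors' code) performs one synchronous data-parallel momentum-SGD step by averaging per-GPU gradients, and raises the momentum coefficient as the number of GPUs grows while the total batch size stays fixed. The function names, the scaled_momentum heuristic, and all constants are illustrative assumptions, not the exact schedule evaluated in the paper.

    import numpy as np

    def momentum_sgd_step(w, v, grads, lr=0.1, momentum=0.9):
        # One synchronous data-parallel step: average the per-GPU gradients,
        # then apply the classical momentum update v <- m*v + g, w <- w - lr*v.
        g = np.mean(grads, axis=0)      # gradient averaged over all workers
        v = momentum * v + g
        w = w - lr * v
        return w, v

    def scaled_momentum(base_momentum, num_gpus, ref_gpus=4, cap=0.99):
        # Hypothetical heuristic: raise the momentum as the GPU count grows
        # (and the per-GPU batch shrinks), capped below 1.
        m = 1.0 - (1.0 - base_momentum) * ref_gpus / num_gpus
        return min(m, cap)

    # Toy usage on the quadratic loss 0.5*||w||^2 with noisy per-GPU gradients;
    # the total batch is fixed, so each of the num_gpus workers sees less data.
    num_gpus, dim = 16, 10
    w, v = np.ones(dim), np.zeros(dim)
    m = scaled_momentum(0.9, num_gpus)  # 0.975 for 16 GPUs with the defaults above
    for _ in range(200):
        grads = [w + 0.01 * np.random.randn(dim) for _ in range(num_gpus)]
        w, v = momentum_sgd_step(w, v, grads, lr=0.1, momentum=m)
    print("final loss:", 0.5 * float(w @ w))

The scaling rule keeps the effective step size lr/(1 - m) growing with the GPU count, which is one hedged way to read "increase the momentum" from the abstract; the paper's own momentum values should be taken from its experiments, not from this sketch.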

References

  1. Li, D., et al.: HPDL: towards a general framework for high-performance distributed deep learning. In: Proceedings of 39th IEEE International Conference on Distributed Computing Systems (IEEE ICDCS) (2019)

  2. Szegedy, C., Ioffe, S., Vanhoucke, V., et al.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol. 4, p. 12 (2017)

  3. Chollet, F.: Xception: deep learning with depthwise separable convolutions. arXiv preprint (2016)

  4. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  5. Huang, G., Liu, Z., Weinberger, K.Q., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, p. 3 (2017)

  6. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.: SSD: single shot multibox detector. arXiv:1512.02325v2 (2015)

  7. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)

  8. Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS, pp. 379–387 (2016)

  9. Qin, Z., Zhang, Z., Chen, X., et al.: FD-MobileNet: improved MobileNet with a fast downsampling strategy. arXiv preprint arXiv:1802.03750 (2018)

  10. Li, M., et al.: Scaling distributed machine learning with the parameter server. In: Proceedings of OSDI, pp. 583–598 (2014)

  11. Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)

  12. Smith, S.L., Le, Q.V.: A Bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451 (2017)

  13. Smith, S.L., Kindermans, P.-J., Le, Q.V.: Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017)

  14. Krizhevsky, A.: One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 [cs.NE] (2014)

  15. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)

  16. Goyal, P., et al.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)

  17. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)

  18. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)

  19. Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. arXiv:1512.04412 (2015)

  20. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of 32nd International Conference on Machine Learning, ICML15, pp. 448–456 (2015)

  21. Masters, D., Luschi, C.: Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612 (2018)

  22. You, Y., Gitman, I., Ginsburg, B.: Scaling SGD batch size to 32k for ImageNet training. arXiv preprint arXiv:1708.03888 (2017)

  23. Akiba, T., Suzuki, S., Fukuda, K.: Extremely large minibatch SGD: training ResNet-50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325 (2017)

  24. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y.: Entropy-SGD: biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838 (2016)

  25. You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., Keutzer, K.: ImageNet training in minutes. CoRR, abs/1709.05011 (2017)

  26. Balles, L., Romero, J., Hennig, P.: Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086 (2016)

  27. Li, Q., Tai, C., Weinan, E.: Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv preprint arXiv:1511.06251 (2017)

  28. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 [stat.ML] (2016)

  29. Chen, J., Pan, X., Monga, R., Bengio, S., Jozefowicz, R.: Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981 [cs.LG] (2016)

  30. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 [stat.ML] (2016)

  31. Jastrzȩbski, S., et al.: Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623 [cs.LG] (2017)

  32. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2014)

  33. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)

  34. Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp, COURSERA: Neural Networks for Machine Learning. Technical report, University of Toronto (2012)

  35. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)

  36. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (Transl. Soviet Math. Dokl.) 269, 543–547 (1983)

  37. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145–151 (1999)

  38. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 [cs.LG] (2017)

Acknowledgement

This work is sponsored in part by the National Key R&D Program of China under Grant No. 2018YFB2101100 and the National Natural Science Foundation of China under Grant Nos. 61932001 and 61872376.

Author information

Correspondence to Zhaoning Zhang.

Appendices

A Appendix A

Table 2. SGD results for ResNet-50 on CIFAR-10 with different batch sizes on multiple GPUs under the parameter server

B Appendix B

Table 3. Adam results for ResNet-50 on CIFAR-10 with different batch sizes on multiple GPUs under the parameter server


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Tang, Y., Yin, L., Zhang, Z., Li, D. (2020). Rise the Momentum: A Method for Reducing the Training Error on Multiple GPUs. In: Wen, S., Zomaya, A., Yang, L.T. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2019. Lecture Notes in Computer Science, vol. 11945. Springer, Cham. https://doi.org/10.1007/978-3-030-38961-1_4

  • DOI: https://doi.org/10.1007/978-3-030-38961-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-38960-4

  • Online ISBN: 978-3-030-38961-1

  • eBook Packages: Computer Science (R0)
