Abstract
Recent studies have shown that Convolutional Neural Networks that require a very large number of sum-of-product operations while using relatively few parameters tend to achieve high model accuracy. Asynchronous Stochastic Gradient Descent makes large-scale distributed training of such networks possible. However, asynchrony introduces stale gradients, which are considered to degrade training speed. In this work, we propose a method that predicts future parameters during training to mitigate the drawback of staleness. We show that the proposed method predicts the parameters accurately enough to improve the speed of asynchronous training. Experimental results on ImageNet demonstrate that, compared to a synchronous training method, the proposed asynchronous training method reduces the time to reach a given model accuracy by a factor of 1.9 when 256 GPUs are used in parallel.
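To make the idea concrete, the following is a minimal sketch of one way such a parameter prediction can be formed, assuming a simple linear extrapolation from the two most recent parameter snapshots. The function name, the NumPy representation, and the exact prediction rule are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def predict_parameters(w_curr, w_prev, staleness):
        # Linearly extrapolate the parameters `staleness` update steps ahead,
        # reusing the most recent observed parameter change as the per-step
        # transition (a linear prediction of the parameter transition).
        step = w_curr - w_prev
        return w_curr + staleness * step

    # Usage sketch: a worker computes its gradient against the predicted
    # future parameters, so the gradient is less stale by the time the
    # parameter server applies it.
    w_prev = np.zeros(4)
    w_curr = np.array([0.1, 0.1, 0.1, 0.1])
    expected_staleness = 4   # e.g. number of updates expected to be in flight
    w_pred = predict_parameters(w_curr, w_prev, expected_staleness)
    print(w_pred)            # [0.5 0.5 0.5 0.5]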
Notes
1. Mutexes need to be implemented in appropriate places to avoid read/write collisions; a sketch of such a guard appears after these notes.
2. See http://image-net.org for details.
3. See https://www.cs.toronto.edu/~kriz/cifar.html for details.
4. Each compute node of TSUBAME-KFC/DL contains 2 Intel Xeon E5-2620 v2 CPUs and 4 NVIDIA Tesla K80 cards. Since each K80 contains 2 GPUs internally, each node has 8 GPUs in total. The nodes are interconnected via FDR InfiniBand.
5. In every case, the learning rate is increased linearly from 0 to the target value from the beginning of training until the end of the first epoch, for stability. After this period, the learning rate is held fixed at the target value; a sketch of this schedule appears after these notes.
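For note 1, the sketch below shows one way reads and updates of shared parameters can be serialized with a mutex. This is an illustrative assumption (Python threading, NumPy parameters), not the paper's implementation.

    import threading

    class ParameterStore:
        # Mutex-guarded access to a shared parameter buffer, so that a
        # worker never reads parameters while an update is being applied.
        def __init__(self, params):
            self._params = params          # assumed to be a NumPy array
            self._lock = threading.Lock()

        def read(self):
            with self._lock:
                return self._params.copy()

        def apply_update(self, delta):
            with self._lock:
                self._params += delta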
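For note 5, the warm-up schedule can be sketched as follows, assuming the first epoch spans iters_per_epoch iterations; the function and argument names are hypothetical.

    def learning_rate(iteration, target_lr, iters_per_epoch):
        # Linear warm-up from 0 to target_lr during the first epoch,
        # then a constant learning rate for the rest of training.
        if iteration < iters_per_epoch:
            return target_lr * iteration / iters_per_epoch
        return target_lr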