Abstract
The vanishing gradient problem (VGP) is a major obstacle when training multilayer neural networks with the backpropagation algorithm. The problem is especially severe when sigmoid transfer functions are used in networks with many hidden layers. Nevertheless, the sigmoid function remains essential in several architectures, such as recurrent neural networks and autoencoders, where the VGP may also appear. In this article, we propose a modification of the backpropagation algorithm for training sigmoid neurons. It consists of adding a small constant to the calculation of the sigmoid's derivative, so that the resulting training direction differs slightly from the gradient while the original sigmoid function is kept unchanged in the network. Our experiments indicate that this modified derivative reaches the same accuracy in fewer training steps on most datasets. Moreover, because of the VGP, training with the original derivative fails to converge on sigmoid networks with more than five hidden layers, whereas the proposed modification allows backpropagation to train two additional hidden layers in feedforward neural networks.
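The core idea described in the abstract can be sketched in a few lines: during backpropagation, a small constant is added to the sigmoid's derivative so that the training signal never vanishes completely in saturated regions. The sketch below is illustrative only; the constant value `eps = 0.01` is an assumption, not a figure taken from the paper.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic sigmoid, used unchanged in the forward pass."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """Exact derivative s(x) * (1 - s(x)); vanishes when the unit saturates."""
    s = sigmoid(x)
    return s * (1.0 - s)

def modified_sigmoid_deriv(x, eps=0.01):
    """Pseudo-derivative used only in the backward pass.

    Adding a small constant (eps is an assumed value here) keeps the
    term bounded away from zero, so gradients can still flow through
    saturated sigmoid units in deep networks.
    """
    return sigmoid_deriv(x) + eps

# For a strongly saturated input, the exact derivative is nearly zero,
# while the modified one stays at roughly eps.
x = 10.0
print(sigmoid_deriv(x))           # very close to 0
print(modified_sigmoid_deriv(x))  # roughly eps
```

Because only the backward pass is altered, the network's forward computation and the final trained model still use the ordinary sigmoid; the modification merely biases the descent direction slightly away from the true gradient.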
Acknowledgements
The authors wish to thank Grant UTN 4103 "Análisis de Imágenes Utilizando Redes Neuronales" ("Image Analysis Using Neural Networks") of the National Technological University and the entire GITIA team for their help and support. We would also like to thank Dr. Sebastian Rodriguez for many helpful discussions and comments.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Roodschild, M., Gotay Sardiñas, J. & Will, A. A new approach for the vanishing gradient problem on sigmoid activation. Prog Artif Intell 9, 351–360 (2020). https://doi.org/10.1007/s13748-020-00218-y