
A new approach for the vanishing gradient problem on sigmoid activation

  • Regular Paper
  • Published in Progress in Artificial Intelligence

Abstract

The vanishing gradient problem (VGP) is an important issue when training multilayer neural networks with the backpropagation algorithm. The problem is worse when sigmoid transfer functions are used in a network with many hidden layers. However, the sigmoid function is very important in several architectures, such as recurrent neural networks and autoencoders, where the VGP may also appear. In this article, we propose a modification of the backpropagation algorithm for training sigmoid neurons. It consists of adding a small constant to the calculation of the sigmoid's derivative, so that the resulting training direction differs slightly from the gradient while the original sigmoid function is kept in the network. The results suggest that the modified derivative reaches the same accuracy in fewer training steps on most datasets. Moreover, due to the VGP, backpropagation with the original derivative does not converge when sigmoid networks have more than five hidden layers, whereas the modification allows backpropagation to train two additional hidden layers in feedforward neural networks.
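The core idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the constant (named epsilon here, with an arbitrary value of 0.1) and the layer shapes are assumptions made for the example; the abstract only states that a small constant is added to the sigmoid's derivative during backpropagation while the forward pass keeps the unmodified sigmoid.

```python
import numpy as np

def sigmoid(z):
    # Forward pass uses the ordinary, unmodified sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime_modified(a, epsilon=0.1):
    # Standard derivative sigma'(z) = a * (1 - a), computed from the
    # activation a = sigmoid(z), plus a small constant so the
    # backpropagated error signal never vanishes completely.
    # epsilon = 0.1 is an illustrative value, not the paper's.
    return a * (1.0 - a) + epsilon

def backward_layer(delta_next, W_next, a):
    # Error signal for one hidden layer: the usual backpropagation rule,
    # except that sigma'(z) is replaced by sigma'(z) + epsilon, so the
    # resulting training direction differs slightly from the true gradient.
    #   delta_next: (batch, units_next)   error from the layer above
    #   W_next:     (units_this, units_next) weights into that layer
    #   a:          (batch, units_this)   sigmoid activations of this layer
    return (delta_next @ W_next.T) * sigmoid_prime_modified(a)
```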



Acknowledgements

The authors wish to thank Grant UTN 4103 “Análisis de Imágenes Utilizando Redes Neuronales” of the National Technological University and all of the GITIA team for their help and support. We would like to thank Sebastian Rodriguez, PhD, for many helpful discussions and comments.

Author information

Corresponding author

Correspondence to Matías Roodschild.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Roodschild, M., Gotay Sardiñas, J. & Will, A. A new approach for the vanishing gradient problem on sigmoid activation. Prog Artif Intell 9, 351–360 (2020). https://doi.org/10.1007/s13748-020-00218-y

