Abstract
The vanishing gradient problem (VGP) is a major obstacle when training multilayer neural networks with the backpropagation algorithm. The problem is especially severe when sigmoid transfer functions are used in networks with many hidden layers. Nevertheless, the sigmoid function remains essential in several architectures, such as recurrent neural networks and autoencoders, where the VGP may also appear. In this article, we propose a modification of the backpropagation algorithm for training sigmoid neurons. It consists of adding a small constant to the calculation of the sigmoid's derivative, so that the resulting training direction differs slightly from the gradient while the original sigmoid function is kept unchanged in the network. Our experiments indicate that this modified derivative reaches the same accuracy in fewer training steps on most datasets. Moreover, because of the VGP, training with the original derivative fails to converge on sigmoid networks with more than five hidden layers, whereas the proposed modification allows backpropagation to train two additional hidden layers in feedforward neural networks.
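The core idea described in the abstract can be sketched in a few lines: during backpropagation, a small constant is added to the sigmoid's derivative so that the training signal never vanishes completely in saturated regions. The sketch below is illustrative only; the constant value `eps = 0.01` is an assumption, not a figure taken from the paper.

```python
import numpy as np

def sigmoid(x):
    """Standard logistic sigmoid, used unchanged in the forward pass."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """Exact derivative s(x) * (1 - s(x)); vanishes when the unit saturates."""
    s = sigmoid(x)
    return s * (1.0 - s)

def modified_sigmoid_deriv(x, eps=0.01):
    """Pseudo-derivative used only in the backward pass.

    Adding a small constant (eps is an assumed value here) keeps the
    term bounded away from zero, so gradients can still flow through
    saturated sigmoid units in deep networks.
    """
    return sigmoid_deriv(x) + eps

# For a strongly saturated input, the exact derivative is nearly zero,
# while the modified one stays at roughly eps.
x = 10.0
print(sigmoid_deriv(x))           # very close to 0
print(modified_sigmoid_deriv(x))  # roughly eps
```

Because only the backward pass is altered, the network's forward computation and the final trained model still use the ordinary sigmoid; the modification merely biases the descent direction slightly away from the true gradient.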
Acknowledgements
The authors wish to thank Grant UTN 4103 "Análisis de Imágenes Utilizando Redes Neuronales" ("Image Analysis Using Neural Networks") of the National Technological University and the entire GITIA team for their help and support. We would also like to thank Dr. Sebastian Rodriguez for many helpful discussions and comments.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Roodschild, M., Gotay Sardiñas, J. & Will, A. A new approach for the vanishing gradient problem on sigmoid activation. Prog Artif Intell 9, 351–360 (2020). https://doi.org/10.1007/s13748-020-00218-y