Abstract
In recurrent neural networks such as the long short-term memory (LSTM), the sigmoid and hyperbolic tangent functions are commonly used as activation functions in the network units. Other activation functions developed for neural networks have not been thoroughly analyzed in LSTMs. While many researchers have adopted LSTM networks for classification tasks, no comprehensive study is available on the choice of activation functions for the gates in these networks. In this paper, we compare 23 different activation functions in a basic LSTM network with a single hidden layer. The performance of different activation functions and different numbers of LSTM blocks in the hidden layer is analyzed for classification of records in the IMDB, Movie Review, and MNIST data sets. The quantitative results on all data sets demonstrate that the lowest average error is achieved with the Elliott activation function and its modifications. In particular, this family of functions yields better results than the sigmoid activation function, which is commonly used in LSTM networks.
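To make the compared setup concrete, the sketch below shows a single LSTM step in NumPy in which the Elliott function x / (1 + |x|) stands in for the hyperbolic tangent and a (0, 1)-rescaled Elliott variant stands in for the sigmoid gate activation. This is a minimal illustration under our own assumptions (function names, weight layout, and toy dimensions are ours), not the authors' implementation.

```python
import numpy as np

def elliott(x):
    # Elliott function (Elliott, 1993): x / (1 + |x|), range (-1, 1), a tanh-like curve
    return x / (1.0 + np.abs(x))

def elliott_01(x):
    # Rescaled Elliott variant mapped to (0, 1), used here in place of the sigmoid gates
    return 0.5 * x / (1.0 + np.abs(x)) + 0.5

def lstm_step(x, h_prev, c_prev, W, U, b, gate_act=elliott_01, cell_act=elliott):
    """One LSTM step with configurable gate and cell activations.
    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the order [input gate, forget gate, cell candidate, output gate]."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = gate_act(z[0:H])        # input gate
    f = gate_act(z[H:2 * H])    # forget gate
    g = cell_act(z[2 * H:3 * H])  # candidate cell state
    o = gate_act(z[3 * H:4 * H])  # output gate
    c = f * c_prev + i * g
    h = o * cell_act(c)
    return h, c

# Toy usage: D = 8 input features, H = 16 hidden units, sequence length 5
rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (16,)
```

Swapping gate_act and cell_act back to the logistic sigmoid and tanh recovers the standard LSTM cell, which is the baseline against which the Elliott family is compared.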




Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.