Abstract
Since the origins of artificial neural network research, many models of feedforward networks have been proposed. This paper presents an algorithm that adapts the shape of the activation function to the training data, so that it is learned along with the connection weights. The activation function is interpreted as a piecewise polynomial approximation to the distribution function of the argument of the activation function. An online learning procedure is given, and it is formally proved that it makes the training error decrease or stay the same, except in extreme cases. Moreover, the model is computationally simpler than standard feedforward networks, which makes it suitable for implementation on FPGAs and microcontrollers. However, the present proposal is limited to two-layer, one-output-neuron architectures, because the learned activation functions are not differentiable with respect to the node locations. Experimental results are provided, which show the performance of the proposed algorithm in classification and regression applications.
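As a concrete illustration of the central idea (a minimal sketch of our own in Python, not the authors' reference implementation), the following builds a sigmoid-like activation as a piecewise linear approximation to the empirical distribution function of the pre-activation values. All names, such as PiecewiseCDFActivation and n_nodes, are our own, and this sketch fits the nodes once from data, whereas the paper learns them online together with the connection weights:

```python
import numpy as np

class PiecewiseCDFActivation:
    """Piecewise linear approximation to the empirical distribution
    function (CDF) of the pre-activation values.

    A minimal sketch of the idea described in the abstract; the paper
    itself adapts the activation online along with the weights.
    """

    def __init__(self, n_nodes=8):
        self.n_nodes = n_nodes
        self.q = None  # node locations q_0 < ... < q_m
        self.v = None  # node values, initialized to the CDF levels

    def fit(self, u):
        # Place the nodes at equally spaced quantiles of the observed
        # pre-activations u, so the activation tracks their distribution.
        probs = np.linspace(0.0, 1.0, self.n_nodes)
        self.q = np.quantile(u, probs)
        self.v = probs.copy()  # CDF values at the nodes
        return self

    def __call__(self, u):
        # Piecewise linear interpolation between the nodes; outside the
        # node range the function saturates at 0 and 1, like a sigmoid.
        return np.interp(u, self.q, self.v)

# Toy usage: the fitted activation is a cheap, sigmoid-like squashing
# function adapted to the data instead of a fixed analytic formula.
rng = np.random.default_rng(0)
u = rng.normal(size=1000)        # simulated pre-activation values
act = PiecewiseCDFActivation(n_nodes=8).fit(u)
print(act(np.array([-2.0, 0.0, 2.0])))  # increases from near 0 to near 1
```

Because evaluation reduces to a table lookup and one linear interpolation, such an activation avoids transcendental functions entirely, which is what makes FPGA and microcontroller implementations attractive.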
Acknowledgements
This work is partially supported by the Ministry of Economy and Competitiveness of Spain under grants TIN2014-53465-R (Video surveillance by active search of anomalous events) and TIN2014-57341-R (Metaheuristics, holistic intelligence and smart mobility). It is also partially supported by the Autonomous Government of Andalusia (Spain) under project P12-TIC-657 (Self-organizing systems and robust estimators for video surveillance). All of these grants include funds from the European Regional Development Fund (ERDF). The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan X GPUs used for this research.
Appendix
Proof of Proposition 2
Let us assume that \(u_{r,i}\in \left[ q_{i,k},q_{i,k+1}\right) \). Therefore,
where \(\delta _{i,k}\left( u_{r,i}\right) \in \left[ 0,1\right] \). The exact form of \(\delta _{i,k}\left( u_{r,i}\right) \) depends on the order of the polynomials \(\gamma \).
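The display equation between the two sentences above did not survive extraction. Under the natural reading that the node values of the learned activation are the distribution-function levels \(k/m\), the omitted relation would take a form such as the following (our reconstruction, with \(f_{i}\) as our own symbol for the activation of hidden unit \(i\), not verbatim from the paper):
\[
f_{i}\left( u_{r,i}\right) =\frac{k+\delta _{i,k}\left( u_{r,i}\right) }{m},\qquad \delta _{i,k}\left( u_{r,i}\right) \in \left[ 0,1\right] ,
\]
where, in the piecewise linear (first-order) case, one would have \(\delta _{i,k}\left( u_{r,i}\right) =\left( u_{r,i}-q_{i,k}\right) /\left( q_{i,k+1}-q_{i,k}\right) \). This reading is consistent with the bound \(\left| 2w_{2,i}/m\right| \) used below, since moving \(u_{r,i}\) to an adjacent interval changes the contribution \(w_{2,i}f_{i}\left( u_{r,i}\right) \) by at most \(2\left| w_{2,i}\right| /m\).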
If \(w_{2,i}\left( y_{r}-z_{r}\right) <0\) and \(\left| \frac{2w_{2,i}}{m}\right| <\lambda \left| y_{r}-z_{r}\right| \), then the update Eq. (36) implies that:
Therefore,
where the bars correspond to the values obtained after executing the update. Moreover, from (4):
Then from (63), (65) and (66):
Since \(w_{2,i}\left( y_{r}-z_{r}\right) <0\), there are two possible cases: (a) \(\left( w_{2,i}>0\right) \wedge \left( y_{r}-z_{r}<0\right) \); (b) \(\left( w_{2,i}<0\right) \wedge \left( y_{r}-z_{r}>0\right) \).
For case (a), since \(\delta _{i,k}\left( u_{r,i}\right) ,\bar{\delta }_{i,k+1}\left( u_{r,i}\right) \in \left[ 0,1\right] \), from (67) we obtain:
On the other hand, since \(y_{r}-z_{r}<0\), \(w_{2,i}>0\), \(\lambda \in \left( 0,1\right] \) and \(\left| \frac{2w_{2,i}}{m}\right| <\lambda \left| y_{r}-z_{r}\right| \) we have:
That is, the new squared error \(\bar{E}_{r}\) is lower than or equal to the old squared error \(E_{r}\), as required.
For case (b), since \(\delta _{i,k}\left( u_{r,i}\right) ,\bar{\delta }_{i,k-1}\left( u_{r,i}\right) \in \left[ 0,1\right] \), from (67) we obtain:
On the other hand, since \(y_{r}-z_{r}>0\), \(w_{2,i}<0\), \(\lambda \in \left( 0,1\right] \) and \(\left| \frac{2w_{2,i}}{m}\right| <\lambda \left| y_{r}-z_{r}\right| \) we have:
Again, the new squared error \(\bar{E}_{r}\) is lower than or equal to the old squared error \(E_{r}\), as required. \(\square \)
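Since the display equations of the proof were not reproduced in this excerpt, the following toy check (our own construction, with an assumed net effect of the update on the output) only illustrates the flavor of Proposition 2: under the premises \(w_{2,i}\left( y_{r}-z_{r}\right) <0\) and \(\left| 2w_{2,i}/m\right| <\lambda \left| y_{r}-z_{r}\right| \), a step of the output \(z_{r}\) toward the target \(y_{r}\) of magnitude at most \(2\left| w_{2,i}\right| /m\) can never increase the squared error:

```python
import numpy as np

# Toy numerical check of Proposition 2's conclusion (our construction,
# not the paper's exact update rule): if the update moves the output z
# toward the target y by at most 2*|w2|/m, and the premise
# |2*w2/m| < lam*|y - z| holds, the step is strictly smaller than the
# residual, so the squared error cannot increase.
rng = np.random.default_rng(42)
m, lam = 10, 0.5  # number of activation intervals and learning rate

for _ in range(10_000):
    w2, y, z = rng.normal(size=3)
    premise = (w2 * (y - z) < 0) and (abs(2 * w2 / m) < lam * abs(y - z))
    if not premise:
        continue  # Proposition 2 only covers this case
    # Assumed net effect of the node update on the output (cases (a)
    # and (b) of the proof): a step toward y of size at most 2*|w2|/m.
    step = rng.uniform(0.0, 2 * abs(w2) / m) * np.sign(y - z)
    E_old = (y - z) ** 2
    E_new = (y - (z + step)) ** 2
    assert E_new <= E_old, "squared error increased"

print("Toy check passed: the squared error never increased.")
```

The essential point, matching the bound \(\left| 2w_{2,i}/m\right| <\lambda \left| y_{r}-z_{r}\right| \) with \(\lambda \in \left( 0,1\right] \), is that the output step is strictly shorter than the residual, so overshooting past \(y_{r}\) into a larger error is impossible.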
Cite this article
López-Rubio, E., Ortega-Zamorano, F., Domínguez, E. et al. Piecewise Polynomial Activation Functions for Feedforward Neural Networks. Neural Process Lett 50, 121–147 (2019). https://doi.org/10.1007/s11063-018-09974-4