Piecewise Polynomial Activation Functions for Feedforward Neural Networks

Abstract

Since the origins of artificial neural network research, many models of feedforward networks have been proposed. This paper presents an algorithm that adapts the shape of the activation function to the training data, so that it is learned along with the connection weights. The activation function is interpreted as a piecewise polynomial approximation to the distribution function of the argument of the activation function. An online learning procedure is given, and it is formally proved that, except in extreme cases, it makes the training error decrease or stay the same. Moreover, the model is computationally simpler than standard feedforward networks, which makes it suitable for implementation on FPGAs and microcontrollers. However, our present proposal is limited to two-layer, one-output-neuron architectures due to the lack of differentiability of the learned activation functions with respect to the node locations. Experimental results are provided, which show the performance of the proposed algorithm for classification and regression applications.
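To make the idea above concrete, the following minimal Python sketch evaluates a piecewise-linear special case of such an activation function, with the node locations taken as empirical quantiles of the activation argument so that the resulting function approximates its distribution function. The function name, the quantile-based node placement, and the whole implementation are illustrative assumptions on our part; they are not the authors' code, and the adaptation of the node locations during training is omitted.

```python
import numpy as np

def piecewise_linear_activation(u, q):
    """Evaluate g(u) = (k + delta_k(u)) / m for sorted node locations q[0..m].

    Piecewise-linear special case of a piecewise polynomial approximation
    to the distribution function of the activation argument u.
    """
    m = len(q) - 1
    if u <= q[0]:
        return 0.0
    if u >= q[-1]:
        return 1.0
    # Find k such that q[k] <= u < q[k+1].
    k = int(np.searchsorted(q, u, side="right")) - 1
    delta = (u - q[k]) / (q[k + 1] - q[k])  # linear interpolation, in [0, 1]
    return (k + delta) / m

# Node locations chosen as quantiles of sample activation arguments,
# so that g approximates their empirical distribution function.
samples = np.random.randn(1000)
q = np.quantile(samples, np.linspace(0.0, 1.0, 11))  # m = 10 pieces
print(piecewise_linear_activation(0.0, q))  # close to 0.5 for N(0, 1) data
```

Evaluating such a function requires only a node search and one interpolation per neuron, which is consistent with the low-cost FPGA and microcontroller implementations mentioned above.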



Acknowledgements

This work is partially supported by the Ministry of Economy and Competitiveness of Spain under Grants TIN2014-53465-R (Video surveillance by active search of anomalous events) and TIN2014-57341-R (Metaheuristics, holistic intelligence and smart mobility). It is also partially supported by the Autonomous Government of Andalusia (Spain) under project P12-TIC-657 (Self-organizing systems and robust estimators for video surveillance). All of these projects include funds from the European Regional Development Fund (ERDF). The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan X GPUs used for this research.

Author information

Corresponding author

Correspondence to Ezequiel López-Rubio.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Proposition 2

Let us assume that \(u_{r,i}\in \left[ q_{i,k},q_{i,k+1}\right) \). Therefore,

$$\begin{aligned} g_{i}\left( u_{r,i}\right) =\frac{k+\delta _{i,k}\left( u_{r,i}\right) }{m} \end{aligned}$$
(63)

where \(\delta _{i,k}\left( u_{r,i}\right) \in \left[ 0,1\right] \). The exact form of \(\delta _{i,k}\left( u_{r,i}\right) \) depends on the order of the polynomials \(\gamma \).
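For instance, under the additional assumption of first-order (piecewise-linear) pieces that interpolate linearly between the node values \(k/m\) and \(\left( k+1\right) /m\), one would obtain

$$\begin{aligned} \delta _{i,k}\left( u_{r,i}\right) =\frac{u_{r,i}-q_{i,k}}{q_{i,k+1}-q_{i,k}} \end{aligned}$$

which indeed lies in \(\left[ 0,1\right] \) for \(u_{r,i}\in \left[ q_{i,k},q_{i,k+1}\right) \). This linear form is only an illustrative special case; higher-order pieces \(\gamma \) lead to other expressions for \(\delta _{i,k}\).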

If \(w_{2,i}\left( y_{r}-z_{r}\right) <0\) and \(\left| \frac{2w_{2,i}}{m}\right| <\lambda \left| y_{r}-z_{r}\right| \), then the update Eq. (36) implies that:

$$\begin{aligned} q_{i,k}\le \bar{q}_{i,k+1}\le u_{r,i} \end{aligned}$$
(64)

Therefore,

$$\begin{aligned} \bar{g}_{i}\left( u_{r,i}\right) =\frac{k+1+\bar{\delta }_{i,k+1}\left( u_{r,i}\right) }{m} \end{aligned}$$
(65)

where the bars correspond to the values obtained after executing the update. Moreover, from (4):

$$\begin{aligned} y_{r}-\bar{y}_{r}= & {} \sum _{s=1}^{L}w_{2,s}g_{s}\left( u_{r,s}\right) -\sum _{s=1}^{L}w_{2,s}\bar{g}_{s}\left( u_{r,s}\right) \nonumber \\= & {} w_{2,i}g_{i}\left( u_{r,i}\right) -w_{2,i}\bar{g}_{i}\left( u_{r,i}\right) \end{aligned}$$
(66)

Then from (63), (65) and (66):

$$\begin{aligned} y_{r}-\bar{y}_{r}= & {} w_{2,i}\left( \frac{k+\delta _{i,k}\left( u_{r,i}\right) }{m}-\frac{k+1+\bar{\delta }_{i,k+1}\left( u_{r,i}\right) }{m}\right) \nonumber \\= & {} w_{2,i}\frac{\delta _{i,k}\left( u_{r,i}\right) -\bar{\delta }_{i,k+1}\left( u_{r,i}\right) -1}{m} \end{aligned}$$
(67)

Since \(w_{2,i}\left( y_{r}-z_{r}\right) <0\), there are two possible cases: (a) \(\left( w_{2,i}>0\right) \wedge \left( y_{r}-z_{r}<0\right) \); (b) \(\left( w_{2,i}<0\right) \wedge \left( y_{r}-z_{r}>0\right) \).

For case (a), since \(\delta _{i,k}\left( u_{r,i}\right) ,\bar{\delta }_{i,k+1}\left( u_{r,i}\right) \in \left[ 0,1\right] \), from (67) we obtain:

$$\begin{aligned} -\frac{2w_{2,i}}{m}\le y_{r}-\bar{y}_{r}\le 0 \end{aligned}$$
(68)

On the other hand, since \(y_{r}-z_{r}<0\), \(w_{2,i}>0\), \(\lambda \in \left( 0,1\right] \) and \(\left| \frac{2w_{2,i}}{m}\right| <\lambda \left| y_{r}-z_{r}\right| \) we have:

$$\begin{aligned} y_{r}-z_{r}\le -\frac{2w_{2,i}}{m} \end{aligned}$$
(69)

From (68) and (69):

$$\begin{aligned}&\displaystyle y_{r}-z_{r}\le y_{r}-\bar{y}_{r}\le 0 \end{aligned}$$
(70)
$$\begin{aligned}&\displaystyle -z_{r}\le -\bar{y}_{r}\le -y_{r} \end{aligned}$$
(71)
$$\begin{aligned}&\displaystyle y_{r} \le \bar{y}_{r}\le z_{r} \end{aligned}$$
(72)
$$\begin{aligned}&\displaystyle \bar{E}_{r} \le E_{r} \end{aligned}$$
(73)

That is, the new squared error \(\bar{E}_{r}\) is lower than or equal to the old squared error \(E_{r}\), as required.

For case (b), since \(\delta _{i,k}\left( u_{r,i}\right) ,\bar{\delta }_{i,k+1}\left( u_{r,i}\right) \in \left[ 0,1\right] \), from (67) we obtain:

$$\begin{aligned} 0\le y_{r}-\bar{y}_{r}\le -\frac{2w_{2,i}}{m} \end{aligned}$$
(74)

On the other hand, since \(y_{r}-z_{r}>0\), \(w_{2,i}<0\), \(\lambda \in \left( 0,1\right] \) and \(\left| \frac{2w_{2,i}}{m}\right| <\lambda \left| y_{r}-z_{r}\right| \) we have:

$$\begin{aligned} -\frac{2w_{2,i}}{m}\le y_{r}-z_{r} \end{aligned}$$
(75)

From (74) and (75):

$$\begin{aligned}&\displaystyle 0\le y_{r}-\bar{y}_{r}\le y_{r}-z_{r} \end{aligned}$$
(76)
$$\begin{aligned}&\displaystyle -y_{r}\le -\bar{y}_{r}\le -z_{r} \end{aligned}$$
(77)
$$\begin{aligned}&\displaystyle z_{r} \le \bar{y}_{r}\le y_{r} \end{aligned}$$
(78)
$$\begin{aligned}&\displaystyle \bar{E}_{r} \le E_{r} \end{aligned}$$
(79)

Again, the new squared error \(\bar{E}_{r}\) is lower than or equal to the old squared error \(E_{r}\), as required. \(\square \)
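As an informal numerical sanity check of the two cases above (an illustration only, not part of the proof), the short Python snippet below evaluates Eq. (67) for arbitrary values that satisfy the hypotheses \(w_{2,i}\left( y_{r}-z_{r}\right) <0\) and \(\left| 2w_{2,i}/m\right| <\lambda \left| y_{r}-z_{r}\right| \), and confirms that the squared error does not increase. All variable names and numerical values are our own illustrative assumptions.

```python
def check_case(w2i, y, z, m, delta, delta_bar, lam=1.0):
    """Return True if the updated squared error does not exceed the old one."""
    # Hypotheses of the proposition for the considered update.
    assert w2i * (y - z) < 0 and abs(2 * w2i / m) < lam * abs(y - z)
    y_bar = y - w2i * (delta - delta_bar - 1) / m  # rearranged Eq. (67)
    E, E_bar = (y - z) ** 2, (y_bar - z) ** 2      # squared errors, up to a constant factor
    return E_bar <= E                              # Eqs. (73) and (79)

# Case (a): w_{2,i} > 0 and y_r - z_r < 0.
print(check_case(w2i=0.3, y=0.2, z=1.0, m=10, delta=0.4, delta_bar=0.7))   # True
# Case (b): w_{2,i} < 0 and y_r - z_r > 0.
print(check_case(w2i=-0.3, y=1.0, z=0.2, m=10, delta=0.4, delta_bar=0.7))  # True
```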


Cite this article

López-Rubio, E., Ortega-Zamorano, F., Domínguez, E. et al. Piecewise Polynomial Activation Functions for Feedforward Neural Networks. Neural Process Lett 50, 121–147 (2019). https://doi.org/10.1007/s11063-018-09974-4
