Abstract
This paper considers the batch gradient method with smoothing \(\ell _0\) regularization (BGSL0) for training and pruning feedforward neural networks. We show why BGSL0 can produce sparse weights, which are crucial for pruning networks. We prove both the weak and the strong convergence of BGSL0 under mild conditions, and we show that the error function decreases monotonically during training. Two examples substantiate the theoretical analysis and show that BGSL0 yields sparser weights than three typical \(\ell _p\) regularization methods.
References
Hornik K (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366
Rubio JD, Angelov P, Pacheco J (2011) Uniformly stable backpropagation algorithm to train a feedforward neural network. IEEE Trans Neural Netw 22(3):356–366
Sum J, Leung CS, Ho K (2012) Convergence analyses on on-line weight noise injection-based training algorithms for MLPs. IEEE Trans Neural Netw Learn Syst 23(11):1827–1840
Bordignon F, Gomide F (2014) Uninorm based evolving neural networks and approximation capabilities. Neurocomputing 127:13–20
Pratama M, Anavatti SG, Angelov PP, Lughofer E (2014) PANFIS: a novel incremental learning machine. IEEE Trans Neural Netw Learn Syst 25(1):55–68
Rubio JJ (2014) Analytic neural network model of a wind turbine. Soft Comput. doi:10.1007/s00500-014-1290-0
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC–19(6):716–723
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Stathakis D (2009) How many hidden layers and nodes? Int J Remote Sens 30(8):2133–2147
Augasta MG, Kathirvalavakumar T (2011) A novel pruning algorithm for optimizing feedforward neural network of classification problems. Neural Process Lett 34:241–258
Karayiannis NB, Glenn WM (1997) Growing radial basis neural networks: merging supervised and unsupervised learning with network growth techniques. IEEE Trans Neural Netw 8(6):1492–1506
Huang GB, Paramasivan S, Narasimhan S (2005) A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Trans Neural Netw 16(1):57–67
Reed R (1993) Pruning algorithms: a survey. IEEE Trans Neural Netw 4:740–747
Loone S, Irwin G (2001) Improving neural network training solutions using regularisation. Neurocomputing 37:71–90
Setiono R (1997) A penalty-function approach for pruning feedforward neural networks. Neural Comput 9:185–204
Shao HM, Xu DP, Zheng GF, Liu LJ (2012) Convergence of an online gradient method with inner-product penalty and adaptive momentum. Neurocomputing 77:243–252
Karnin ED (1990) A simple procedure for pruning back-propagation trained neural networks. IEEE Trans Neural Netw 1:239–242
Lughofer E (2011) Evolving fuzzy systems—methodologies, advanced concepts and applications. Springer, Berlin
Rubio JJ (2014) Evolving intelligent algorithms for the modelling of brain and eye signals. Appl Soft Comput 14(B):259–268
Ordonez FJ, Iglesias JA, Toledo DP, Ledezma A, Sanchis A (2013) Online activity recognition using evolving classifiers. Expert Syst Appl 40:1248–1255
Saito K, Nakano S (2000) Second-order learning algorithm with squared penalty term. Neural Comput 12:709–729
Zhang HS, Wu W, Liu F, Yao MC (2009) Boundedness and convergence of online gradient method with penalty for feedforward neural networks. IEEE Trans Neural Netw 20(6):1050–1054
Zhang HS, Wu W, Yao MC (2012) Boundedness and convergence of batch back-propagation algorithm with penalty for feedforward neural networks. Neurocomputing 89:141–146
Shao HM, Zheng GF (2011) Boundedness and convergence of online gradient method with penalty and momentum. Neurocomputing 74:765–770
Yu X, Chen QF (2012) Convergence of gradient method with penalty for Ridge Polynomial neural network. Neurocomputing 97:405–409
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J Roy Stat Soc Ser B Methodol 58:267–288
Wu W, Fan QW, Zurada JM, Wang J, Yang DK, Liu Y (2014) Batch gradient method with smoothing \(L_{1/2}\) regularization for training of feedforward neural networks. Neural Netw 50:72–78
Fan QW, Zurada JM, Wu W (2014) Convergence of online gradient method for feedforward neural networks with smoothing \(L_{1/2}\) regularization penalty. Neurocomputing 131:208–216
Liu Y, Wu W, Fan QW, Yang DK, Wang J (2014) A modified gradient learning algorithm with smoothing \(L_{1/2}\) regularization for Takagi–Sugeno fuzzy models. Neurocomputing 138:229–237
Candes EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inform Theory 51(12):4203–4215
Wang YF, Liu P, Li ZH, Sun T, Yang CC, Zheng QS (2013) Data regularization using Gaussian beams decomposition and sparse norms. J Inverse Ill-Posed Probl 21(1):1–23
Liu Y, Yang J, Li L, Wu W (2012) Negative effects of sufficiently small initial weights on back-propagation neural networks. J Zhejiang Univ-Sci C (Comput Electron) 13(8):585–592
Acknowledgments
We are grateful to the reviewers for their insightful comments. This research is supported by the National Natural Science Foundation of China (No. 61101228) and the China Postdoctoral Science Foundation (No. 2012M520623).
Appendix
In this appendix, we give the proof of Theorem 1. For simplicity, we introduce the following notation:
for \(n=1,2,\ldots,\) and \(j=1,\ldots,J.\)
Lemma 1
Suppose that Assumptions (A1)–(A3) hold. Then \(E_{\bf w} ({\bf w})\) satisfies a Lipschitz condition; that is, there exists a positive constant \(C_2\) such that
In particular, for \(\theta \geqslant 0\), we have
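The two displays of Lemma 1, cited later as (22) and (23), are not reproduced in this copy. As a hedged reconstruction, assuming the Lipschitz condition is stated along the iterates \({\bf w}^n\) (which is how (22) and (23) are used in the proofs that follow), they would take the following form:

```latex
\begin{align}
\Vert E_{\bf w}({\bf w}^{n+1}) - E_{\bf w}({\bf w}^{n})\Vert
  &\le C_2\,\Vert {\bf w}^{n+1} - {\bf w}^{n}\Vert, \tag{22}\\
\Vert E_{\bf w}\bigl({\bf w}^{n} + \theta({\bf w}^{n+1} - {\bf w}^{n})\bigr)
      - E_{\bf w}({\bf w}^{n})\Vert
  &\le C_2\,\theta\,\Vert {\bf w}^{n+1} - {\bf w}^{n}\Vert. \tag{23}
\end{align}
```

Under this reading, (23) is (22) applied to the pair \({\bf w}^{n}\) and \({\bf w}^{n}+\theta ({\bf w}^{n+1}-{\bf w}^{n})\), whose distance is \(\theta \Vert {\bf w}^{n+1}-{\bf w}^{n}\Vert \).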
Proof
Using (12a) and the triangle inequality, we have
To estimate the right-hand side further, we consider the change of \({\bf F}^{n,j}\), defined in (21), between two consecutive steps:
where \(t_l\) lies on the segment between \({\bf w}_l^{n+1}\cdot \varvec{\xi }^j\) and \({\bf w}_l^{n}\cdot \varvec{\xi }^j\), for \(l=1,\ldots, L\), and \(C_3=\sup f'(t)\max \limits _{1\le j\le J}\Vert \varvec{\xi }^j\Vert \).
It is easy to see that \(e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})\) is Lipschitz continuous; let the positive constant \(C_4\) be the corresponding Lipschitz constant. Using (25), Assumption (A1), and the Cauchy–Schwarz inequality, we have
with
As in the derivation of (25), using the definition of \(H_{\sigma }^{\prime }(\cdot )\), Assumption (A3), and the mean value theorem, we have
where \(C_7=\sup h_{\sigma }^{\prime \prime }(t)\).
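The display (27) is missing from this copy. The following is a sketch of the bound it presumably states, assuming that \(H_{\sigma }^{\prime }({\bf w})\) collects the componentwise derivatives \(h_{\sigma }^{\prime }(w_i)\) (an assumption, since the definition sits in the main text) and reading \(C_7\) as a bound on \(|h_{\sigma }^{\prime \prime }|\). The points \(t_i\) are introduced here for illustration and lie between \(w_i^{n}\) and \(w_i^{n+1}\):

```latex
\begin{align*}
\bigl|h_\sigma'(w_i^{n+1}) - h_\sigma'(w_i^{n})\bigr|
  &= \bigl|h_\sigma''(t_i)\bigr|\,\bigl|w_i^{n+1} - w_i^{n}\bigr|
   \le C_7\,\bigl|w_i^{n+1} - w_i^{n}\bigr|,\\
\Vert H_\sigma'({\bf w}^{n+1}) - H_\sigma'({\bf w}^{n})\Vert
  &\le C_7\,\Vert {\bf w}^{n+1} - {\bf w}^{n}\Vert .
\end{align*}
```

This form is consistent with the term \(\lambda C_7\) that appears in the constant \(C_2\) at the end of the proof.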
Combining Eqs. (24), (26), and (27), we have
Similarly, for \(l=1,\ldots,L\), there exists a constant \(C_8\) such that
Then Eqs. (10), (11), (28), and (29) yield (22):
where \(C_2=J\sqrt{C_6^2+LC_8^2}+\lambda C_7\).
Equation (23) follows directly from (22). \(\square \)
Now, we proceed to the proof of Theorem 1 by dealing with Eqs. (16)–(19) separately.
Proof of (16)
By the mean value theorem, there exists a constant \(\theta \in [0,1]\) such that
For (16) to hold, it suffices that the learning rate \(\eta \) satisfy
\(\square \)
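The two displays of this proof are missing from this copy. The following is a sketch of the standard argument, assuming the batch update \({\bf w}^{n+1}={\bf w}^{n}-\eta E_{\bf w}({\bf w}^{n})\) and the Lipschitz bound of Lemma 1 along the segment between \({\bf w}^{n}\) and \({\bf w}^{n+1}\). The symbol \(\Delta {\bf w}^{n}={\bf w}^{n+1}-{\bf w}^{n}\) is introduced here for illustration only:

```latex
\begin{align*}
E({\bf w}^{n+1}) - E({\bf w}^{n})
 &= E_{\bf w}({\bf w}^{n} + \theta\Delta{\bf w}^{n})\cdot\Delta{\bf w}^{n}\\
 &= E_{\bf w}({\bf w}^{n})\cdot\Delta{\bf w}^{n}
   + \bigl(E_{\bf w}({\bf w}^{n} + \theta\Delta{\bf w}^{n})
           - E_{\bf w}({\bf w}^{n})\bigr)\cdot\Delta{\bf w}^{n}\\
 &\le -\eta\,\Vert E_{\bf w}({\bf w}^{n})\Vert^{2}
      + C_2\,\theta\,\Vert\Delta{\bf w}^{n}\Vert^{2}\\
 &\le -(\eta - C_2\eta^{2})\,\Vert E_{\bf w}({\bf w}^{n})\Vert^{2}.
\end{align*}
```

The last step uses \(\theta \le 1\) and \(\Vert \Delta {\bf w}^{n}\Vert =\eta \Vert E_{\bf w}({\bf w}^{n})\Vert \). The right-hand side is nonpositive whenever \(0<\eta <1/C_2\), which is the kind of condition on the learning rate that the missing display would impose.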
Proof of (17)
Equation (17) follows directly from (16) and \(E({\bf w}^n)>0\) \((n=1,2,\ldots )\), since a monotonically decreasing sequence that is bounded below converges. \(\square \)
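As a numerical illustration of the behaviour behind (16) and (17), the sketch below runs fixed-step batch gradient descent on a penalized error and records the error after every step. It is a toy under stated assumptions, not the authors' experiment: the model is a single tanh neuron rather than a full feedforward network, the data set is a hypothetical OR-like problem, and the smoothing function \(h_{\sigma }(t)=1-\exp (-t^2/(2\sigma ^2))\) is a common smooth surrogate for \(\ell _0\) that is assumed here because its definition is not part of this appendix. All names (`SAMPLES`, `train`, and so on) are illustrative.

```python
import math

# Hypothetical toy setup (not from the paper): one tanh neuron, OR-like data
# with a constant bias input, and an assumed smoothing function
# h_sigma(t) = 1 - exp(-t^2 / (2 sigma^2)).
SAMPLES = [((0.0, 0.0, 1.0), 0.0), ((0.0, 1.0, 1.0), 1.0),
           ((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 1.0), 1.0)]
LAMBDA, SIGMA, ETA = 0.01, 0.5, 0.02


def h(t):
    """Assumed smooth surrogate for the l0 indicator of a nonzero weight."""
    return 1.0 - math.exp(-t * t / (2.0 * SIGMA ** 2))


def h_prime(t):
    """Derivative of h with respect to t."""
    return (t / SIGMA ** 2) * math.exp(-t * t / (2.0 * SIGMA ** 2))


def error(w):
    """Penalized batch error: squared loss plus LAMBDA * sum of h(w_i)."""
    loss = 0.0
    for xi, target in SAMPLES:
        y = math.tanh(sum(wi * x for wi, x in zip(w, xi)))
        loss += 0.5 * (y - target) ** 2
    return loss + LAMBDA * sum(h(wi) for wi in w)


def gradient(w):
    """Gradient of error(w), accumulated over the whole batch."""
    grad = [LAMBDA * h_prime(wi) for wi in w]
    for xi, target in SAMPLES:
        y = math.tanh(sum(wi * x for wi, x in zip(w, xi)))
        delta = (y - target) * (1.0 - y * y)  # chain rule through tanh
        for i, x in enumerate(xi):
            grad[i] += delta * x
    return grad


def train(w, steps):
    """Run fixed-step batch gradient descent; return weights and error history."""
    history = [error(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wi - ETA * gi for wi, gi in zip(w, g)]
        history.append(error(w))
    return w, history


weights, errors = train([0.3, -0.2, 0.1], steps=200)
```

For this toy problem the gradient of the error is Lipschitz with a constant of roughly 20, so the step size 0.02 is small enough for the recorded errors to form a nonincreasing, positive sequence, which is the pattern that (16) and (17) describe.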
Proof of (18)
Let \(\beta =\eta -C_2\eta ^2\). By (31), we have
Since \(E({\bf w}^{n+1})>0\), letting \(n\rightarrow \infty \) gives
This immediately gives
\(\square \)
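The three displays of this proof are missing from this copy. A sketch of the telescoping argument they presumably contain, assuming the one-step decrease \(E({\bf w}^{n+1})\le E({\bf w}^{n})-\beta \Vert E_{\bf w}({\bf w}^{n})\Vert ^2\) with \(\beta =\eta -C_2\eta ^2>0\):

```latex
\begin{align*}
E({\bf w}^{n+1})
 &\le E({\bf w}^{n}) - \beta\,\Vert E_{\bf w}({\bf w}^{n})\Vert^{2}
  \le \cdots
  \le E({\bf w}^{1}) - \beta\sum_{k=1}^{n}\Vert E_{\bf w}({\bf w}^{k})\Vert^{2},\\
\sum_{k=1}^{\infty}\Vert E_{\bf w}({\bf w}^{k})\Vert^{2}
 &\le \frac{1}{\beta}\,E({\bf w}^{1}) < \infty,\\
\lim_{n\rightarrow\infty}\Vert E_{\bf w}({\bf w}^{n})\Vert &= 0 .
\end{align*}
```

The second line uses \(E({\bf w}^{n+1})>0\) before letting \(n\rightarrow \infty \), and the third follows because the terms of a convergent series tend to zero.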
To prove (19), we need the following lemma.
Lemma 2
(See Lemma 3 in [23]) Let \(F:\Upphi \subset {\mathbb {R}}^k\rightarrow {\mathbb {R}}\) \((k\ge 1)\) be continuous on a bounded closed region \(\Upphi \), and let \(\Upphi _0=\{{\bf z}\in \Upphi :F({\bf z})=0\}\). Suppose that the projection of \(\Upphi _0\) on each coordinate axis does not contain any interior point. Let the sequence \(\{{\bf z}^n\}\) satisfy:
(i) \(\lim _{n\rightarrow \infty }F({\bf z}^n)=0\);
(ii) \(\lim _{n\rightarrow \infty }\Vert {\bf z}^{n+1}-{\bf z}^n\Vert =0\).
Then, there exists a unique \({\bf z}^*\in \Upphi _0\) such that \(\lim \limits _{n\rightarrow \infty }{\bf z}^n={\bf z}^*.\)
Proof of (19)
Obviously, \(\Vert E_{\bf w} ({\bf w})\Vert \) is a continuous function under Assumptions (A2) and (A3). Using (13) and (18), we have
Furthermore, Assumption (A4) is valid. Thus, applying Lemma 2, there exists a unique \({\bf w}^{*}\in \Upphi \) such that \(\mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf w}^n={\bf w}^{*}\). \(\square \)
Zhang, H., Tang, Y. & Liu, X. Batch gradient training method with smoothing \(\boldsymbol{\ell}_{\bf 0}\) regularization for feedforward neural networks. Neural Comput & Applic 26, 383–390 (2015). https://doi.org/10.1007/s00521-014-1730-x
DOI: https://doi.org/10.1007/s00521-014-1730-x