Batch gradient training method with smoothing \(\boldsymbol{\ell}_{\bf 0}\) regularization for feedforward neural networks

Original Article · Neural Computing and Applications

Abstract

This paper considers the batch gradient method with the smoothing \(\ell _0\) regularization (BGSL0) for training and pruning feedforward neural networks. We show why BGSL0 can produce sparse weights, which are crucial for pruning networks. We prove both the weak convergence and strong convergence of BGSL0 under mild conditions. The decreasing monotonicity of the error functions during the training process is also obtained. Two examples are given to substantiate the theoretical analysis and to show the better sparsity of BGSL0 than three typical \(\ell _p\) regularization methods.
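To make the training scheme concrete, the following minimal sketch implements batch gradient descent on a single-hidden-layer network with a smoothed \(\ell _0\) penalty added to the weights. It is an illustration only, not the authors' implementation: the Gaussian-type smoothing function `h_sigma`, the squared error, the tanh activation, and all names and hyperparameters are assumptions made for the sketch.

```python
# Hedged sketch of BGSL0-style training on a single-hidden-layer network.
# The smoothing function h_sigma(t) = 1 - exp(-t^2/sigma^2) is an assumed
# stand-in for the paper's smoothed l0 penalty, not the authors' definition.
import numpy as np

rng = np.random.default_rng(0)

def h_sigma(t, sigma):
    """Smooth surrogate of the l0 'norm' of a weight: ~0 near 0, ~1 for |t| >> sigma."""
    return 1.0 - np.exp(-(t ** 2) / sigma ** 2)

def h_sigma_prime(t, sigma):
    return (2.0 * t / sigma ** 2) * np.exp(-(t ** 2) / sigma ** 2)

def train_bgsl0(X, y, L=10, lam=1e-3, sigma=0.1, eta=0.05, epochs=2000):
    """Batch gradient descent with a smoothed-l0 penalty on all weights."""
    J, d = X.shape
    V = 0.5 * rng.standard_normal((L, d))   # hidden-layer weights
    w0 = 0.5 * rng.standard_normal(L)       # output weights
    for _ in range(epochs):
        H = np.tanh(X @ V.T)                # hidden outputs for all J samples, shape (J, L)
        err = H @ w0 - y                    # residuals of the squared error
        # batch gradients of the error term
        g_w0 = H.T @ err
        g_V = ((err[:, None] * w0) * (1.0 - H ** 2)).T @ X
        # add the gradient of the smoothed-l0 regularizer
        g_w0 += lam * h_sigma_prime(w0, sigma)
        g_V += lam * h_sigma_prime(V, sigma)
        w0 -= eta * g_w0                    # batch gradient update
        V -= eta * g_V
    penalized_error = 0.5 * np.sum((np.tanh(X @ V.T) @ w0 - y) ** 2) \
        + lam * (np.sum(h_sigma(w0, sigma)) + np.sum(h_sigma(V, sigma)))
    return V, w0, penalized_error
```

For data arrays `X` of shape (J, d) and targets `y` of shape (J,), a call such as `train_bgsl0(X, y)` returns the trained weights and the final penalized error. As \(\sigma \) shrinks, \(h_{\sigma }\) approximates the \(\ell _0\) "norm" more tightly, so the penalty drives redundant weights toward zero, which is what makes pruning the corresponding connections possible.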

References

  1. Hornik K (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366

  2. Rubio JD, Angelov P, Pacheco J (2011) Uniformly stable backpropagation algorithm to train a feedforward neural network. IEEE Trans Neural Netw 22(3):356–366

  3. Sum J, Leung CS, Ho K (2012) Convergence analyses on on-line weight noise injection-based training algorithms for MLPs. IEEE Trans Neural Netw Learn Syst 23(11):1827–1840

  4. Bordignon F, Gomide F (2014) Uninorm based evolving neural networks and approximation capabilities. Neurocomputing 127:13–20

  5. Pratama M, Anavatti SG, Angelov PP, Lughofer E (2014) PANFIS: a novel incremental learning machine. IEEE Trans Neural Netw Learn Syst 25(1):55–68

  6. Rubio JJ (2014) Analytic neural network model of a wind turbine. Soft Comput. doi:10.1007/s00500-014-1290-0

  7. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC-19(6):716–723

  8. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  9. Stathakis D (2009) How many hidden layers and nodes? Int J Remote Sens 30(8):2133–2147

  10. Augasta MG, Kathirvalavakumar T (2011) A novel pruning algorithm for optimizing feedforward neural network of classification problems. Neural Process Lett 34:241–258

  11. Karayiannis NB, Glenn WM (1997) Growing radial basis neural networks: merging supervised and unsupervised learning with network growth techniques. IEEE Trans Neural Netw 8(6):1492–1506

  12. Huang GB, Paramasivan S, Narasimhan S (2005) A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Trans Neural Netw 16(1):57–67

  13. Reed R (1993) Pruning algorithms: a survey. IEEE Trans Neural Netw 4:740–747

  14. Loone S, Irwin G (2001) Improving neural network training solutions using regularisation. Neurocomputing 37:71–90

  15. Setiono R (1997) A penalty-function approach for pruning feedforward neural networks. Neural Comput 9:185–204

  16. Shao HM, Xu DP, Zheng GF, Liu LJ (2012) Convergence of an online gradient method with inner-product penalty and adaptive momentum. Neurocomputing 77:243–252

  17. Karnin ED (1990) A simple procedure for pruning back-propagation trained neural networks. IEEE Trans Neural Netw 1:239–242

  18. Lughofer E (2011) Evolving fuzzy systems—methodologies, advanced concepts and applications. Springer, Berlin

  19. Rubio JJ (2014) Evolving intelligent algorithms for the modelling of brain and eye signals. Appl Soft Comput 14(B):259–268

  20. Ordonez FJ, Iglesias JA, Toledo DP, Ledezma A, Sanchis A (2013) Online activity recognition using evolving classifiers. Expert Syst Appl 40:1248–1255

  21. Saito K, Nakano S (2000) Second-order learning algorithm with squared penalty term. Neural Comput 12:709–729

  22. Zhang HS, Wu W, Liu F, Yao MC (2009) Boundedness and convergence of online gradient method with penalty for feedforward neural networks. IEEE Trans Neural Netw 20(6):1050–1054

  23. Zhang HS, Wu W, Yao MC (2012) Boundedness and convergence of batch back-propagation algorithm with penalty for feedforward neural networks. Neurocomputing 89:141–146

  24. Shao HM, Zheng GF (2011) Boundedness and convergence of online gradient method with penalty and momentum. Neurocomputing 74:765–770

  25. Yu X, Chen QF (2012) Convergence of gradient method with penalty for Ridge Polynomial neural network. Neurocomputing 97:405–409

  26. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J Roy Stat Soc Ser B Methodol 58:267–288

  27. Wu W, Fan QW, Zurada JM, Wang J, Yang DK, Liu Y (2014) Batch gradient method with smoothing \(L_{1/2}\) regularization for training of feedforward neural networks. Neural Netw 50:72–78

  28. Fan QW, Zurada JM, Wu W (2014) Convergence of online gradient method for feedforward neural networks with smoothing \(L_{1/2}\) regularization penalty. Neurocomputing 131:208–216

  29. Liu Y, Wu W, Fan QW, Yang DK, Wang J (2014) A modified gradient learning algorithm with smoothing \(L_{1/2}\) regularization for Takagi–Sugeno fuzzy models. Neurocomputing 138:229–237

  30. Candes EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inform Theory 51(12):4203–4215

  31. Wang YF, Liu P, Li ZH, Sun T, Yang CC, Zheng QS (2013) Data regularization using Gaussian beams decomposition and sparse norms. J Inverse Ill-Posed Probl 21(1):1–23

  32. Liu Y, Yang J, Li L, Wu W (2012) Negative effects of sufficiently small initial weights on back-propagation neural networks. J Zhejiang Univ-Sci C (Comput Electron) 13(8):585–592

Acknowledgments

We are grateful to the reviewers for their insightful comments. This research is supported by the National Natural Science Foundation of China (No. 61101228) and the China Postdoctoral Science Foundation (No. 2012M520623).

Author information

Corresponding author

Correspondence to Huisheng Zhang.

Appendix

In this Appendix, we give the proof of Theorem 1. For simplicity, we introduce the following notation:

$$\begin{aligned} {\bf F}^{n,j}={\bf F}({\bf V}^{n}\varvec{\xi }^j), \end{aligned}$$
(21)

for \(n=1,2,\ldots,\) and \(j=1,\ldots,J.\)

Lemma 1

Suppose Assumptions (A1)–(A3) are valid. Then \(E_{\bf w} ({\bf w})\) satisfies a Lipschitz condition; that is, there exists a positive constant \(C_2\) such that

$$\begin{aligned} \Vert E_{\bf w} ({\bf w}^{n+1} )-E_{\bf w} ({\bf w}^{n} )\Vert \le C_2\Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert . \end{aligned}$$
(22)

In particular, for \(\theta \geqslant 0\), we have

$$\begin{aligned} \Vert E_{\bf w} ({\bf w}^{n}+\theta ({\bf w}^{n+1}-{\bf w}^{n}) )-E_{\bf w} ({\bf w}^{n} )\Vert \le C_2 \theta \Vert {\bf w}^{n+1}-{\bf w}^{n}\Vert . \end{aligned}$$
(23)

Proof

Using (12a) and the triangle inequality, we have

$$\begin{aligned}&\Vert E_{{\bf w}_0} ({\bf w}^{n+1} )-E_{{\bf w}_0} ({\bf w}^{n} )\Vert \\&\quad =\left\| \sum \limits _{j=1}^J(e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}){\bf F}^{n+1,j}-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j}){\bf F}^{n,j})+\lambda (H_{\sigma }'({\bf w}_0^{n+1})-H_{\sigma }'({\bf w}_0^{n}))\right\| \\&\quad \le \sum \limits _{j=1}^J\Vert e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}){\bf F}^{n+1,j}-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j}){\bf F}^{n,j}\Vert +\lambda \Vert H_{\sigma }'({\bf w}_0^{n+1})-H_{\sigma }'({\bf w}_0^{n})\Vert . \end{aligned}$$
(24)

To further estimate the right-hand side of the above inequality, we consider the change of \({\bf F}^{n,j}\) defined in (21) between two consecutive iterations:

$$\begin{aligned} \Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert&= \left\| \left( \begin{array}{c} f({\bf w}_1^{n+1}\cdot \varvec{\xi } ^j)-f({\bf w}_1^{n}\cdot \varvec{\xi } ^j) \\ \vdots \\ f({\bf w}_L^{n+1}\cdot \varvec{\xi } ^j)-f({\bf w}_L^{n}\cdot \varvec{\xi } ^j)\end{array}\right) \right\| \\&= \left\| \left( \begin{array}{c} f'(t_1)({\bf w}_1^{n+1}-{\bf w}_1^{n})\cdot \varvec{\xi } ^j\\ \vdots \\ f'(t_L)({\bf w}_L^{n+1}-{\bf w}_L^{n})\cdot \varvec{\xi } ^j\end{array}\right) \right\| \\&\le C_{3}\sum \limits _{l=1}^{L}\Vert {\bf w}^{n+1}_l-{\bf w}^{n}_l\Vert, \end{aligned}$$
(25)

where \(t_l\) lies on the segment between \({\bf w}_l^{n+1}\cdot \varvec{\xi }^j\) and \({\bf w}_l^{n}\cdot \varvec{\xi }^j\), for \(l=1,\ldots, L\), and \(C_3=\sup f'(t)\max \limits _{1\le j\le J}\Vert \varvec{\xi }^j\Vert \).

It is easy to see that \(e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})\) is Lipschitz continuous; let the positive constant \(C_4\) be the corresponding Lipschitz constant. Using (25), Assumption (A1) and the Cauchy–Schwarz inequality, we have

$$\begin{aligned}&\Vert e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}){\bf F}^{n+1,j}-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j}){\bf F}^{n,j}\Vert \\&\quad \le |e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j})-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|\Vert {\bf F}^{n+1,j}\Vert +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|\Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert \\&\quad \le C_4\Vert {\bf F}^{n+1,j}\Vert |{\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}-{\bf w}_0^{n}\cdot {\bf F}^{n,j}| +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|\Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert \\&\quad \le C_4\Vert {\bf F}^{n+1,j}\Vert ^2 \Vert {\bf w}_0^{n+1}-{\bf w}_0^{n}\Vert +(C_4\Vert {\bf F}^{n+1,j}\Vert \Vert {\bf w}_0^{n}\Vert +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|)\Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert \\&\quad \le C_4\Vert {\bf F}^{n+1,j}\Vert ^2 \Vert {\bf w}_0^{n+1}-{\bf w}_0^{n}\Vert +C_{5}\sum \limits _{l=1}^{L}\Vert {\bf w}^{n+1}_l-{\bf w}^{n}_l\Vert \\&\quad \le C_6 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert, \end{aligned}$$
(26)

with

$$\begin{aligned} C_{5}&= C_3\sup (C_4\Vert {\bf F}^{n+1,j}\Vert \Vert {\bf w}_0^{n}\Vert +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|),\\ C_6&= \sqrt{L+1}\max \{ C_4\sup \Vert {\bf F}^{n+1,j}\Vert ^2,C_5\}. \end{aligned}$$

Analogously to the derivation of (25), using the definition of \(H_{\sigma }^{\prime }(\cdot )\), Assumption (A3) and the mean value theorem, we have

$$\begin{aligned} \Vert H_{\sigma }'({\bf w}_0^{n+1})-H_{\sigma }'({\bf w}_0^{n})\Vert&= \left\| \left( \begin{array}{c} h_{\sigma }'(w_{01}^{n+1})-h_{\sigma }'(w_{01}^{n})\\ \vdots \\ h_{\sigma }'(w_{0L}^{n+1})-h_{\sigma }'(w_{0L}^{n})\end{array} \right) \right\| \\&\le C_7 \Vert {\bf w}_{0}^{n+1}-{\bf w}_{0}^{n}\Vert, \end{aligned}$$
(27)

where \(C_7=\sup h_{\sigma }^{\prime \prime }(t)\).
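For a concrete sense of why such a bound exists, consider a Gaussian-type smoothing function, used here only as an assumed example (the paper's own \(h_{\sigma }\) is defined in the main text). Its second derivative is uniformly bounded:

$$\begin{aligned} h_{\sigma }(t)=1-e^{-t^{2}/\sigma ^{2}},\qquad h_{\sigma }''(t)=\frac{2}{\sigma ^{2}}\Big (1-\frac{2t^{2}}{\sigma ^{2}}\Big )e^{-t^{2}/\sigma ^{2}},\qquad \sup _{t\in {\mathbb {R}}}|h_{\sigma }''(t)|=\frac{2}{\sigma ^{2}}, \end{aligned}$$

so a finite constant \(C_7\) is indeed available for every fixed \(\sigma >0\).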

Combining (24), (26), and (27), we have

$$\begin{aligned} \Vert E_{{\bf w}_0} ({\bf w}^{n+1} )-E_{{\bf w}_0} ({\bf w}^{n} )\Vert \le JC_6 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert +\lambda C_7 \Vert {\bf w}_{0}^{n+1}-{\bf w}_{0}^{n}\Vert . \end{aligned}$$
(28)

Similarly, for \(l=1,\ldots,L\), there exists a constant \(C_8\) such that

$$\begin{aligned} \Vert E_{{\bf w}_l} ({\bf w}^{n+1} )-E_{{\bf w}_l} ({\bf w}^{n} )\Vert \le J C_8 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert + \lambda C_7 \Vert {\bf w}_{l}^{n+1}-{\bf w}_{l}^{n}\Vert . \end{aligned}$$
(29)

Then, Eqs. (10), (11), (28) and (29) validate (22):

$$\begin{aligned} \Vert E_{\bf w} ({\bf w}^{n+1} )-E_{\bf w} ({\bf w}^{n} )\Vert \le C_2 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert, \end{aligned}$$
(30)

where \(C_2=J\sqrt{C_6^2+LC_8^2}+\lambda C_7\).

Equation (23) then follows as a straightforward application of (22). \(\square \)

Now we proceed to the proof of Theorem 1 by establishing (16)–(19) separately.

Proof of (16)

By the differential mean value theorem, there exists a constant \(\theta \in [0,1]\), such that

$$\begin{aligned} E({\bf w}^{n+1})- E({\bf w}^n)&= (E_{\bf w} ({\bf w}^{n}+\theta ({\bf w}^{n+1}-{\bf w}^{n}) ))^T({\bf w}^{n+1}- {\bf w}^n) \\&= (E_{\bf w} ({\bf w}^{n}))^T({\bf w}^{n+1}- {\bf w}^n) \\&\quad +\,(E_{\bf w} ({\bf w}^{n} +\theta ({\bf w}^{n+1}-{\bf w}^{n}) )-(E_{\bf w} ({\bf w}^{n})))^T({\bf w}^{n+1}- {\bf w}^n) \\& \le -\,\eta \Vert E_{\bf w} ({\bf w}^{n})\Vert ^2+C_2\theta \Vert {\bf w}^{n+1}- {\bf w}^n\Vert ^2 \\&\le \,(-\eta +C_2\eta ^2)\Vert E_{\bf w} ({\bf w}^{n})\Vert ^2. \end{aligned}$$
(31)

Since \(-\eta +C_2\eta ^2=-\eta (1-C_2\eta )\), (16) holds as long as the learning rate \(\eta \) satisfies

$$\begin{aligned} 0<\eta <\frac{1}{C_2}. \end{aligned}$$
(32)

\(\square \)
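Condition (32) can be illustrated numerically on a toy problem where the Lipschitz constant of the gradient is known in closed form; the quadratic objective below is an assumed stand-in for \(E({\bf w})\), not the network error function of the paper.

```python
# Toy check of the step-size condition (32): for E(w) = 0.5 * w^T A w the
# gradient is A w, whose Lipschitz constant C_2 equals the largest eigenvalue
# of A. With 0 < eta < 1/C_2 the objective decreases monotonically, as in (16);
# with eta > 2/C_2 the iteration typically diverges.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)           # symmetric positive definite
C2 = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient A @ w

def run(eta, steps=50):
    w = rng.standard_normal(5)
    values = []
    for _ in range(steps):
        values.append(0.5 * w @ A @ w)
        w = w - eta * (A @ w)     # batch gradient step, cf. (13)
    return values

good = run(0.9 / C2)
bad = run(2.5 / C2)
assert all(b <= a for a, b in zip(good, good[1:])), "should decrease monotonically"
print("eta < 1/C2: final E =", good[-1])
print("eta > 2/C2: final E =", bad[-1], "(typically diverges)")
```

With \(\eta =0.9/C_2\) the objective values decrease monotonically, mirroring (16), while a step size beyond \(2/C_2\) makes the iteration blow up on this toy problem.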

Proof of (17)

Equation (17) is obtained directly from (16) together with \(E({\bf w}^n)>0\) for \(n=1,2,\ldots\). \(\square \)

Proof of (18)

Let \(\beta =\eta -C_2\eta ^2\), which is positive by (32). By (31), we have

$$\begin{aligned} E({\bf w}^{n+1})&\le E({\bf w}^n)-\beta \Vert E_{\bf w} ({\bf w}^{n})\Vert ^2 \\&\le \ldots \le E({\bf w}^{0})-\beta \sum \limits _{t=0}^n\Vert E_{\bf w} ({\bf w}^{t})\Vert ^2. \end{aligned}$$
(33)

Since \(E({\bf w}^{n+1})>0\), letting \(n\rightarrow \infty \) gives

$$\begin{aligned} \beta \sum \limits _{t=0}^{\infty }\Vert E_{\bf w} ({\bf w}^{t})\Vert ^2\le E({\bf w}^{0})<\infty . \end{aligned}$$
(34)

Since the general term of a convergent series tends to zero, this immediately gives

$$\begin{aligned} \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\Vert {E_{\bf w}({\bf w}^n)}\Vert =0. \end{aligned}$$
(35)

\(\square \)

To prove (19), we need the following lemma.

Lemma 2

(See Lemma 3 in [23]) Let \(F:\Upphi \subset {\mathbb {R}}^k\rightarrow {\mathbb {R}}\) \((k\ge 1)\) be continuous on a bounded closed region \(\Upphi \), and let \(\Upphi _0=\{{\bf z}\in \Upphi :F({\bf z})=0\}\). Suppose that the projection of \(\Upphi _0\) onto each coordinate axis contains no interior point. Let the sequence \(\{{\bf z}^n\}\) satisfy:

  (i) \(\hbox {lim}_{n\rightarrow \infty }F({\bf z}^n)=0\);

  (ii) \(\hbox {lim}_{n\rightarrow \infty }\Vert {\bf z}^{n+1}-{\bf z}^n\Vert =0\).

Then, there exists a unique \({\bf z}^*\in \Upphi_0\) such that \(\mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf z}^n={\bf z}^*.\)

Proof of (19)

Obviously \(\Vert E_{\bf w} ({\bf w})\Vert \) is a continuous function under Assumptions \((A2)\) and \((A3)\). Using (13) and (18), we have

$$\begin{aligned} \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\Vert {\bf w}^{n+1}-{\bf w}^n\Vert =\eta \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\Vert {E_{\bf w}({\bf w}^n)}\Vert =0. \end{aligned}$$
(36)

Furthermore, Assumption (A4) is valid. Thus, applying Lemma 2, there exists a unique \({\bf w}^{*}\in \Upphi \) such that \(\mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf w}^n={\bf w}^{*}\). \(\square \)

About this article

Cite this article

Zhang, H., Tang, Y. & Liu, X. Batch gradient training method with smoothing \(\boldsymbol{\ell}_{\bf 0}\) regularization for feedforward neural networks. Neural Comput & Applic 26, 383–390 (2015). https://doi.org/10.1007/s00521-014-1730-x
