Batch gradient training method with smoothing \(\boldsymbol{\ell}_{\bf 0}\) regularization for feedforward neural networks

Original Article · Neural Computing and Applications

Abstract

This paper considers the batch gradient method with the smoothing \(\ell _0\) regularization (BGSL0) for training and pruning feedforward neural networks. We show why BGSL0 can produce sparse weights, which are crucial for pruning networks. We prove both the weak convergence and strong convergence of BGSL0 under mild conditions. The decreasing monotonicity of the error functions during the training process is also obtained. Two examples are given to substantiate the theoretical analysis and to show the better sparsity of BGSL0 than three typical \(\ell _p\) regularization methods.
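To make the training scheme concrete, the following minimal sketch implements batch gradient descent on a single-hidden-layer network with a smoothed \(\ell _0\) penalty added to the weights. It is an illustration only, not the authors' implementation: the Gaussian-type smoothing function `h_sigma`, the squared error, the tanh activation, and all names and hyperparameters are assumptions made for the sketch.

```python
# Hedged sketch of BGSL0-style training on a single-hidden-layer network.
# The smoothing function h_sigma(t) = 1 - exp(-t^2/sigma^2) is an assumed
# stand-in for the paper's smoothed l0 penalty, not the authors' definition.
import numpy as np

rng = np.random.default_rng(0)

def h_sigma(t, sigma):
    """Smooth surrogate of the l0 'norm' of a weight: ~0 near 0, ~1 for |t| >> sigma."""
    return 1.0 - np.exp(-(t ** 2) / sigma ** 2)

def h_sigma_prime(t, sigma):
    return (2.0 * t / sigma ** 2) * np.exp(-(t ** 2) / sigma ** 2)

def train_bgsl0(X, y, L=10, lam=1e-3, sigma=0.1, eta=0.05, epochs=2000):
    """Batch gradient descent with a smoothed-l0 penalty on all weights."""
    J, d = X.shape
    V = 0.5 * rng.standard_normal((L, d))   # hidden-layer weights
    w0 = 0.5 * rng.standard_normal(L)       # output weights
    for _ in range(epochs):
        H = np.tanh(X @ V.T)                # hidden outputs for all J samples, shape (J, L)
        err = H @ w0 - y                    # residuals of the squared error
        # batch gradients of the error term
        g_w0 = H.T @ err
        g_V = ((err[:, None] * w0) * (1.0 - H ** 2)).T @ X
        # add the gradient of the smoothed-l0 regularizer
        g_w0 += lam * h_sigma_prime(w0, sigma)
        g_V += lam * h_sigma_prime(V, sigma)
        w0 -= eta * g_w0                    # batch gradient update
        V -= eta * g_V
    penalized_error = 0.5 * np.sum((np.tanh(X @ V.T) @ w0 - y) ** 2) \
        + lam * (np.sum(h_sigma(w0, sigma)) + np.sum(h_sigma(V, sigma)))
    return V, w0, penalized_error
```

For data arrays `X` of shape (J, d) and targets `y` of shape (J,), a call such as `train_bgsl0(X, y)` returns the trained weights and the final penalized error. As \(\sigma \) shrinks, \(h_{\sigma }\) approximates the \(\ell _0\) "norm" more tightly, so the penalty drives redundant weights toward zero, which is what makes pruning the corresponding connections possible.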

References

  1. Hornik K (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366

  2. Rubio JD, Angelov P, Pacheco J (2011) Uniformly stable backpropagation algorithm to train a feedforward neural network. IEEE Trans Neural Netw 22(3):356–366

  3. Sum J, Leung CS, Ho K (2012) Convergence analyses on on-line weight noise injection-based training algorithms for MLPs. IEEE Trans Neural Netw Learn Syst 23(11):1827–1840

  4. Bordignon F, Gomide F (2014) Uninorm based evolving neural networks and approximation capabilities. Neurocomputing 127:13–20

  5. Pratama M, Anavatti SG, Angelov PP, Lughofer E (2014) PANFIS: a novel incremental learning machine. IEEE Trans Neural Netw Learn Syst 25(1):55–68

  6. Rubio JJ (2014) Analytic neural network model of a wind turbine. Soft Comput. doi:10.1007/s00500-014-1290-0

  7. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control AC-19(6):716–723

  8. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  9. Stathakis D (2009) How many hidden layers and nodes? Int J Remote Sens 30(8):2133–2147

  10. Augasta MG, Kathirvalavakumar T (2011) A novel pruning algorithm for optimizing feedforward neural network of classification problems. Neural Process Lett 34:241–258

  11. Karayiannis NB, Glenn WM (1997) Growing radial basis neural networks: merging supervised and unsupervised learning with network growth techniques. IEEE Trans Neural Netw 8(6):1492–1506

  12. Huang GB, Paramasivan S, Narasimhan S (2005) A generalized growing and pruning RBF (GGAP-RBF) neural network for function approximation. IEEE Trans Neural Netw 16(1):57–67

  13. Reed R (1993) Pruning algorithms: a survey. IEEE Trans Neural Netw 4:740–747

  14. Loone S, Irwin G (2001) Improving neural network training solutions using regularisation. Neurocomputing 37:71–90

  15. Setiono R (1997) A penalty-function approach for pruning feedforward neural networks. Neural Comput 9:185–204

  16. Shao HM, Xu DP, Zheng GF, Liu LJ (2012) Convergence of an online gradient method with inner-product penalty and adaptive momentum. Neurocomputing 77:243–252

  17. Karnin ED (1990) A simple procedure for pruning back-propagation trained neural networks. IEEE Trans Neural Netw 1:239–242

  18. Lughofer E (2011) Evolving fuzzy systems—methodologies, advanced concepts and applications. Springer, Berlin

  19. Rubio JJ (2014) Evolving intelligent algorithms for the modelling of brain and eye signals. Appl Soft Comput 14(B):259–268

  20. Ordonez FJ, Iglesias JA, Toledo DP, Ledezma A, Sanchis A (2013) Online activity recognition using evolving classifiers. Expert Syst Appl 40:1248–1255

  21. Saito K, Nakano S (2000) Second-order learning algorithm with squared penalty term. Neural Comput 12:709–729

  22. Zhang HS, Wu W, Liu F, Yao MC (2009) Boundedness and convergence of online gradient method with penalty for feedforward neural networks. IEEE Trans Neural Netw 20(6):1050–1054

  23. Zhang HS, Wu W, Yao MC (2012) Boundedness and convergence of batch back-propagation algorithm with penalty for feedforward neural networks. Neurocomputing 89:141–146

  24. Shao HM, Zheng GF (2011) Boundedness and convergence of online gradient method with penalty and momentum. Neurocomputing 74:765–770

  25. Yu X, Chen QF (2012) Convergence of gradient method with penalty for Ridge Polynomial neural network. Neurocomputing 97:405–409

  26. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J Roy Stat Soc Ser B Methodol 58:267–288

  27. Wu W, Fan QW, Zurada JM, Wang J, Yang DK, Liu Y (2014) Batch gradient method with smoothing \(L_{1/2}\) regularization for training of feedforward neural networks. Neural Netw 50:72–78

  28. Fan QW, Zurada JM, Wu W (2014) Convergence of online gradient method for feedforward neural networks with smoothing \(L_{1/2}\) regularization penalty. Neurocomputing 131:208–216

  29. Liu Y, Wu W, Fan QW, Yang DK, Wang J (2014) A modified gradient learning algorithm with smoothing \(L_{1/2}\) regularization for Takagi–Sugeno fuzzy models. Neurocomputing 138:229–237

  30. Candes EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inform Theory 51(12):4203–4215

  31. Wang YF, Liu P, Li ZH, Sun T, Yang CC, Zheng QS (2013) Data regularization using Gaussian beams decomposition and sparse norms. J Inverse Ill-Posed Probl 21(1):1–23

  32. Liu Y, Yang J, Li L, Wu W (2012) Negative effects of sufficiently small initial weights on back-propagation neural networks. J Zhejiang Univ-Sci C (Comput Electron) 13(8):585–592

Acknowledgments

We are grateful to the reviewers for their insightful comments. This research is supported by the National Natural Science Foundation of China (No. 61101228) and the China Postdoctoral Science Foundation (No. 2012M520623).

Author information

Corresponding author

Correspondence to Huisheng Zhang.

Appendix

In this Appendix, we give the proof of Theorem 1. For simplicity, we introduce the following notation:

$$\begin{aligned} {\bf F}^{n,j}={\bf F}({\bf V}^{n}\varvec{\xi }^j), \end{aligned}$$
(21)

for \(n=1,2,\ldots,\) and \(j=1,\ldots,J.\)

Lemma 1

Suppose Assumptions (A1)–(A3) are valid. Then \(E_{\bf w} ({\bf w})\) satisfies a Lipschitz condition; that is, there exists a positive constant \(C_2\) such that

$$\begin{aligned} \Vert E_{\bf w} ({\bf w}^{n+1} )-E_{\bf w} ({\bf w}^{n} )\Vert \le C_2\Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert . \end{aligned}$$
(22)

In particular, for \(\theta \geqslant 0\), we have

$$\begin{aligned} \Vert E_{\bf w} ({\bf w}^{n}+\theta ({\bf w}^{n+1}-{\bf w}^{n}) )-E_{\bf w} ({\bf w}^{n} )\Vert \le C_2 \theta \Vert {\bf w}^{n+1}-{\bf w}^{n}\Vert . \end{aligned}$$
(23)

Proof

Using (12a) and the triangle inequality, we have

$$\begin{aligned}&\Vert E_{{\bf w}_0} ({\bf w}^{n+1} )-E_{{\bf w}_0} ({\bf w}^{n} )\Vert \\&\quad =\left\| \sum \limits _{j=1}^J(e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}){\bf F}^{n+1,j}-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j}){\bf F}^{n,j})+\lambda (H_{\sigma }'({\bf w}_0^{n+1})-H_{\sigma }'({\bf w}_0^{n}))\right\| \\&\quad \le \sum \limits _{j=1}^J\Vert e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}){\bf F}^{n+1,j}-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j}){\bf F}^{n,j}\Vert +\lambda \Vert H_{\sigma }'({\bf w}_0^{n+1})-H_{\sigma }'({\bf w}_0^{n})\Vert . \end{aligned}$$
(24)

To further estimate the right-hand side of the above inequality, we consider the change of \({\bf F}^{n,j}\) defined in (21) between two consecutive iterations:

$$\begin{aligned} \Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert&= \left\| \left( \begin{array}{c} f({\bf w}_1^{n+1}\cdot \varvec{\xi } ^j)-f({\bf w}_1^{n}\cdot \varvec{\xi } ^j) \\ \vdots \\ f({\bf w}_L^{n+1}\cdot \varvec{\xi } ^j)-f({\bf w}_L^{n}\cdot \varvec{\xi } ^j)\end{array}\right) \right\| \\&= \left\| \left( \begin{array}{c} f'(t_1)({\bf w}_1^{n+1}-{\bf w}_1^{n})\cdot \varvec{\xi } ^j\\ \vdots \\ f'(t_L)({\bf w}_L^{n+1}-{\bf w}_L^{n})\cdot \varvec{\xi } ^j\end{array}\right) \right\| \\&\le C_{3}\sum \limits _{l=1}^{L}\Vert {\bf w}^{n+1}_l-{\bf w}^{n}_l\Vert, \end{aligned}$$
(25)

where \(t_l\) lies on the segment between \({\bf w}_l^{n+1}\cdot \varvec{\xi }^j\) and \({\bf w}_l^{n}\cdot \varvec{\xi }^j\), for \(l=1,\ldots, L\), and \(C_3=\sup f'(t)\max \limits _{1\le j\le J}\Vert \varvec{\xi }^j\Vert \).

It is easy to see that \(e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})\) is Lipschitz continuous; let the positive constant \(C_4\) be the corresponding Lipschitz constant. Using (25), Assumption (A1) and the Cauchy–Schwarz inequality, we have

$$\begin{aligned}&\Vert e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}){\bf F}^{n+1,j}-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j}){\bf F}^{n,j}\Vert \\&\quad \le |e_j'({\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j})-e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|\Vert {\bf F}^{n+1,j}\Vert +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|\Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert \\&\quad \le C_4\Vert {\bf F}^{n+1,j}\Vert |{\bf w}_0^{n+1}\cdot {\bf F}^{n+1,j}-{\bf w}_0^{n}\cdot {\bf F}^{n,j}| +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|\Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert \\&\quad \le C_4\Vert {\bf F}^{n+1,j}\Vert ^2 \Vert {\bf w}_0^{n+1}-{\bf w}_0^{n}\Vert +(C_4\Vert {\bf F}^{n+1,j}\Vert \Vert {\bf w}_0^{n}\Vert +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|)\Vert {\bf F}^{n+1,j}-{\bf F}^{n,j}\Vert \\&\quad \le C_4\Vert {\bf F}^{n+1,j}\Vert ^2 \Vert {\bf w}_0^{n+1}-{\bf w}_0^{n}\Vert +C_{5}\sum \limits _{l=1}^{L}\Vert {\bf w}^{n+1}_l-{\bf w}^{n}_l\Vert \\&\quad \le C_6 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert, \end{aligned}$$
(26)

with

$$\begin{aligned} C_{5}&= C_3\sup (C_4\Vert {\bf F}^{n+1,j}\Vert \Vert {\bf w}_0^{n}\Vert +|e_j'({\bf w}_0^{n}\cdot {\bf F}^{n,j})|),\\ C_6&= \sqrt{L+1}\max \{ C_4\sup \Vert {\bf F}^{n+1,j}\Vert ^2,C_5\}. \end{aligned}$$

Analogously to the derivation of (25), using the definition of \(H_{\sigma }^{\prime }(\cdot )\), Assumption (A3) and the mean value theorem, we have

$$\begin{aligned} \Vert H_{\sigma }'({\bf w}_0^{n+1})-H_{\sigma }'({\bf w}_0^{n})\Vert&= \left\| \left( \begin{array}{c} h_{\sigma }'(w_{01}^{n+1})-h_{\sigma }'(w_{01}^{n})\\ \vdots \\ h_{\sigma }'(w_{0L}^{n+1})-h_{\sigma }'(w_{0L}^{n})\end{array} \right) \right\| \\&\le C_7 \Vert {\bf w}_{0}^{n+1}-{\bf w}_{0}^{n}\Vert, \end{aligned}$$
(27)

where \(C_7=\sup h_{\sigma }^{\prime \prime }(t)\).
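For a concrete sense of why such a bound exists, consider a Gaussian-type smoothing function, used here only as an assumed example (the paper's own \(h_{\sigma }\) is defined in the main text). Its second derivative is uniformly bounded:

$$\begin{aligned} h_{\sigma }(t)=1-e^{-t^{2}/\sigma ^{2}},\qquad h_{\sigma }''(t)=\frac{2}{\sigma ^{2}}\Big (1-\frac{2t^{2}}{\sigma ^{2}}\Big )e^{-t^{2}/\sigma ^{2}},\qquad \sup _{t\in {\mathbb {R}}}|h_{\sigma }''(t)|=\frac{2}{\sigma ^{2}}, \end{aligned}$$

so a finite constant \(C_7\) is indeed available for every fixed \(\sigma >0\).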

Combining (24), (26), and (27), we have

$$\begin{aligned} \Vert E_{{\bf w}_0} ({\bf w}^{n+1} )-E_{{\bf w}_0} ({\bf w}^{n} )\Vert \le JC_6 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert +\lambda C_7 \Vert {\bf w}_{0}^{n+1}-{\bf w}_{0}^{n}\Vert . \end{aligned}$$
(28)

Similarly, for \(l=1,\ldots,L\), there exists a constant \(C_8\) such that

$$\begin{aligned} \Vert E_{{\bf w}_l} ({\bf w}^{n+1} )-E_{{\bf w}_l} ({\bf w}^{n} )\Vert \le J C_8 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert + \lambda C_7 \Vert {\bf w}_{l}^{n+1}-{\bf w}_{l}^{n}\Vert . \end{aligned}$$
(29)

Then, Eqs. (10), (11), (28) and (29) validate (22):

$$\begin{aligned} \Vert E_{\bf w} ({\bf w}^{n+1} )-E_{\bf w} ({\bf w}^{n} )\Vert \le C_2 \Vert {\bf w}^{n+1}-{\bf w}{^n}\Vert, \end{aligned}$$
(30)

where \(C_2=J\sqrt{C_6^2+LC_8^2}+\lambda C_7\).

Equation (23) then follows as a straightforward application of (22). \(\square \)

Now we proceed to the proof of Theorem 1 by establishing (16)–(19) separately.

Proof of (16)

By the differential mean value theorem, there exists a constant \(\theta \in [0,1]\), such that

$$\begin{aligned} E({\bf w}^{n+1})- E({\bf w}^n)&= (E_{\bf w} ({\bf w}^{n}+\theta ({\bf w}^{n+1}-{\bf w}^{n}) ))^T({\bf w}^{n+1}- {\bf w}^n) \\&= (E_{\bf w} ({\bf w}^{n}))^T({\bf w}^{n+1}- {\bf w}^n) \\&\quad +\,(E_{\bf w} ({\bf w}^{n} +\theta ({\bf w}^{n+1}-{\bf w}^{n}) )-(E_{\bf w} ({\bf w}^{n})))^T({\bf w}^{n+1}- {\bf w}^n) \\& \le -\,\eta \Vert E_{\bf w} ({\bf w}^{n})\Vert ^2+C_2\theta \Vert {\bf w}^{n+1}- {\bf w}^n\Vert ^2 \\&\le \,(-\eta +C_2\eta ^2)\Vert E_{\bf w} ({\bf w}^{n})\Vert ^2. \end{aligned}$$
(31)

Since \(-\eta +C_2\eta ^2=-\eta (1-C_2\eta )\), (16) holds as long as the learning rate \(\eta \) satisfies

$$\begin{aligned} 0<\eta <\frac{1}{C_2}. \end{aligned}$$
(32)

\(\square \)
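Condition (32) can be illustrated numerically on a toy problem where the Lipschitz constant of the gradient is known in closed form; the quadratic objective below is an assumed stand-in for \(E({\bf w})\), not the network error function of the paper.

```python
# Toy check of the step-size condition (32): for E(w) = 0.5 * w^T A w the
# gradient is A w, whose Lipschitz constant C_2 equals the largest eigenvalue
# of A. With 0 < eta < 1/C_2 the objective decreases monotonically, as in (16);
# with eta > 2/C_2 the iteration typically diverges.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)           # symmetric positive definite
C2 = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient A @ w

def run(eta, steps=50):
    w = rng.standard_normal(5)
    values = []
    for _ in range(steps):
        values.append(0.5 * w @ A @ w)
        w = w - eta * (A @ w)     # batch gradient step, cf. (13)
    return values

good = run(0.9 / C2)
bad = run(2.5 / C2)
assert all(b <= a for a, b in zip(good, good[1:])), "should decrease monotonically"
print("eta < 1/C2: final E =", good[-1])
print("eta > 2/C2: final E =", bad[-1], "(typically diverges)")
```

With \(\eta =0.9/C_2\) the objective values decrease monotonically, mirroring (16), while a step size beyond \(2/C_2\) makes the iteration blow up on this toy problem.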

Proof of (17)

Equation (17) is obtained directly from (16) together with \(E({\bf w}^n)>0\) for \(n=1,2,\ldots\). \(\square \)

Proof of (18)

Let \(\beta =\eta -C_2\eta ^2\), which is positive by (32). By (31), we have

$$\begin{aligned} E({\bf w}^{n+1})&\le E({\bf w}^n)-\beta \Vert E_{\bf w} ({\bf w}^{n})\Vert ^2 \\&\le \ldots \le E({\bf w}^{0})-\beta \sum \limits _{t=0}^n\Vert E_{\bf w} ({\bf w}^{t})\Vert ^2. \end{aligned}$$
(33)

Since \(E({\bf w}^{n+1})>0\), letting \(n\rightarrow \infty \) gives

$$\begin{aligned} \beta \sum \limits _{t=0}^{\infty }\Vert E_{\bf w} ({\bf w}^{t})\Vert ^2\le E({\bf w}^{0})<\infty . \end{aligned}$$
(34)

Since the general term of a convergent series tends to zero, this immediately gives

$$\begin{aligned} \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\Vert {E_{\bf w}({\bf w}^n)}\Vert =0. \end{aligned}$$
(35)

\(\square \)

To prove (19), we need the following lemma.

Lemma 2

(See Lemma 3 in [23]) Let \(F:\Upphi \subset {\mathbb {R}}^k\rightarrow {\mathbb {R}}\) \((k\ge 1)\) be continuous on a bounded closed region \(\Upphi \), and let \(\Upphi _0=\{{\bf z}\in \Upphi :F({\bf z})=0\}\). Suppose that the projection of \(\Upphi _0\) onto each coordinate axis contains no interior point. Let the sequence \(\{{\bf z}^n\}\) satisfy:

  (i) \(\hbox {lim}_{n\rightarrow \infty }F({\bf z}^n)=0\);

  (ii) \(\hbox {lim}_{n\rightarrow \infty }\Vert {\bf z}^{n+1}-{\bf z}^n\Vert =0\).

Then, there exists a unique \({\bf z}^*\in \Upphi_0\) such that \(\mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf z}^n={\bf z}^*.\)

Proof of (19)

Obviously \(\Vert E_{\bf w} ({\bf w})\Vert \) is a continuous function under Assumptions \((A2)\) and \((A3)\). Using (13) and (18), we have

$$\begin{aligned} \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\Vert {\bf w}^{n+1}-{\bf w}^n\Vert =\eta \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\Vert {E_{\bf w}({\bf w}^n)}\Vert =0. \end{aligned}$$
(36)

Furthermore, Assumption (A4) is valid. Thus, applying Lemma 2, there exists a unique \({\bf w}^{*}\in \Upphi \) such that \(\mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf w}^n={\bf w}^{*}\). \(\square \)

About this article

Cite this article

Zhang, H., Tang, Y. & Liu, X. Batch gradient training method with smoothing \(\boldsymbol{\ell}_{\bf 0}\) regularization for feedforward neural networks. Neural Comput & Applic 26, 383–390 (2015). https://doi.org/10.1007/s00521-014-1730-x
