1 Introduction

As universal approximators [1], multilayer feedforward neural networks (FNNs) [26] have been applied in many fields. According to approximation theory, a network with a simpler structure is preferred, because it generalizes better than a much more complex network that achieves the same training accuracy. Though many model selection methods have been proposed, such as AIC [7] and BIC [8], it is still difficult in theory to directly compute the exact optimal number of nodes that the network needs for a given problem [9]. There are two practical approaches to this problem [10]: one is the growing methods [11, 12], which start with a minimal network and add new nodes during the training process; the other is the pruning methods, which start with a large network and then remove the unimportant nodes or weights. Depending on the techniques used for pruning, the pruning methods can be classified into the following groups [10, 13]: regularization term methods [14–16], cross-validation methods, magnitude-based methods, mutual information-based methods, evolutionary pruning methods and sensitivity analysis-based methods [17–20]. We focus on the regularization term methods in this paper.

“Weight decay” is perhaps the most popular regularization term method in the neural network field. By adding a term proportional to the square of the \(\ell _2\) norm of the network weights to the objective function, the network weights are driven toward zero during training with the gradient method (or other optimization methods). The effectiveness of the weight decay method has been shown experimentally by many authors [21], and the convergence of the gradient method with a weight decay term has been proved theoretically in [22–25]. However, as an \(\ell _2\) regularization, the weight decay method compels all the network weights toward zero because it cannot distinguish between important and unimportant weights. As a result, the weight decay method cannot produce sparse network weights and thus cannot prune the network efficiently. Another popular regularization term is the \(\ell _1\) norm [26] of the network weights. Recently, smoothing \(\ell _{1/2}\) regularization methods have been proposed for training feedforward neural networks [27, 28] and fuzzy neural networks [29]. According to regularization theory, \(\ell _p\) regularization methods tend to produce sparser results as \(p\rightarrow 0\). Thus, the \(\ell _0\) regularization is expected to be one of the most suitable choices for pruning a network. However, solving the \(\ell _0\) regularization problem is NP-hard [30], and it cannot directly make use of optimization algorithms such as the gradient method [31].

Motivated by the work on the smoothing \(\ell _{1/2}\) regularization method for FNNs, in this paper we use smoothing functions to approximate the \(\ell _0\) regularizer and derive a batch gradient method with smoothing \(\ell _0\) regularization (BGSL0) for training feedforward neural networks. We show how the smoothing \(\ell _0\) regularization helps the gradient method distinguish between important and unimportant weights. Weak and strong convergence results for the proposed algorithm are also established.

The remainder of this paper is organized as follows. The network structure and the derivation of BGSL0 are described in the next section. Section 3 shows how the smoothing \(\ell _0\) regularization helps the gradient method produce sparse results. In Sect. 4, we give some convergence results for the proposed algorithm. The performance of BGSL0 is compared with several typical \(\ell _p\) regularization methods on two examples in Sect. 5. We give the conclusion in Sect. 6. The proof of the convergence theorem is provided in the “Appendix”.

2 Network structure and BGSL0 algorithm

Without loss of generality, we consider a three-layered network consisting of K input nodes, L hidden nodes and 1 output node. Let \({\bf w}_0=(w_{01},w_{02},\ldots,w_{0L})^T\in {\mathbb {R}}^L\) be the weight vector between all the hidden units and the output unit, and \({\bf w}_l=(w_{l1},w_{l2},\ldots,w_{lK})^T\in {\mathbb {R}}^K\) be the weight vector between all the input units and the hidden unit \(l\,(l=1,2,\ldots,L)\). To simplify the presentation, we write all the weight parameters in a compact form, i.e., \( {\bf w}=({\bf w}_0^T,{\bf w}_1^T,\ldots, {\bf w}_L^T)^T\in{ \mathbb {R}}^{L+KL}\) and we define a matrix \({\bf V}=({\bf w}_1,{\bf w}_2,\ldots, {\bf w}_L)^{T}\in {\mathbb {R}}^{L\times K}\).

Given activation functions \(f,g:{\mathbb {R}}\rightarrow {\mathbb {R}}\) for the hidden layer and output layer, respectively, we define a vector function \({\bf F}({\bf x})=(f(x_1),f(x_2),\ldots,f(x_L))^T\) for \({\bf x}={(x_{1},x_{2},\ldots,x_{L})^T}\in {\mathbb {R}}^L\). For an input \(\varvec{\xi }\in {\mathbb {R}}^K\), the output vector of the hidden layer can be written as \({\bf F}({\bf V}\varvec{\xi })\) and the final output of the network can be written as

$$\begin{aligned} \zeta =g({\bf w}_0 \cdot {\bf F}({\bf V}\varvec{\xi })), \end{aligned}$$
(1)

where \({\bf w}_0 \cdot {\bf F}({\bf V}\varvec{\xi })\) represents the inner product between the two vectors \({\bf w}_0\) and \({\bf F}({\bf V}\varvec{\xi })\).
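To make the notation concrete, the following minimal sketch (illustrative Python/NumPy; \(\tanh\) stands in for the generic activations f and g, and none of the names come from the paper) computes the network output (1):

```python
import numpy as np

def forward(w0, V, xi, f=np.tanh, g=np.tanh):
    """Network output zeta = g(w0 . F(V xi)) as in Eq. (1).

    w0 : (L,)   weights between the hidden units and the output unit
    V  : (L, K) input-to-hidden weight matrix, whose rows are the w_l
    xi : (K,)   input sample
    """
    hidden = f(V @ xi)      # F(V xi), the output vector of the hidden layer
    return g(w0 @ hidden)   # inner product w0 . F(V xi), then the output activation
```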

Suppose that \(\{{\varvec{\xi } ^{j},O^{j}}\}_{j=1}^{J}\subset {\mathbb {R}}^K\times {\mathbb {R}}\) is a given set of training samples. The aim of the network training is to find the appropriate network weights \({\bf w}^*\) that can minimize the error function

$$\begin{aligned} {\mathcal {E}}({\bf w})=\frac{1}{2}\sum _{j=1}^{J}\left( O^j-g({\bf w} _0\cdot {\bf F}({\bf V}\varvec{\xi }^{j}))\right) ^2 \end{aligned}$$
(2)
$$\begin{aligned} =\sum _{j=1}^{J}e_{j}\left( {\bf w} _0\cdot {\bf F}({\bf V}\varvec{\xi }^j)\right), \end{aligned}$$
(3)

where \(e_j(t):=\frac{1}{2}(O^j-g(t))^2\). In order to prune the network and thus enhance its generalization ability, an \(\ell _p\) regularization term is often added to the error function. Therefore, the modified error function takes the form

$$\begin{aligned} E({\bf w} )={\mathcal {E}}({\bf w} ) +\lambda \Vert {\bf w} \Vert _p^p, \end{aligned}$$
(4)

where \(\lambda \) is the regularization coefficient balancing the tradeoff between the training accuracy and the network complexity, and \(\Vert {\bf w} \Vert _p^p\) is the \(\ell _p\) regularizer with the \(\ell _p\) norm \(\Vert \cdot \Vert _p\) defined by \(\Vert {\bf z}\Vert _p= (|z_1|^p+|z_2|^p+\cdots +|z_m|^p)^{1/p}\) for \({\bf z}=(z_1,z_2,\ldots,z_m)^T\). There are three common choices for p in the literature: \(p=2,1\) and \(\frac{1}{2}\). \(\ell _2\) regularization corresponds to “weight decay”, which efficiently controls the magnitude of the network weights but does not produce sparse weights. Though both the \(\ell _1\) and \(\ell _{\frac{1}{2}}\) regularizers can generate sparse results, the \(\ell _p\) regularizer that best satisfies the sparsity requirement is the \(\ell _0\) regularizer defined by

$$\begin{aligned} \Vert {\bf z}\Vert _0^0&= \mathop {\hbox {lim}}\limits _{p\rightarrow 0} \left( (|z_1|^p+|z_2|^p+\cdots +|z_m|^p)^{1/p}\right) ^p \\&= \mathop {\hbox {lim}}\limits _{p\rightarrow 0} \left( |z_1|^p+|z_2|^p+\cdots +|z_m|^p\right) \end{aligned}$$
(5)

for \({\bf z}=(z_1,z_2,\ldots,z_m)^T\). Assuming \(0^0=0\), \(\Vert {\bf z}\Vert _0^0\) equals the number of nonzero entries of the vector \({\bf z}\). However, minimizing (4) for \(p=0\) is an NP-hard problem in combinatorial optimization [30]. To solve this problem efficiently, we can use a continuous function of a vector variable

$$\begin{aligned} H_{\sigma }({\bf z})=\sum \limits _{i=1}^m h_{\sigma }(z_i) \end{aligned}$$
(6)

to approximate the \(\ell _0\) regularizer, where \(h_{\sigma }(\cdot )\) is continuously differentiable on \({\mathbb {R}}\) and satisfies the following condition

$$\begin{aligned} \mathop {\hbox {lim}}\limits _{\sigma \rightarrow 0} h_{\sigma } (t)=\left\{ \begin{array}{ll} 1, &{} {\text {if}}\,t\ne 0, \\ 0, &{} {\text {if}}\, t= 0. \end{array}\right. \end{aligned}$$
(7)

Here \(\sigma \) is a positive number that controls how closely \(H_{\sigma }({\bf z})\) approximates the \(\ell _0\) regularizer. According to (6) and (7), \(H_{\sigma }({\bf z})\) approaches the number of nonzero entries of the vector \({\bf z}\), i.e., the \(\ell _0\) regularizer, as \(\sigma \rightarrow 0\). A typical choice for \(h_{\sigma } (t) \) is

$$\begin{aligned} h_{\sigma } (t)=1-\exp \left( \frac{-t^2}{2\sigma ^2}\right) . \end{aligned}$$
(8)

There are also other candidates for \(h_{\sigma } (t)\), such as

$$\begin{aligned} 1-\frac{\sigma ^2}{t^2+\sigma ^2},\quad 1-\frac{\sin (t/\sigma )}{t/\sigma }, \end{aligned}$$
(9)

where the latter is defined to be 0 at the point \(t=0\). With the newly defined smoothing \(\ell _0\) regularizer, the corresponding error function is

$$\begin{aligned} E({\bf w} )={\mathcal {E}}({\bf w} )+\lambda H_{\sigma }({\bf w}). \end{aligned}$$
(10)
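As a quick sanity check, the sketch below (illustrative Python, not from the paper) implements the Gaussian choice (8) together with the regularizer (6) and shows that \(H_{\sigma }({\bf w})\) approaches the number of nonzero entries of \({\bf w}\) as \(\sigma \) decreases:

```python
import numpy as np

def h_sigma(t, sigma):
    """Gaussian smoothing function (8): 1 - exp(-t^2 / (2 sigma^2))."""
    return 1.0 - np.exp(-t ** 2 / (2.0 * sigma ** 2))

def H_sigma(w, sigma):
    """Smoothing l0 regularizer (6): sum of h_sigma over all weight entries."""
    return float(np.sum(h_sigma(w, sigma)))

w = np.array([0.0, 0.7, 0.0, -1.2, 0.05])   # two exact zeros, one small weight
for sigma in (1.0, 0.1, 0.01):
    print(sigma, H_sigma(w, sigma))
# As sigma decreases, H_sigma(w) tends to 3, the number of nonzero entries of w;
# the small entry 0.05 is only fully counted once sigma is well below 0.05.
```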

The gradient of the error function is given by

$$\begin{aligned} E_{\bf w} ({\bf w} )=\left( E_{{\bf w} _0}^T({\bf w}),E_{{\bf w} _1}^T({\bf w}),\ldots,E_{{\bf w}_L}^T({\bf w} )\right) ^T \end{aligned}$$
(11)

with

$$\begin{aligned} E_{{\bf w}_0}({\bf w})=\sum _{j=1}^{J}e_j'\left( {\bf w}_0\cdot {\bf F}({\bf V}\varvec{\xi }^j)\right) {\bf F}({\bf V}\varvec{\xi }^j)+\lambda H_{\sigma }^{\prime } ({\bf w}_0), \end{aligned}$$
(12a)
$$\begin{aligned} E_{{\bf w}_l}({\bf w})=\sum _{j=1}^{J}e_j'\left( {\bf w}_0\cdot {\bf F}({\bf V}\varvec{\xi }^j)\right) w_{0l}f'({\bf w}_l\cdot \varvec{\xi }^j)\varvec{\xi }^j+\lambda H_{\sigma }^{\prime } ({\bf w}_l),\quad l=1,2,\ldots,L, \end{aligned}$$
(12b)

where \(H_{\sigma }^{\prime } ({\bf z})\) is defined by \((h_{\sigma }^{\prime }(z_1),h_{\sigma }^{\prime }(z_2),\ldots,h_{\sigma }^{\prime }(z_m))^T\) for \({\bf z}=(z_1,z_2,\ldots,z_m)^T\). Starting from an arbitrary initial value \({\bf w}^0\), the batch gradient method with the smoothing \(\ell _0\) regularization term updates the weights \(\{{\bf w}^n\}\) iteratively by

$$\begin{aligned} {\bf w}^{n+1}&={\bf w}^{n}-\eta E_{\bf w} ({\bf w}^n ),\quad n=0,1,2,\ldots \end{aligned}$$
(13)

where \(\eta >0\) is the learning rate.
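A minimal sketch of one BGSL0 iteration, i.e., the gradients (12a)–(12b) followed by the update (13), is given below (illustrative Python/NumPy with \(\tanh\) activations; \(e_j^{\prime }(t)=-(O^j-g(t))g^{\prime }(t)\) follows from the definition of \(e_j\), and \(h_{\sigma }^{\prime }\) is that of the Gaussian choice (8)):

```python
import numpy as np

def bgsl0_step(w0, V, X, O, sigma, lam=0.001, eta=0.15):
    """One batch gradient update (13).  X: (J, K) inputs, O: (J,) targets."""
    f = g = np.tanh
    f_prime = g_prime = lambda t: 1.0 - np.tanh(t) ** 2
    h_prime = lambda t: (t / sigma ** 2) * np.exp(-t ** 2 / (2.0 * sigma ** 2))

    grad_w0 = lam * h_prime(w0)     # lambda * H'_sigma(w0) in (12a)
    grad_V = lam * h_prime(V)       # lambda * H'_sigma(w_l) in (12b), all rows at once
    for xi, o in zip(X, O):
        hidden_in = V @ xi                       # (w_l . xi) for every hidden unit l
        hidden = f(hidden_in)                    # F(V xi)
        e_prime = -(o - g(w0 @ hidden)) * g_prime(w0 @ hidden)
        grad_w0 += e_prime * hidden                                   # sum in (12a)
        grad_V += e_prime * np.outer(w0 * f_prime(hidden_in), xi)     # sum in (12b)
    return w0 - eta * grad_w0, V - eta * grad_V                       # update (13)
```

The gradient is accumulated over the whole training batch before a single weight update is made, which is what makes the method a batch (rather than online) gradient method.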

3 Sparsity

We first show why the smoothing \(\ell _0\) regularization term helps the batch gradient method distinguish between important and unimportant weights and remove the latter. In general, a weight with a large absolute value during the training process is assumed to be more important than a weight close to zero. According to (7), for sufficiently small \(\sigma \), there exists a positive constant \(t_0\) such that \(h_{\sigma }^{\prime }(t)\approx 0\) when \(|t|>t_0\). Thus, the more important weights, whose absolute values are larger than \(t_0\), are hardly affected by the regularization term. On the other hand, when \(0<|t|<t_0\), \(|h_{\sigma }^{\prime }(t)|\) is considerably large. Since \(t=0\) is the only minimum point of \(h_{\sigma }(t)\), the unimportant weights, whose absolute values are less than \(t_0\), are driven toward zero during the training process. This explains why the algorithm can achieve sparse results. The curves of \(h_{\sigma } (t)\) and \(h_{\sigma } ^{\prime } (t)\) are illustrated in Figs. 1 and 2, where \(h_{\sigma }(t)\) is set to be \(1-\exp (\frac{-t^2}{2\sigma ^2})\).

Fig. 1 The curve of \(h_{\sigma }(t)\) for different \(\sigma \)

Fig. 2 The curve of \(h_{\sigma }^{\prime }(t)\) for different \(\sigma \)
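The thresholding behaviour described above is easy to check numerically. In the illustrative sketch below (Gaussian choice (8), \(\sigma =0.1\)), \(|h_{\sigma }^{\prime }(t)|\) is negligible once \(|t|\) is several times larger than \(\sigma \), while weights with \(|t|\) of the order of \(\sigma \) or smaller feel a strong pull toward the minimum of \(h_{\sigma }\) at \(t=0\):

```python
import numpy as np

sigma = 0.1
h_prime = lambda t: (t / sigma ** 2) * np.exp(-t ** 2 / (2.0 * sigma ** 2))

for t in (1.0, 0.5, 0.1, 0.05, 0.01):
    print(f"t = {t:5.2f}   h'_sigma(t) = {h_prime(t):10.6f}")
# t = 1.0 or 0.5: the derivative is essentially zero, so large weights are untouched;
# t = 0.1, 0.05, 0.01: the derivative is large, so small weights are shrunk toward zero.
```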

In order to achieve the sparsest results, according to the above discussion, we would like as many weights as possible to fall into the interval \((-t_0,t_0)\). One way to achieve this is to set the initial weights to be very small. However, this leads to slow convergence at the beginning of training [32]. Another choice is to use the weight decay method as a brute-force way to shrink the network weights. In fact, for large \(\sigma \), the regularization term (6) behaves like weight decay regularization. For example, let \(h_{\sigma } (t)=1-\exp (\frac{-t^2}{2\sigma ^2})\). Expanding \(h_{\sigma } (t)\) in a Taylor series and omitting the higher-order terms, we have

$$\begin{aligned} 1-\exp \left( \frac{-t^2}{2\sigma ^2}\right) \approx \frac{t^2}{2\sigma ^2} \end{aligned}$$
(14)

if \(\sigma \) is sufficiently large. Thus, during the training process, the parameter \(\sigma \) should be set to a decreasing sequence that is large at the beginning and tends to zero as training proceeds.
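One simple way to realize such a schedule is geometric decay with a floor, as sketched below (an illustrative choice, not prescribed by the paper; the experiments in Sect. 5 additionally use a lower bound of 0.08 for \(\sigma \)):

```python
def sigma_schedule(sigma0=1.0, decay=0.999, sigma_min=0.08):
    """Yield a decreasing sequence of sigma values, floored at sigma_min."""
    sigma = sigma0
    while True:
        yield sigma
        sigma = max(sigma * decay, sigma_min)

# Early epochs: sigma is large and h_sigma(t) ~ t^2/(2 sigma^2) acts like weight decay,
# shrinking all weights; late epochs: sigma is small and H_sigma(w) approximates the
# l0 regularizer, so only the weights that stayed small keep being pushed to zero.
```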

4 Convergence

In this section, we will establish the convergence theorem of the proposed algorithm. The proof of the theorem is provided in the “Appendix”.

Let \(\Upphi =\{{\bf w}:\; E_{\bf w}({\bf w})=0\}\) be the stationary point set of the error function \(E({\bf w})\), and \(\Upphi _s=\{w_{lk}:{\bf w}=(w_{01},\ldots,w_{lk},\ldots,w_{LK})\in \Upphi \}\) be the projection of \(\Upphi \) onto the \(s\)th coordinate axis, where

$$\begin{aligned} s= \left\{ \begin{array}{ll} k, &{}{\text {if}}\, l= 0, \\ (l-1)K+L+k, &{} {\text {if}}\,l> 0, \end{array}\right. \end{aligned}$$
(15)

for \(s=1,\ldots,KL+L\). The following assumptions are needed for our convergence results.

  1. (A1)

    There exists a constant \(C_1\) such that \(\Vert {\bf w}^n\Vert \le C_1\) for all \(n=0,1,\ldots \).

  2. (A2)

    The functions f and g are twice differentiable on \({\mathbb {R}}\). Moreover, \(f,g,f^{\prime }, g^{\prime },f^{\prime \prime }\) and \(g^{\prime \prime }\) are uniformly bounded on \({\mathbb {R}}\).

  3. (A3)

    For any fixed positive parameter \(\sigma \in {\mathbb {R}}\), \(h_{\sigma }(t)\) is twice differentiable on \({\mathbb {R}}\), and its first and second derivatives are uniformly bounded on \({\mathbb {R}}\).

  4. (A4)

    The set \(\Upphi_s\) does not contain any interior point for every \(s=1,\ldots,KL+L\).

Remark 1

In practice, \(\sigma \) is set to a decreasing sequence that is large at the beginning. As shown in the previous section, the regularization term with large \(\sigma \) approximates the \(\ell _2\) regularizer. We have proved the boundedness of the network weights trained by the batch gradient method with the \(\ell _2\) regularization term [23]. Thus Assumption (A1) can easily be ensured. Assumption (A2) is satisfied by typical activation functions such as sigmoid functions. Assumption (A3) is not restrictive, as it is satisfied by the suggested functions in (8) and (9). Assumption (A4) is used to establish the strong convergence.

Now we present our convergence results for BGSL0.

Theorem 1

Suppose that the error function is given by (10), that the weight sequence \(\{{\bf w}^n\}\) is generated by the algorithm (13) for any initial value \({\bf w}^0\), that \(0<\eta <\frac{1}{C_2}\), where \(C_2\) is defined by (30), and that Assumptions (A1)–(A3) are valid. Then we have

$$\begin{aligned}&(a)\,E({\bf w}^{n+1})\le E({\bf w}^n),\quad n=0,1,2,\ldots ;&\end{aligned}$$
(16)
$$\begin{aligned}&(b)\, \text{ There } \text{ is } E^*>0 \text{ such } \text{ that } \mathop {\hbox {lim}}\limits _{n\rightarrow \infty } E({\bf w}^n)=E^*;\end{aligned}$$
(17)
$$\begin{aligned}&(c)\, \text{ There } \text{ holds } \text{ the } \text{ weak } \text{ convergence: } \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\left\| {E_{\bf w}({\bf w}^n)} \right\| =0. \end{aligned}$$
(18)

Moreover, if Assumption (A4) is valid, then there holds the strong convergence: There exists a point \({\bf w}^{*}\in \Upphi \) such that

$$\begin{aligned} (d)\quad \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf w}^n={\bf w}^{*}. \end{aligned}$$
(19)

Remark 2

According to Theorem 1 and the analysis in Sect. 3, after training we obtain a weight vector \({\bf w}\), many of whose elements are almost zero. By setting these elements to zero, that is, removing the corresponding weights from the network, the network is pruned.
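In code, this post-training pruning step amounts to simple thresholding; the sketch below is illustrative, and the threshold 0.01 matches the sparsity measure used in Sect. 5:

```python
import numpy as np

def prune(w, threshold=0.01):
    """Set near-zero weights to exactly zero, i.e., remove them from the network."""
    mask = np.abs(w) < threshold
    return np.where(mask, 0.0, w), int(mask.sum())  # pruned weights, number removed
```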

5 Simulation results

We empirically demonstrate the performance of BGSL0 in two experiments: a function approximation problem and sonar signal classification. Comparisons with the batch gradient method with \(\ell _2\) regularization (BGL2), the batch gradient method with \(\ell _1\) regularization (BGL1), and the batch gradient method with \(\ell _{1/2}\) regularization (BGL1/2) are also provided.

5.1 Function approximation problem

In this subsection, we consider using a neural network to approximate the following two-dimensional Gabor function

$$\begin{aligned} \frac{1}{2\pi (0.5)^2} \exp \left( {-\frac{x^2+y^2}{2\pi (0.5)^2}}\right) \cos (2\pi (x+y)). \end{aligned}$$
(20)

The training samples are 36 points selected from an evenly spaced \(6 \times 6\) grid on \(-0.5\le x \le 0\) and \(-0.5\le y \le 0\). Similarly, the testing samples are from an evenly spaced \(6\times 6\) grid on \(-0.5\le x \le 0\) and \(0\le y \le 0.5\).
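The sampling grids might be generated as in the sketch below (illustrative Python; the exact sampling used in the paper may differ in detail):

```python
import numpy as np

def gabor(x, y):
    """Gabor function (20)."""
    s2 = 2 * np.pi * 0.5 ** 2
    return np.exp(-(x ** 2 + y ** 2) / s2) * np.cos(2 * np.pi * (x + y)) / s2

def grid_samples(x_range, y_range, n=6):
    """n x n evenly spaced grid of inputs with the corresponding Gabor values."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    X = np.array([(x, y) for x in xs for y in ys])          # 36 input points
    return X, gabor(X[:, 0], X[:, 1])

X_train, O_train = grid_samples((-0.5, 0.0), (-0.5, 0.0))   # training region
X_test,  O_test  = grid_samples((-0.5, 0.0), (0.0, 0.5))    # testing region
```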

We set the network structure to be 3–20–1 (input, hidden and output nodes), and the transfer function for both the hidden and output layers to be \(\hbox{tansig}(\cdot )\) in MATLAB, which is a commonly used sigmoid function. We choose the learning rate \(\eta =0.15\), the regularization coefficient \(\lambda =0.001\), and \(h_{\sigma } (t)=1-\exp (\frac{-t^2}{2\sigma ^2})\), where the parameter \(\sigma \) is set to be a decreasing sequence. As shown in Fig. 2, when \(\sigma \) is too small, \(|h_{\sigma }^{\prime }(t)|\) may become very large, which may cause instability during the training procedure. Thus, we set a lower bound of 0.08 for \(\sigma \). The maximum number of training epochs is 30,000.

We use the number of weights whose absolute values are less than 0.01 after training to measure the sparsity achieved by an algorithm; a larger number means better sparsity. The testing errors and the sparsity of the network are compared in Table 1, which shows that BGSL0 is better than the other three algorithms in both sparsity and generalization ability. Figure 3 shows that the trained network approximates the Gabor function well in both the learning region and the testing region. The learning curves of BGSL0 are given in Fig. 4, which shows that, as the number of iterations increases, the gradient tends to zero and the square error function decreases monotonically and finally tends to a constant. This substantiates Theorem 1.

Table 1 Performance comparison for BGSL0, BGL2, BGL1, and BGL1/2
Fig. 3 Graphs of the Gabor function and the approximation performance of the network

Fig. 4 Learning curves of BGSL0 for the function approximation problem

5.2 Sonar signal classification

Sonar signal classification is a benchmark problem in the neural network field. The task is to train a network to discriminate between sonar returns bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The data set is publicly available from the UCI machine learning repository (http://archive.ics.uci.edu/ml/); it comprises 208 samples, each with 60 components. In this simulation, we randomly choose 164 samples for training and 44 samples for testing.
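One way to load and split the data is sketched below (illustrative Python; the file name sonar.all-data and the 60-feature CSV layout are those of the UCI repository, the one-hot labels match the two output nodes of the network used in this experiment, and the random split is not necessarily the paper's exact split):

```python
import numpy as np

# Each row of sonar.all-data: 60 real-valued features followed by a label, 'R' or 'M'.
raw = np.genfromtxt("sonar.all-data", delimiter=",", dtype=str)
X = raw[:, :60].astype(float)
labels = raw[:, 60]
Y = np.stack([labels == "M", labels == "R"], axis=1).astype(float)  # one-hot targets

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))                  # 208 samples in total
train_idx, test_idx = idx[:164], idx[164:]     # 164 for training, 44 for testing
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]
```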

In order to test the pruning ability of the algorithms, we start with a large network of structure 60–25–2. We set the transfer function for both the hidden and output layers to be \(\hbox{logsig}(\cdot )\) in MATLAB, which is a commonly used sigmoid function. We choose the learning rate \(\eta =0.08\), the regularization coefficient \(\lambda =0.06\), and \(h_{\sigma } (t)\) the same as in the previous example. The maximum number of training epochs is 4,000.

After training, the numbers of weights whose absolute values are less than 0.01 are 844, 30, 269 and 441 for BGSL0, BGL2, BGL1 and BGL1/2, respectively, while their testing accuracies are roughly the same. Thus BGSL0 produces sparser results than the others in this example. The gradient and training error curves are given in Fig. 5, which supports Theorem 1.

Fig. 5 Learning curves of BGSL0 for the sonar signal classification problem

6 Conclusion

The \(\ell _0\) regularization is expected to be an ideal pruning method for neural networks. However, solving the \(\ell _0\) regularization problem is NP-hard. In this paper, by approximating the \(\ell _0\) regularizer with smoothing functions, we propose a batch gradient method with smoothing \(\ell _0\) regularization for training feedforward neural networks. We have shown why the proposed method leads to sparse results and thus prunes the network efficiently. The convergence of the proposed algorithm is guaranteed by Theorem 1. Two examples show that our algorithm achieves better sparsity than three other typical \(\ell _p\) regularization methods. In the future, we will study the online gradient training method with a smoothing \(\ell _0\) regularization term.