1 Introduction

As universal approximators [1], multilayer feedforward neural networks (FNNs) [26] have been applied in many fields. According to approximation theory, a network with a simpler structure is preferred, because it generalizes better than a much more complex network that achieves the same training accuracy. Though many model selection methods have been proposed, such as AIC [7] and BIC [8], it is still difficult in theory to directly compute the exact optimal number of nodes that the network needs for a given problem [9]. There are two practical approaches to this problem [10]: one is the growing methods [11, 12], which start with a minimal network and add new nodes during the training process; the other is the pruning methods, which start with a large network and then remove the unimportant nodes or weights. Depending on the techniques used for pruning, the pruning methods can be classified into the following groups [10, 13]: regularization term methods [14–16], cross-validation methods, magnitude-based methods, mutual information-based methods, evolutionary pruning methods and sensitivity analysis-based methods [17–20]. We focus on the regularization term methods in this paper.

“Weight decay” is perhaps the most popular regularization term method in the neural network field. By adding a term proportional to the square of the \(\ell _2\) norm of the network weights to the objective function, the network weights are driven toward zero during training with the gradient method (or other optimization methods). The effectiveness of the weight decay method has been shown experimentally by many authors [21], and the convergence of the gradient method with a weight decay term has been proved theoretically in [22–25]. However, as an \(\ell _2\) regularization, the weight decay method compels all the network weights toward zero because it cannot distinguish between important and unimportant weights. As a result, the weight decay method cannot produce sparse network weights and thus cannot prune the network efficiently. Another popular regularization term is the \(\ell _1\) norm [26] of the network weights. Recently, smoothing \(\ell _{1/2}\) regularization methods have been proposed for training feedforward neural networks [27, 28] and fuzzy neural networks [29]. According to regularization theory, \(\ell _p\) regularization methods tend to produce sparser results as \(p\rightarrow 0\). Thus, the \(\ell _0\) regularization is expected to be one of the most suitable choices for pruning a network. However, solving the \(\ell _0\) regularization problem is NP-hard [30], and it cannot directly make use of optimization algorithms such as the gradient method [31].

Motivated by the work on the smoothing \(\ell _{1/2}\) regularization method for FNNs, in this paper we use smoothing functions to approximate the \(\ell _0\) regularizer and derive a batch gradient method with smoothing \(\ell _0\) regularization (BGSL0) for training feedforward neural networks. We show how the smoothing \(\ell _0\) regularization helps the gradient method distinguish between important and unimportant weights. Weak and strong convergence results for the proposed algorithm are also established.

The remainder of this paper is organized as follows. The network structure and the derivation of BGSL0 are described in the next section. Section 3 shows how the smoothing \(\ell _0\) regularization helps the gradient method produce sparse results. In Sect. 4, we give some convergence results for the proposed algorithm. The performance of BGSL0 is compared with several typical \(\ell _p\) regularization methods on two examples in Sect. 5. We give the conclusion in Sect. 6. The proof of the convergence theorem is provided in the “Appendix”.

2 Network structure and BGSL0 algorithm

Without loss of generality, we consider a three-layered network consisting of K input nodes, L hidden nodes and 1 output node. Let \({\bf w}_0=(w_{01},w_{02},\ldots,w_{0L})^T\in {\mathbb {R}}^L\) be the weight vector between all the hidden units and the output unit, and \({\bf w}_l=(w_{l1},w_{l2},\ldots,w_{lK})^T\in {\mathbb {R}}^K\) be the weight vector between all the input units and the hidden unit \(l\,(l=1,2,\ldots,L)\). To simplify the presentation, we write all the weight parameters in a compact form, i.e., \( {\bf w}=({\bf w}_0^T,{\bf w}_1^T,\ldots, {\bf w}_L^T)^T\in{ \mathbb {R}}^{L+KL}\) and we define a matrix \({\bf V}=({\bf w}_1,{\bf w}_2,\ldots, {\bf w}_L)^{T}\in {\mathbb {R}}^{L\times K}\).

Given activation functions \(f,g:{\mathbb {R}}\rightarrow {\mathbb {R}}\) for the hidden layer and output layer, respectively, we define a vector function \({\bf F}({\bf x})=(f(x_1),f(x_2),\ldots,f(x_L))^T\) for \({\bf x}={(x_{1},x_{2},\ldots,x_{L})^T}\in {\mathbb {R}}^L\). For an input \(\varvec{\xi }\in {\mathbb {R}}^K\), the output vector of the hidden layer can be written as \({\bf F}({\bf V}\varvec{\xi })\) and the final output of the network can be written as

$$\begin{aligned} \zeta =g({\bf w}_0 \cdot {\bf F}({\bf V}\varvec{\xi })), \end{aligned}$$
(1)

where \({\bf w}_0 \cdot {\bf F}({\bf V}\varvec{\xi })\) represents the inner product between the two vectors \({\bf w}_0\) and \({\bf F}({\bf V}\varvec{\xi })\).
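To make the notation concrete, the following minimal sketch (illustrative Python/NumPy; \(\tanh\) stands in for the generic activations f and g, and none of the names come from the paper) computes the network output (1):

```python
import numpy as np

def forward(w0, V, xi, f=np.tanh, g=np.tanh):
    """Network output zeta = g(w0 . F(V xi)) as in Eq. (1).

    w0 : (L,)   weights between the hidden units and the output unit
    V  : (L, K) input-to-hidden weight matrix, whose rows are the w_l
    xi : (K,)   input sample
    """
    hidden = f(V @ xi)      # F(V xi), the output vector of the hidden layer
    return g(w0 @ hidden)   # inner product w0 . F(V xi), then the output activation
```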

Suppose that \(\{{\varvec{\xi } ^{j},O^{j}}\}_{j=1}^{J}\subset {\mathbb {R}}^K\times {\mathbb {R}}\) is a given set of training samples. The aim of the network training is to find the appropriate network weights \({\bf w}^*\) that can minimize the error function

$$\begin{aligned} {\mathcal {E}}({\bf w})=\frac{1}{2}\sum _{j=1}^{J}\left( O^j-g({\bf w} _0\cdot {\bf F}({\bf V}\varvec{\xi }^{j}))\right) ^2 \end{aligned}$$
(2)
$$\begin{aligned} =\sum _{j=1}^{J}e_{j}\left( {\bf w} _0\cdot {\bf F}({\bf V}\varvec{\xi }^j)\right), \end{aligned}$$
(3)

where \(e_j(t):=\frac{1}{2}(O^j-g(t))^2\). In order to prune the network and thus enhance its generalization ability, an \(\ell _p\) regularization term is often added to the error function. Therefore, the modified error function takes the form

$$\begin{aligned} E({\bf w} )={\mathcal {E}}({\bf w} ) +\lambda \Vert {\bf w} \Vert _p^p, \end{aligned}$$
(4)

where \(\lambda \) is the regularization coefficient balancing the tradeoff between the training accuracy and the network complexity, and \(\Vert {\bf w} \Vert _p^p\) is the \(\ell _p\) regularizer with the \(\ell _p\) norm \(\Vert \cdot \Vert _p\) defined by \(\Vert {\bf z}\Vert _p= (|z_1|^p+|z_2|^p+\cdots +|z_m|^p)^{1/p}\) for \({\bf z}=(z_1,z_2,\ldots,z_m)^T\). There are three common choices for p in the literature: \(p=2,1\) and \(\frac{1}{2}\). \(\ell _2\) regularization corresponds to “weight decay”, which efficiently controls the magnitude of the network weights but does not produce sparse weights. Though both the \(\ell _1\) and \(\ell _{\frac{1}{2}}\) regularizers can generate sparse results, the \(\ell _p\) regularizer that best satisfies the sparsity requirement is the \(\ell _0\) regularizer defined by

$$\begin{aligned} \Vert {\bf z}\Vert _0^0&= \mathop {\hbox {lim}}\limits _{p\rightarrow 0} \left( (|z_1|^p+|z_2|^p+\cdots +|z_m|^p)^{1/p}\right) ^p \\&= \mathop {\hbox {lim}}\limits _{p\rightarrow 0} \left( |z_1|^p+|z_2|^p+\cdots +|z_m|^p\right) \end{aligned}$$
(5)

for \({\bf z}=(z_1,z_2,\ldots,z_m)^T\). Assuming \(0^0=0\), \(\Vert {\bf z}\Vert _0^0\) equals the number of nonzero entries of the vector \({\bf z}\). However, minimizing (4) for \(p=0\) is an NP-hard problem in combinatorial optimization [30]. To solve this problem efficiently, we can use a continuous function of a vector variable

$$\begin{aligned} H_{\sigma }({\bf z})=\sum \limits _{i=1}^m h_{\sigma }(z_i) \end{aligned}$$
(6)

to approximate the \(\ell _0\) regularizer, where \(h_{\sigma }(\cdot )\) is continuously differentiable on \({\mathbb {R}}\) and satisfies the following condition

$$\begin{aligned} \mathop {\hbox {lim}}\limits _{\sigma \rightarrow 0} h_{\sigma } (t)=\left\{ \begin{array}{ll} 1, &{} {\text {if}}\,t\ne 0, \\ 0, &{} {\text {if}}\, t= 0. \end{array}\right. \end{aligned}$$
(7)

Here \(\sigma \) is a positive number that controls how closely \(H_{\sigma }({\bf z})\) approximates the \(\ell _0\) regularizer. According to (6) and (7), \(H_{\sigma }({\bf z})\) approaches the number of nonzero entries of the vector \({\bf z}\), i.e., the \(\ell _0\) regularizer, as \(\sigma \rightarrow 0\). A typical choice for \(h_{\sigma } (t) \) is

$$\begin{aligned} h_{\sigma } (t)=1-\exp \left( \frac{-t^2}{2\sigma ^2}\right) . \end{aligned}$$
(8)

There are also other candidates for \(h_{\sigma } (t)\), such as

$$\begin{aligned} 1-\frac{\sigma ^2}{t^2+\sigma ^2},\quad 1-\frac{\sin (t/\sigma )}{t/\sigma }, \end{aligned}$$
(9)

where the latter is defined to be 0 at the point \(t=0\). With the newly defined smoothing \(\ell _0\) regularizer, the corresponding error function is

$$\begin{aligned} E({\bf w} )={\mathcal {E}}({\bf w} )+\lambda H_{\sigma }({\bf w}). \end{aligned}$$
(10)
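As a quick sanity check, the sketch below (illustrative Python, not from the paper) implements the Gaussian choice (8) together with the regularizer (6) and shows that \(H_{\sigma }({\bf w})\) approaches the number of nonzero entries of \({\bf w}\) as \(\sigma \) decreases:

```python
import numpy as np

def h_sigma(t, sigma):
    """Gaussian smoothing function (8): 1 - exp(-t^2 / (2 sigma^2))."""
    return 1.0 - np.exp(-t ** 2 / (2.0 * sigma ** 2))

def H_sigma(w, sigma):
    """Smoothing l0 regularizer (6): sum of h_sigma over all weight entries."""
    return float(np.sum(h_sigma(w, sigma)))

w = np.array([0.0, 0.7, 0.0, -1.2, 0.05])   # two exact zeros, one small weight
for sigma in (1.0, 0.1, 0.01):
    print(sigma, H_sigma(w, sigma))
# As sigma decreases, H_sigma(w) tends to 3, the number of nonzero entries of w;
# the small entry 0.05 is only fully counted once sigma is well below 0.05.
```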

The gradient of the error function is given by

$$\begin{aligned} E_{\bf w} ({\bf w} )=\left( E_{{\bf w} _0}^T({\bf w}),E_{{\bf w} _1}^T({\bf w}),\ldots,E_{{\bf w}_L}^T({\bf w} )\right) ^T \end{aligned}$$
(11)

with

$$\begin{aligned} E_{{\bf w}_0}({\bf w})=\sum _{j=1}^{J}e_j'\left( {\bf w}_0\cdot {\bf F}({\bf V}\varvec{\xi }^j)\right) {\bf F}({\bf V}\varvec{\xi }^j)+\lambda H_{\sigma }^{\prime } ({\bf w}_0), \end{aligned}$$
(12a)
$$\begin{aligned} E_{{\bf w}_l}({\bf w})=\sum _{j=1}^{J}e_j'\left( {\bf w}_0\cdot {\bf F}({\bf V}\varvec{\xi }^j)\right) w_{0l}f'({\bf w}_l\cdot \varvec{\xi }^j)\varvec{\xi }^j+\lambda H_{\sigma }^{\prime } ({\bf w}_l),\quad l=1,2,\ldots,L, \end{aligned}$$
(12b)

where \(H_{\sigma }^{\prime } ({\bf z})\) is defined by \((h_{\sigma }^{\prime }(z_1),h_{\sigma }^{\prime }(z_2),\ldots,h_{\sigma }^{\prime }(z_m))^T\) for \({\bf z}=(z_1,z_2,\ldots,z_m)^T\). Starting from an arbitrary initial value \({\bf w}^0\), the batch gradient method with the smoothing \(\ell _0\) regularization term updates the weights \(\{{\bf w}^n\}\) iteratively by

$$\begin{aligned} {\bf w}^{n+1}&={\bf w}^{n}-\eta E_{\bf w} ({\bf w}^n ),\quad n=0,1,2,\ldots \end{aligned}$$
(13)

where \(\eta >0\) is the learning rate.
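A minimal sketch of one BGSL0 iteration, i.e., the gradients (12a)–(12b) followed by the update (13), is given below (illustrative Python/NumPy with \(\tanh\) activations; \(e_j^{\prime }(t)=-(O^j-g(t))g^{\prime }(t)\) follows from the definition of \(e_j\), and \(h_{\sigma }^{\prime }\) is that of the Gaussian choice (8)):

```python
import numpy as np

def bgsl0_step(w0, V, X, O, sigma, lam=0.001, eta=0.15):
    """One batch gradient update (13).  X: (J, K) inputs, O: (J,) targets."""
    f = g = np.tanh
    f_prime = g_prime = lambda t: 1.0 - np.tanh(t) ** 2
    h_prime = lambda t: (t / sigma ** 2) * np.exp(-t ** 2 / (2.0 * sigma ** 2))

    grad_w0 = lam * h_prime(w0)     # lambda * H'_sigma(w0) in (12a)
    grad_V = lam * h_prime(V)       # lambda * H'_sigma(w_l) in (12b), all rows at once
    for xi, o in zip(X, O):
        hidden_in = V @ xi                       # (w_l . xi) for every hidden unit l
        hidden = f(hidden_in)                    # F(V xi)
        e_prime = -(o - g(w0 @ hidden)) * g_prime(w0 @ hidden)
        grad_w0 += e_prime * hidden                                   # sum in (12a)
        grad_V += e_prime * np.outer(w0 * f_prime(hidden_in), xi)     # sum in (12b)
    return w0 - eta * grad_w0, V - eta * grad_V                       # update (13)
```

The gradient is accumulated over the whole training batch before a single weight update is made, which is what makes the method a batch (rather than online) gradient method.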

3 Sparsity

We first show why the smoothing \(\ell _0\) regularization term helps the batch gradient method distinguish between important and unimportant weights and remove the latter. In general, a weight with a large absolute value during the training process is assumed to be more important than a weight close to zero. According to (7), for sufficiently small \(\sigma \), there exists a positive constant \(t_0\) such that \(h_{\sigma }^{\prime }(t)\approx 0\) when \(|t|>t_0\). Thus, the more important weights, whose absolute values are larger than \(t_0\), are hardly affected by the regularization term. On the other hand, when \(0<|t|<t_0\), \(|h_{\sigma }^{\prime }(t)|\) is considerably large. Since \(t=0\) is the only minimum point of \(h_{\sigma }(t)\), the unimportant weights, whose absolute values are less than \(t_0\), are driven toward zero during the training process. This explains why the algorithm can achieve sparse results. The curves of \(h_{\sigma } (t)\) and \(h_{\sigma } ^{\prime } (t)\) are illustrated in Figs. 1 and 2, where \(h_{\sigma }(t)\) is set to be \(1-\exp (\frac{-t^2}{2\sigma ^2})\).

Fig. 1 The curve of \(h_{\sigma }(t)\) for different \(\sigma \)

Fig. 2 The curve of \(h_{\sigma }^{\prime }(t)\) for different \(\sigma \)
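The thresholding behaviour described above is easy to check numerically. In the illustrative sketch below (Gaussian choice (8), \(\sigma =0.1\)), \(|h_{\sigma }^{\prime }(t)|\) is negligible once \(|t|\) is several times larger than \(\sigma \), while weights with \(|t|\) of the order of \(\sigma \) or smaller feel a strong pull toward the minimum of \(h_{\sigma }\) at \(t=0\):

```python
import numpy as np

sigma = 0.1
h_prime = lambda t: (t / sigma ** 2) * np.exp(-t ** 2 / (2.0 * sigma ** 2))

for t in (1.0, 0.5, 0.1, 0.05, 0.01):
    print(f"t = {t:5.2f}   h'_sigma(t) = {h_prime(t):10.6f}")
# t = 1.0 or 0.5: the derivative is essentially zero, so large weights are untouched;
# t = 0.1, 0.05, 0.01: the derivative is large, so small weights are shrunk toward zero.
```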

In order to achieve the sparsest results, according to the above discussion, we would like as many weights as possible to fall into the interval \((-t_0,t_0)\). One way to achieve this is to set the initial weights to be very small. However, this leads to slow convergence at the beginning of training [32]. Another choice is to use the weight decay method as a brute-force way to shrink the network weights. In fact, for large \(\sigma \), the regularization term (6) behaves like weight decay regularization. For example, let \(h_{\sigma } (t)=1-\exp (\frac{-t^2}{2\sigma ^2})\). Expanding \(h_{\sigma } (t)\) in a Taylor series and omitting the higher-order terms, we have

$$\begin{aligned} 1-\exp \left( \frac{-t^2}{2\sigma ^2}\right) \approx \frac{t^2}{2\sigma ^2} \end{aligned}$$
(14)

if \(\sigma \) is sufficiently large. Thus, during the training process, the parameter \(\sigma \) should be set to a decreasing sequence that is large at the beginning and tends to zero as training proceeds.
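One simple way to realize such a schedule is geometric decay with a floor, as sketched below (an illustrative choice, not prescribed by the paper; the experiments in Sect. 5 additionally use a lower bound of 0.08 for \(\sigma \)):

```python
def sigma_schedule(sigma0=1.0, decay=0.999, sigma_min=0.08):
    """Yield a decreasing sequence of sigma values, floored at sigma_min."""
    sigma = sigma0
    while True:
        yield sigma
        sigma = max(sigma * decay, sigma_min)

# Early epochs: sigma is large and h_sigma(t) ~ t^2/(2 sigma^2) acts like weight decay,
# shrinking all weights; late epochs: sigma is small and H_sigma(w) approximates the
# l0 regularizer, so only the weights that stayed small keep being pushed to zero.
```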

4 Convergence

In this section, we will establish the convergence theorem of the proposed algorithm. The proof of the theorem is provided in the “Appendix”.

Let \(\Upphi =\{{\bf w}:\; E_{\bf w}({\bf w})=0\}\) be the stationary point set of the error function \(E({\bf w})\), and \(\Upphi _s=\{w_{lk}:{\bf w}=(w_{01},\ldots,w_{lk},\ldots,w_{LK})\in \Upphi \}\) be the projection of \(\Upphi \) onto the \(s\)th coordinate axis, where

$$\begin{aligned} s= \left\{ \begin{array}{ll} k, &{}{\text {if}}\, l= 0, \\ (l-1)K+L+k, &{} {\text {if}}\,l> 0, \end{array}\right. \end{aligned}$$
(15)

for \(s=1,\ldots,KL+L\). The following assumptions are needed for our convergence results.

  1. (A1)

    There exists a constant \(C_1\) such that \(\Vert {\bf w}^n\Vert \le C_1\) for all \(n=0,1,\ldots \).

  2. (A2)

    The functions f and g are twice differentiable on \({\mathbb {R}}\). Moreover, \(f,g,f^{\prime }, g^{\prime },f^{\prime \prime }\) and \(g^{\prime \prime }\) are uniformly bounded on \({\mathbb {R}}\).

  3. (A3)

    For any fixed positive parameter \(\sigma \in {\mathbb {R}}\), \(h_{\sigma }(t)\) is twice differentiable on \({\mathbb {R}}\), and its first and second derivatives are uniformly bounded on \({\mathbb {R}}\).

  4. (A4)

    The set \(\Upphi_s\) does not contain any interior point for every \(s=1,\ldots,KL+L\).

Remark 1

In practice, \(\sigma \) is set to a decreasing sequence that is large at the beginning. As shown in the previous section, the regularization term with large \(\sigma \) approximates the \(\ell _2\) regularizer. We have proved the boundedness of the network weights trained by the batch gradient method with the \(\ell _2\) regularization term [23]. Thus Assumption (A1) can easily be ensured. Assumption (A2) is satisfied by typical activation functions such as sigmoid functions. Assumption (A3) is not restrictive, as it is satisfied by the suggested functions in (8) and (9). Assumption (A4) is used to establish the strong convergence.

Now we present our convergence results for BGSL0.

Theorem 1

Suppose that the error function is given by (10), that the weight sequence \(\{{\bf w}^n\}\) is generated by the algorithm (13) for any initial value \({\bf w}^0\), that \(0<\eta <\frac{1}{C_2}\), where \(C_2\) is defined by (30), and that Assumptions (A1)–(A3) are valid. Then we have

$$\begin{aligned}&(a)\,E({\bf w}^{n+1})\le E({\bf w}^n),\quad n=0,1,2,\ldots ;&\end{aligned}$$
(16)
$$\begin{aligned}&(b)\, \text{ There } \text{ is } E^*>0 \text{ such } \text{ that } \mathop {\hbox {lim}}\limits _{n\rightarrow \infty } E({\bf w}^n)=E^*;\end{aligned}$$
(17)
$$\begin{aligned}&(c)\, \text{ There } \text{ holds } \text{ the } \text{ weak } \text{ convergence: } \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }\left\| {E_{\bf w}({\bf w}^n)} \right\| =0. \end{aligned}$$
(18)

Moreover, if Assumption (A4) is valid, then there holds the strong convergence: There exists a point \({\bf w}^{*}\in \Upphi \) such that

$$\begin{aligned} (d)\quad \mathop {\hbox {lim}}\limits _{n\rightarrow \infty }{\bf w}^n={\bf w}^{*}. \end{aligned}$$
(19)

Remark 2

According to Theorem 1 and the analysis in Sect. 3, after training we obtain a weight vector \({\bf w}\), many of whose elements are almost zero. By setting these elements to zero, that is, removing the corresponding weights from the network, the network is pruned.
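In code, this post-training pruning step amounts to simple thresholding; the sketch below is illustrative, and the threshold 0.01 matches the sparsity measure used in Sect. 5:

```python
import numpy as np

def prune(w, threshold=0.01):
    """Set near-zero weights to exactly zero, i.e., remove them from the network."""
    mask = np.abs(w) < threshold
    return np.where(mask, 0.0, w), int(mask.sum())  # pruned weights, number removed
```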

5 Simulation results

We empirically demonstrate the performance of BGSL0 in two experiments: a function approximation problem and sonar signal classification. Comparisons with the batch gradient method with \(\ell _2\) regularization (BGL2), the batch gradient method with \(\ell _1\) regularization (BGL1), and the batch gradient method with \(\ell _{1/2}\) regularization (BGL1/2) are also provided.

5.1 Function approximation problem

In this subsection, we consider using a neural network to approximate the following two-dimensional Gabor function

$$\begin{aligned} \frac{1}{2\pi (0.5)^2} \exp \left( {-\frac{x^2+y^2}{2\pi (0.5)^2}}\right) \cos (2\pi (x+y)). \end{aligned}$$
(20)

The training samples are 36 points selected from an evenly spaced \(6 \times 6\) grid on \(-0.5\le x \le 0\) and \(-0.5\le y \le 0\). Similarly, the testing samples are from an evenly spaced \(6\times 6\) grid on \(-0.5\le x \le 0\) and \(0\le y \le 0.5\).
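The sampling grids might be generated as in the sketch below (illustrative Python; the exact sampling used in the paper may differ in detail):

```python
import numpy as np

def gabor(x, y):
    """Gabor function (20)."""
    s2 = 2 * np.pi * 0.5 ** 2
    return np.exp(-(x ** 2 + y ** 2) / s2) * np.cos(2 * np.pi * (x + y)) / s2

def grid_samples(x_range, y_range, n=6):
    """n x n evenly spaced grid of inputs with the corresponding Gabor values."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    X = np.array([(x, y) for x in xs for y in ys])          # 36 input points
    return X, gabor(X[:, 0], X[:, 1])

X_train, O_train = grid_samples((-0.5, 0.0), (-0.5, 0.0))   # training region
X_test,  O_test  = grid_samples((-0.5, 0.0), (0.0, 0.5))    # testing region
```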

We set the network structure to be 3–20–1 (input, hidden and output nodes), and the transfer function for both the hidden and output layers to be \(\hbox{tansig}(\cdot )\) in MATLAB, which is a commonly used sigmoid function. We choose the learning rate \(\eta =0.15\), the regularization coefficient \(\lambda =0.001\), and \(h_{\sigma } (t)=1-\exp (\frac{-t^2}{2\sigma ^2})\), where the parameter \(\sigma \) is set to be a decreasing sequence. As shown in Fig. 2, when \(\sigma \) is too small, \(|h_{\sigma }^{\prime }(t)|\) may become very large, which may cause instability during the training procedure. Thus, we set a lower bound of 0.08 for \(\sigma \). The maximum number of training epochs is 30,000.

We use the number of weights whose absolute values are less than 0.01 after training to measure the sparsity achieved by an algorithm; a larger number means better sparsity. The testing errors and the sparsity of the network are compared in Table 1, which shows that BGSL0 is better than the other three algorithms in both sparsity and generalization ability. Figure 3 shows that the trained network approximates the Gabor function well in both the learning region and the testing region. The learning curves of BGSL0 are given in Fig. 4, which shows that, as the number of iterations increases, the gradient tends to zero and the square error function decreases monotonically and finally tends to a constant. This substantiates Theorem 1.

Table 1 Performance comparison for BGSL0, BGL2, BGL1, and BGL1/2
Fig. 3 Graphs of the Gabor function and the approximation performance of the network

Fig. 4 Learning curves of BGSL0 for the function approximation problem

5.2 Sonar signal classification

Sonar signal classification is a benchmark problem in the neural network field. The task is to train a network to discriminate between sonar returns bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The data set is publicly available from the UCI machine learning repository (http://archive.ics.uci.edu/ml/); it comprises 208 samples, each with 60 components. In this simulation, we randomly choose 164 samples for training and 44 samples for testing.
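One way to load and split the data is sketched below (illustrative Python; the file name sonar.all-data and the 60-feature CSV layout are those of the UCI repository, the one-hot labels match the two output nodes of the network used in this experiment, and the random split is not necessarily the paper's exact split):

```python
import numpy as np

# Each row of sonar.all-data: 60 real-valued features followed by a label, 'R' or 'M'.
raw = np.genfromtxt("sonar.all-data", delimiter=",", dtype=str)
X = raw[:, :60].astype(float)
labels = raw[:, 60]
Y = np.stack([labels == "M", labels == "R"], axis=1).astype(float)  # one-hot targets

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))                  # 208 samples in total
train_idx, test_idx = idx[:164], idx[164:]     # 164 for training, 44 for testing
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]
```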

In order to test the pruning ability of the algorithms, we start with a large network of structure 60–25–2. We set the transfer function for both the hidden and output layers to be \(\hbox{logsig}(\cdot )\) in MATLAB, which is a commonly used sigmoid function. We choose the learning rate \(\eta =0.08\), the regularization coefficient \(\lambda =0.06\), and \(h_{\sigma } (t)\) the same as in the previous example. The maximum number of training epochs is 4,000.

After training, the numbers of weights whose absolute values are less than 0.01 are 844, 30, 269 and 441 for BGSL0, BGL2, BGL1 and BGL1/2, respectively, while their testing accuracies are roughly the same. Thus BGSL0 produces sparser results than the others in this example. The gradient and training error curves are given in Fig. 5, which supports Theorem 1.

Fig. 5 Learning curves of BGSL0 for the sonar signal classification problem

6 Conclusion

The \(\ell _0\) regularization is expected to be an ideal pruning method for neural networks. However, solving the \(\ell _0\) regularization problem is NP-hard. In this paper, by approximating the \(\ell _0\) regularizer with smoothing functions, we propose a batch gradient method with smoothing \(\ell _0\) regularization for training feedforward neural networks. We have shown why the proposed method leads to sparse results and thus prunes the network efficiently. The convergence of the proposed algorithm is guaranteed by Theorem 1. Two examples show that our algorithm achieves better sparsity than three other typical \(\ell _p\) regularization methods. In the future, we will study the online gradient training method with a smoothing \(\ell _0\) regularization term.