Abstract
This paper proposes a new conjugate gradient method with smoothing \(L_{1/2} \) regularization, based on a modified secant equation, for training neural networks, in which a descent search direction is generated and an adaptive learning rate is selected according to the strong Wolfe conditions. Two adaptive parameters are introduced so that the new training method possesses both the quasi-Newton property and the sufficient descent property. Numerical experiments on five benchmark classification problems from the UCI repository show that, compared with other conjugate gradient training algorithms, the new algorithm has roughly the same or even better learning capacity, and significantly better generalization capacity and network sparsity. Under mild assumptions, global convergence of the proposed training method is also proved.
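To make the training scheme described above concrete, the following sketch (not the paper's implementation; all function names and constants here are our own) shows a conjugate gradient weight update in which the learning rate is chosen by a bisection-style strong Wolfe line search and the direction is rebuilt with the classical Dai–Yuan beta, one of the choices the proposed sub-algorithm falls back on:

```python
import numpy as np

def strong_wolfe_step(f, grad, w, d, c1=1e-4, c2=0.4, max_iter=60):
    """Bisection search for a learning rate eta satisfying the strong
    Wolfe conditions along the descent direction d (a sketch only)."""
    phi0 = f(w)
    dphi0 = grad(w) @ d              # must be negative for a descent direction
    lo, hi, eta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        phi = f(w + eta * d)
        dphi = grad(w + eta * d) @ d
        if phi > phi0 + c1 * eta * dphi0:
            hi = eta                 # sufficient-decrease condition fails: shrink
        elif abs(dphi) > c2 * abs(dphi0):
            if dphi < 0:
                lo = eta             # slope still strongly negative: enlarge
            else:
                hi = eta
        else:
            return eta               # both strong Wolfe conditions hold
        eta = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * eta
    return eta

def cg_train(f, grad, w0, tol=1e-6, max_epoch=200):
    """Conjugate gradient iteration with the Dai-Yuan beta as an
    illustration of one branch of the method."""
    w = w0.copy()
    g = grad(w)
    d = -g
    for _ in range(max_epoch):
        if np.linalg.norm(g) < tol:
            break
        eta = strong_wolfe_step(f, grad, w, d)
        w_new = w + eta * d
        g_new = grad(w_new)
        y = g_new - g                          # gradient difference y_{k-1}
        beta_dy = (g_new @ g_new) / (d @ y)    # Dai-Yuan formula
        d = -g_new + beta_dy * d
        w, g = w_new, g_new
    return w
```

In the paper the objective \(E\) would additionally contain the smoothed \(L_{1/2}\) penalty term; a plain smooth objective is used here so the sketch stays self-contained.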
References
Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
Hmich A, Badri A, Sahel A (2011) Automatic speaker identification by using the neural network. In: Proceeding of 2011 IEEE international conference on multimedia computing and systems (ICMCS), pp 1–5
Zhou W, Zurada JM (2010) Competitive layer model of discrete-time recurrent neural networks with LT neurons. Neural Comput 22:2137–2160
Li EY (1994) Artificial neural networks and their business applications. Inf Manag 27:303–313
Svozil D, Kvasnicka V, Pospichal J (1997) Introduction to multi-layer feed-forward neural networks. Chemom Intell Lab 39(1):43–62
Wu W (2003) Computation of neural networks. Higher Education Press, Beijing
Xu ZB, Chang XY, Xu FM, Zhang H (2012) \(L_{1/2}\) regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Netw Learn Syst 23(7):1013–1027
Wu W, Fan QW, Zurada JM, Wang J, Yang DK, Liu Y (2014) Batch gradient method with smoothing \(L_{1/2}\) regularization for training of feedforward neural networks. Neural Netw 50:72–78
Reed R (1993) Pruning algorithms-a survey. IEEE Trans Neural Netw 4:740–747
Sakar A, Mammone RJ (1993) Growing and pruning neural tree networks. IEEE Trans Comput 42(3):291–299
Arribas JI, Cid-Sueiro J (2005) A model selection algorithm for a posteriori probability estimation with neural networks. IEEE Trans Neural Netw 16(4):799–809
Hinton G (1989) Connectionist learning procedures. Artif Intell 40:185–235
Hertz J, Krogh A, Palmer R (1991) Introduction to the theory of neural computation. Addison Wesley, Redwood City
Wang C, Venkatesh SS, Judd JS (1994) Optimal stopping and effective machine complexity in learning. Adv Neural Inf Process Syst 6:303–310
Bishop CM (1995) Regularization and complexity control in feedforward networks. In: Proceedings of international conference on artificial neural networks ICANN’95. EC2 et Cie, pp 141–148
Nowlan SJ, Hinton GE (1992) Simplifying neural networks by soft weight-sharing. Neural Comput 4(4):473–493
Fogel DB (1991) An information criterion for optimal neural network selection. IEEE Trans Neural Netw 2(2):490–497
Seghouane AK, Amari SI (2007) The AIC criterion and symmetrizing the Kullback–Leibler divergence. IEEE Trans Neural Netw 18(1):97–106
Ishikawa M (1996) Structural learning with forgetting. Neural Netw 9:509–521
Mc Loone S, Irwin G (2001) Improving neural network training solutions using regularisation. Neurocomputing 37:71–90
Saito K, Nakano R (2000) Second-order learning algorithm with squared penalty term. Neural Comput 12:709–729
Wu W, Shao HM, Li ZX (2006) Convergence of batch BP algorithm with penalty for FNN training. In: King I, Wang J, Chan L, Wang DL (eds) Neural information processing. Springer, Berlin, pp 562–569
Cid-Sueiro J, Arribas JI, Urbn-Muñoz S, Figueiras-Vidal AR (1999) Cost functions to estimate a posteriori probabilities in multiclass problems. IEEE Trans Neural Netw 10(3):645–656
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Natarajan BK (1995) Sparse approximate solutions to linear systems. SIAM J Comput 24(2):227–234
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58:267–288
Yuan GX, Ho CH, Lin CJ (2012) Recent advances of large-scale linear classification. Proc IEEE 100(9):2584–2603
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67:301–320
Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, Heidelberg
Donoho DL (2006) For most large underdetermined systems of linear equations the minimal \(l_1\)-norm solution is also the sparsest solution. Commun Pure Appl Math 59(6):797–829
Xu ZB, Zhang H, Wang Y, Chang XY (2010) \(L_{1/2}\) regularizer. Sci China Ser F Inf Sci 52:1–9
Igel C, Hüsken M (2003) Empirical evaluation of the improved Rprop learning algorithms. Neurocomputing 50:105–123
Rumelhart DE, McClelland JL, PDP Research Group (1986) Parallel distributed processing: explorations in the microstructure of cognition: psychological and biological models. MIT Press, Cambridge
Vogl TP, Mangis JK, Rigler AK, Zink WT, Alkon DL (1988) Accelerating the convergence of the back-propagation method. Biol Cybern 59:257–263
Shao H, Zheng G (2011) Convergence analysis of a back-propagation algorithm with adaptive momentum. Neurocomputing 74:749–752
Sun W, Yuan Y (2006) Optimization theory and methods nonlinear programming. Springer, New York
Livieris IE, Pintelas P (2013) A new conjugate gradient algorithm for training neural networks based on a modified secant equation. Appl Math Comput 221:491–502
Jiang M, Gielen G, Zhang B, Luo Z (2003) Fast learning algorithms for feedforward neural networks. Appl Intell 18:37–54
Battiti R (1989) Accelerated backpropagation learning: two optimization methods. Complex Syst 3(4):331–342
Battiti R (1992) First- and second-order methods for learning: between steepest descent and Newton’s method. Neural Comput 4(2):141–166
Johansson EM, Dowla FU, Goodman DM (1991) Backpropagation learning for multilayer feed-forward neural networks using the conjugate gradient method. Int J Neural Syst 2(4):291–301
Charalambous C (1992) Conjugate gradient algorithm for efficient training of artificial neural networks. Circuits Device Syst IEE Proc G 139(3):301–310
Adeli H, Hung SL (1994) An adaptive conjugate gradient learning algorithm for efficient training of neural networks. Appl Math Comput 62:81–102
Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6:525–533
Kostopoulos AE, Grapsa TN (2009) Self-scaled conjugate gradient training algorithms. Neurocomputing 72:3000–3019
Barzilai J, Borwein JM (1988) Two point step size gradient methods. IMA J Numer Anal 8:141–148
Wang J, Wu W, Zurada JM (2011) Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing 74:2368–2376
Dai YH, Yuan Y (1999) A nonlinear conjugate gradient method with a strong global convergence property. SIAM J Optim 10:177–182
Xu C, Zhang J (2001) A survey of quasi-Newton equations and quasi-Newton methods for optimization. Ann Oper Res 103:213–234
Yabe H, Sakaiwa N (2005) A new nonlinear conjugate gradient method for unconstrained optimization. J Oper Res Soc Jpn Keiei Kagaku 48:284–296
Li WY, Wu W (2015) A parameter conjugate gradient method based on secant equation for unconstrained optimization. J Inf Comput Sci 12(16):5865–5871
Fan Q, Zurada JM, Wu W (2014) Convergence of online gradient method for feedforward neural networks with smoothing \(L_{1/2}\) regularization penalty. Neurocomputing 131:208–216
Liu Y, Wu W, Fan Q, Yang D, Wang J (2014) A modified gradient learning algorithm with smoothing \(L_{1/2}\) regularization for Takagi–Sugeno fuzzy models. Neurocomputing 138:229–237
Dai YH, Yuan Y (2001) An efficient hybrid conjugate gradient method for unconstrained optimization. Ann Oper Res 103:33–47
Blake C, Merz C (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLReposito
Acknowledgements
The authors thank the editor and the anonymous reviewers for their careful reading and thoughtful comments. This research is supported by the National Natural Science Foundation of China (Projects 11171367, 91230103, 61473059 and 61403056), and the Fundamental Research Funds for the Central Universities of China.
Appendix
Lemma 1
Suppose that Assumptions (A1) and (A2) hold. Then \(\nabla E(\mathbf {W})\) is Lipschitz continuous in a neighborhood S of the level set \(\varGamma \), i.e., there exists a constant \(L\ge 0\) such that
Proof
Note that \(\mathbf{G}(\mathbf{U}{\varvec{\zeta }}^{j})=(\mathbf{G}({ \mathbf{w}^{T}_\mathbf{1}}{\varvec{\zeta }}^{j}),\mathbf{G}({ \mathbf{w}^{T}_\mathbf{2}}{\varvec{\zeta }}^{j}),\ldots ,\mathbf{G}({ \mathbf{w}^{T}_\mathbf{q}}{\varvec{\zeta }}^{j}))^T. \) By Lagrange mean value theorem and Cauchy–Buniakowsky–Schwarz inequality, we have
where \(\xi _i\in ({\mathbf{w^1_i}}^T{\varvec{\zeta }^j}, {\mathbf{w^2_i}}^T{\varvec{\zeta }^j}), i=1, 2, \ldots , q.\)
From the property of the activation functions, we know that \(\mathbf{G}(\mathbf{U}{\varvec{\zeta }^j})\), \(\mathbf{h'_j}(\mathbf{w_0}\cdot \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}))\) and \( g'(\mathbf{w_i}\cdot {\varvec{\zeta }^j})\) are bounded in S, where \(j=1, 2, \ldots , J\) and \( i=1, 2, \ldots , q\). This means that there exists a constant M such that \(\Vert \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j})\Vert \le M\), \(\Vert \mathbf{h'_j}(\mathbf{w_0}\cdot \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}))\Vert \le M\) and \(|g'(\mathbf{w_i}\cdot {\varvec{\zeta }^j})|\le M\). Under Assumptions (A1) and (A2), by (32), we have
where \(N_1=L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \Vert \mathbf{w^2_0}\Vert +M, L_3=\sqrt{q+1}\max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}.\)
Using (33), we have
Similarly we have for some constant \(L_4 \) that
where \(i=1, 2, \ldots , q.\)
By (14) and Lagrange mean value theorem, when the positive constant \(\nu \) is sufficiently small, we have for any \(\mathbf {x}\) and \(\mathbf {y}\) that
where \(L_5=(\frac{2}{\nu ^2}+\frac{4}{9\nu ^2}\sqrt{6\nu }).\)
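The constant \(L_5\) above stems from the smoothing of the \(L_{1/2}\) regularizer near the origin. As an illustration of the idea (following the piecewise-polynomial construction of Wu et al. [12]; the specific polynomial below is our assumption and is not copied from this paper), one can replace \(|w|\) by a \(C^2\) polynomial on \((-\nu ,\nu )\) before taking the square root, which keeps the penalty differentiable with a bounded gradient:

```python
import numpy as np

def smooth_abs(x, nu):
    """C^2 piecewise-polynomial smoothing of |x| on (-nu, nu):
    it matches |x| and its first two derivatives at x = +/-nu,
    and is strictly positive (3*nu/8) at the origin."""
    p = -x**4 / (8 * nu**3) + 3 * x**2 / (4 * nu) + 3 * nu / 8
    return np.where(np.abs(x) >= nu, np.abs(x), p)

def smooth_l12(w, nu=1e-2):
    """Smoothed L_{1/2} penalty: sum_i smooth_abs(w_i)^(1/2).
    Away from the origin this coincides with sum_i |w_i|^(1/2)."""
    return np.sum(np.sqrt(smooth_abs(np.asarray(w, dtype=float), nu)))
```

Because the smoothed absolute value is bounded below by \(3\nu /8\) on \((-\nu ,\nu )\), the square root has a bounded derivative there, which is what makes a Lipschitz bound of the kind used in Lemma 1 possible.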
Using (36), we have
From (34), (35) and (37), we have
where the Lipschitz constant \(L=JL_3+qJL_4+q(1+p)L_5.\) This completes the proof. \(\square \)
The following lemma shows that our training algorithm always generates descent directions.
Lemma 2
Let the sequence \(\{\mathbf{W}_\mathbf{k}\}\) be generated by Algorithm MDY. Then the following estimate holds for all \(k\ge 0\):
Proof
Let us prove the lemma by induction. It is obvious that \(\nabla E_k^T\mathbf{d_k}=-\Vert \nabla E_k\Vert ^2<0\) holds for \(k=0\).
By Sub-algorithm MDY, there are three cases for the formula \(\beta _k\): \(\beta _k=\beta _k^{MDY}\), \(\beta _k=\beta _k^{DY}\), and \(\beta _k=|\beta _k^{HS}|\).
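For reference, two of these cases use the classical Dai–Yuan [52] and Hestenes–Stiefel formulas; \(\beta _k^{MDY}\) is the paper's modification and is not reproduced here. A minimal sketch of the two classical choices (variable names are ours):

```python
import numpy as np

def beta_dy(g_new, d_prev, y_prev):
    """Dai-Yuan beta: ||g_k||^2 / (d_{k-1}^T y_{k-1})."""
    return (g_new @ g_new) / (d_prev @ y_prev)

def beta_hs(g_new, d_prev, y_prev):
    """Hestenes-Stiefel beta: g_k^T y_{k-1} / (d_{k-1}^T y_{k-1})."""
    return (g_new @ y_prev) / (d_prev @ y_prev)
```

Here \(y_{k-1}=\nabla E_k-\nabla E_{k-1}\); under the Wolfe conditions the denominator \(d_{k-1}^{T}y_{k-1}\) is positive, so both formulas are well defined.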
For the first case, \(\beta _k=\beta _k^{MDY}\), assume that (39) holds for \(k-1\). By (24), (25), and the fact that \(-1\le \mu _k\le 1 \) by Sub-algorithm MDY, we have
From Sub-algorithm MDY, we can easily conclude that the inequality \(\mu _k l_{k-1}\ge 0\) holds. Then, using (27), (28), (29) and (40), we have
For the second case, \(\beta _k=\beta _k^{DY}\), by (24), (27) and Sub-algorithm MDY, we have
For the third case, \(\beta _k=|\beta _k^{HS}|\), we have by Sub-algorithm MDY and induction that
Furthermore, we have by (40) that
i.e.
This completes the proof. \(\square \)
Lemma 3
Suppose that the sequence \(\{{\mathbf{W}_\mathbf{k}}\}\) is generated by Algorithm MDY, and \(\{\mu _k\}\) and \(\{l_{k-1}\}\) are generated by Sub-algorithm MDY. Then, for sufficiently small \(\sigma \), the sufficient descent condition
holds for \(\kappa =\frac{1}{1+\sigma }.\)
Proof
From (41)
Let us consider the following three cases (A), (B) and (C).
(A) The case \(\theta _{k-1}\le 0\) and \(|\tilde{\mu }_k|\le 1\).
(A-1) For the sub-case \(\theta _{k-1}\le 0\) and \(0\le \tilde{\mu }_k\le 1\), equality (43) reduces to
By (25), we have
where \(\kappa =\frac{1}{1+\sigma }\).
(A-2) For the sub-case \(\theta _{k-1}\le 0\) and \(-1\le \tilde{\mu }_k<0\), (42) holds by a proof similar to that for (A-1).
(B) The case \(\theta _{k-1}> 0\) and \(|\tilde{\mu }_k|\le 1\).
(B-1) For the sub-case \(l_{k-1}=-\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\ge 0\) and \(0\le \tilde{\mu }_k\le 1\), by (41) we have
Then (42) holds by noticing (45) and (47).
(B-2) For the sub-case \(l_{k-1}=\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\le 0\) and \(-1\le \tilde{\mu }_k\le 0\), the result (42) holds by a proof similar to that in (B-1).
(B-3) For the sub-case \(l_{k-1}=0\) and \(|\tilde{\mu }_k|\le 1\), the proof is similar to that in (A).
(C) For the case \(|\tilde{\mu }_k|>1\), we have \(\mu _k=1, l_{k-1}=0\). Then \(\beta _k=\beta _k^{DY}\) or \(\beta _k=|\beta _k^{HS}|\). If \(\beta _k=\beta _k^{DY}\), the conclusion (42) holds by (25). If \(\beta _k=|\beta _k^{HS}|\), by (25), we have for sufficiently small \(\sigma \) that
This completes the proof. \(\square \)
The following Lemma 4 can be proved directly by using Lemmas 1 and 2; the proof is similar to that of Lemma 3.2 in [48].
Lemma 4
Suppose that Assumptions (A1) and (A2) hold, the sequence \(\{{\mathbf{W}_\mathbf{k}}\}\) is generated by (26) such that \(\mathbf{d_k}\) satisfies \(\nabla E_k^T\mathbf{d_k}<0\), and \(\eta _k\) is generated by the Wolfe conditions (24) and (25). Then the following Zoutendijk condition holds
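The display for condition (48) was lost in extraction; assuming the paper uses the Zoutendijk condition in its standard form (as stated, e.g., in Sun and Yuan [40]), it reads

```latex
\sum_{k\ge 0}\frac{\left(\nabla E_k^{T}\mathbf{d_k}\right)^{2}}{\Vert \mathbf{d_k}\Vert ^{2}}<\infty .
```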
Lemma 5
Let the sequence \(\{\mathbf{W}_\mathbf{k}\}\) be generated by Algorithm MDY. Then
Proof
By (28) and (27), we have
By (28), (29) and (40), we have
Next, we consider the following three cases (i)–(iii) to prove (49).
(i) For the case \(\theta _{k-1}\le 0\) and \(|\tilde{\mu }_k|\le 1\), equality (52) reduces to
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}. \end{aligned}$$
Furthermore, since \(|\tilde{\mu }_k|\le 1\), we have
$$\begin{aligned} \left| \beta _k^{MDY}\right| =\left| \mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\right| \le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$ (53)
(ii) For the case \(\theta _{k-1}> 0\) and \(|\tilde{\mu }_k|\le 1\), equality (52) reduces to
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\theta _{k-1}\right) . \end{aligned}$$ (54)
Now, we consider the following three sub-cases. The first sub-case is \(l_{k-1}=-\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\ge 0\) and \(0\le \tilde{\mu }_k\le 1\). Then, from (54), we have
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}(1+3\mu _k\theta _{k-1}). \end{aligned}$$
Furthermore,
$$\begin{aligned} \left| \beta _k^{MDY}\right| =\mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\frac{1}{1+3\mu _k\theta _{k-1}}\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$ (55)
The second sub-case is \(l_{k-1}=\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\le 0\) and \(-1\le \tilde{\mu }_k\le 0\). From (54), we have
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}(1-3\mu _k\theta _{k-1}), \end{aligned}$$
and thus
$$\begin{aligned} \left| \beta _k^{MDY}\right| =\mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\frac{1}{1-3\mu _k\theta _{k-1}}\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$ (56)
The third sub-case is \(l_{k-1}=0\) and \(|\tilde{\mu }_k|\le 1\); the proof is similar to that in (i).
(iii) For the case \(|\tilde{\mu }_k|>1\), we have \(\mu _k=1\) and \( l_{k-1}=0\). Then \(\beta _k=\beta _k^{DY}\) or \(\beta _k=|\beta _k^{HS}|\). For the case \(\beta _k=\beta _k^{DY}\), it is easy to prove that the equality
$$\begin{aligned} \beta _k^{DY}=\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}} \end{aligned}$$ (57)
holds. For the case \(\beta _k=|\beta _k^{HS}|\), by Sub-algorithm MDY,
$$\begin{aligned} \left| \beta _k^{HS}\right| \le \beta _k^{DY}=\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$ (58)
Finally, (49) holds by combining (53), (55), (56), (57) and (58).
\(\square \)
Proof of Theorem 2
Proof
We prove the theorem by contradiction. Suppose that the theorem is not true; then there exists a constant \(c>0\) such that
By Lemma 5, we have
By (50), (60) and (51), we have
It follows from
and (59) that
Furthermore,
which indicates that
This contradicts the Zoutendijk condition (48). Therefore, the conclusion (30) must hold. \(\square \)
Derivation of (19) for the adaptive parameter \(\tilde{\mu }_k\):
At the current epoch, the conjugate gradient direction (4) is set equal to the quasi-Newton direction (17):
When \(\theta _{k-1}\le 0\), using (10), we have
i.e.
Taking the inner product of the above equality with \(\mathbf{s_{k-1}}\), we have
By the modified secant condition (6), we have
The adaptive parameter \(\mu _k\) can be computed by
When \(\theta _{k-1}>0\), similar to the case of \(\theta _{k-1}\le 0\), the adaptive parameter \(\mu _k\) can be computed by
where \(\vartheta _k=3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{ \hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\).
In summary, the adaptive parameter \(\mu _k\) can be computed by the following formula:
where \(\vartheta _k=3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}\mathbf{s_{k-1}}\).
Cite this article
Li, W., Liu, Y., Yang, J. et al. A New Conjugate Gradient Method with Smoothing \(L_{1/2} \) Regularization Based on a Modified Secant Equation for Training Neural Networks. Neural Process Lett 48, 955–978 (2018). https://doi.org/10.1007/s11063-017-9737-9