A New Conjugate Gradient Method with Smoothing \(L_{1/2} \) Regularization Based on a Modified Secant Equation for Training Neural Networks

Abstract

This paper proposes a new conjugate gradient method with smoothing \(L_{1/2} \) regularization based on a modified secant equation for training neural networks, in which a descent search direction is generated and the learning rate is selected adaptively according to the strong Wolfe conditions. Two adaptive parameters are introduced so that the new training method possesses both the quasi-Newton property and the sufficient descent property. Numerical experiments on five benchmark classification problems from the UCI repository show that, compared with other conjugate gradient training algorithms, the new training algorithm has roughly the same or even better learning capacity, together with significantly better generalization capacity and network sparsity. Under mild assumptions, a global convergence result for the proposed training method is also proved.
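
To give a concrete picture of the training scheme summarized above, the sketch below is purely illustrative and is not the authors' implementation: E and grad_E stand for the network error and its gradient, the smoothing of \(|w|\) is one common piecewise-polynomial choice (consistent with the lower bound \(f\ge \frac{3}{8}\nu \) used in (36) of the Appendix), the step size is chosen by a strong Wolfe line search, and the plain Dai–Yuan formula is used as a stand-in for the paper's \(\beta _k^{MDY}\) rule and Sub-algorithm MDY.

```python
# Illustrative sketch only (not the authors' implementation): conjugate-gradient
# training with a smoothed L_{1/2} penalty and a strong Wolfe line search.
# E(w) and grad_E(w) are assumed to be the network error and its gradient;
# the Dai-Yuan beta below is a stand-in for the paper's modified beta_k^{MDY}.
import numpy as np
from scipy.optimize import line_search

def f_smooth(x, nu):
    """A common smoothing of |x| with f(x) >= 3*nu/8 (cf. the bound used in (36))."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    poly = -ax**4 / (8 * nu**3) + 3 * ax**2 / (4 * nu) + 3 * nu / 8
    return np.where(ax >= nu, ax, poly)

def f_smooth_prime(x, nu):
    """Derivative of the smoothed |x| above."""
    x = np.asarray(x, dtype=float)
    inner = -x**3 / (2 * nu**3) + 3 * x / (2 * nu)
    return np.where(np.abs(x) >= nu, np.sign(x), inner)

def regularized_error(w, E, lam, nu):
    """E(w) plus the smoothing L_{1/2} penalty lam * sum_i f(w_i)^{1/2}."""
    return E(w) + lam * np.sum(np.sqrt(f_smooth(w, nu)))

def regularized_grad(w, grad_E, lam, nu):
    return grad_E(w) + lam * f_smooth_prime(w, nu) / (2.0 * np.sqrt(f_smooth(w, nu)))

def train_cg(w, E, grad_E, lam=1e-4, nu=0.1, max_epochs=200, tol=1e-6):
    obj = lambda v: regularized_error(v, E, lam, nu)
    grad = lambda v: regularized_grad(v, grad_E, lam, nu)
    g = grad(w)
    d = -g                                     # first direction: steepest descent
    for _ in range(max_epochs):
        # strong Wolfe line search for the learning rate eta_k
        eta = line_search(obj, grad, w, d, gfk=g, c1=1e-4, c2=0.5)[0]
        if eta is None:                        # line search failed: restart
            d, eta = -g, 1e-3
        w = w + eta * d
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (d @ (g_new - g))   # Dai-Yuan rule (stand-in)
        d = -g_new + beta * d
        g = g_new
    return w
```

Replacing the Dai–Yuan rule by the modified \(\beta _k^{MDY}\) of the paper would only change the beta line and add the updates of \(\mu _k\) and \(l_{k-1}\) performed by Sub-algorithm MDY.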

References

  1. Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford

  2. Hmich A, Badri A, Sahel A (2011) Automatic speaker identification by using the neural network. In: Proceeding of 2011 IEEE international conference on multimedia computing and systems (ICMCS), pp 1–5

  3. Zhou W, Zurada JM (2010) Competitive layer model of discrete-time recurrent neural networks with LT neurons. Neural Comput 22:2137–2160

  4. Li EY (1994) Artificial neural networks and their business applications. Inf Manag 27:303–313

  5. Svozil D, Kvasnicka V, Pospichal J (1997) Introduction to multi-layer feed-forward neural networks. Chemom Intell Lab 39(1):43–62

  6. Wu W (2003) Computation of neural networks. Higher Education Press, Beijing

  7. Xu ZB, Chang XY, Xu FM, Zhang H (2012) \(L_{1/2}\) regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Netw Learn Syst 23(7):1013–1027

  8. Wu W, Fan QW, Zurada JM, Wang J, Yang DK, Liu Y (2014) Batch gradient method with smoothing \(L_{1/2}\) regularization for training of feedforward neural networks. Neural Netw 50:72–78

  9. Reed R (1993) Pruning algorithms a survey. IEEE Trans Neural Netw 4:740–747

  10. Sakar A, Mammone RJ (1993) Growing and pruning neural tree networks. IEEE Trans Comput 42(3):291–299

  11. Arribas JI, Cid-Sueiro J (2005) A model selection algorithm for a posteriori probability estimation with neural networks. IEEE Trans Neural Netw 16(4):799–809

  12. Hinton G (1989) Connectionist learning procedures. Artif Intell 40:185–235

  13. Hertz J, Krogh A, Palmer R (1991) Introduction to the theory of neural computation. Addison Wesley, Redwood City

  14. Wang C, Venkatesh SS, Judd JS (1994) Optimal stopping and effective machine complexity in learning. Adv Neural Inf Process Syst 6:303–310

  15. Bishop CM (1995) Regularization and complexity control in feedforward networks. In: Proceedings of international conference on artificial neural networks ICANN’95. EC2 et Cie, pp 141–148

  16. Nowlan SJ, Hinton GE (1992) Simplifying neural networks by soft weight-sharing. Neural Comput 4(4):473–493

  17. Fogel DB (1991) An information criterion for optimal neural network selection. IEEE Trans Neural Netw 2(2):490–7

  18. Seghouane AK, Amari SI (2007) The AIC criterion and symmetrizing the Kullback–Leibler divergence. IEEE Trans Neural Netw 18(1):97–106

  19. Ishikawa M (1996) Structural learning with forgetting. Neural Netw 9:509–521

  20. Mc Loone S, Irwin G (2001) Improving neural network training solutions using regularisation. Neurocomputing 37:71–90

  21. Saito K, Nakano R (2000) Second-order learning algorithm with squared penalty term. Neural Comput 12:709–729

  22. Wu W, Shao HM, Li ZX (2006) Convergence of batch BP algorithm with penalty for FNN training. In: King I, Wang J, Chan L, Wang DL (eds) Neural information processing. Springer, Berlin, pp 562–569

  23. Cid-Sueiro J, Arribas JI, Urbn-Muñoz S, Figueiras-Vidal AR (1999) Cost functions to estimate a posteriori probabilities in multiclass problems. IEEE Trans Neural Netw 10(3):645–656

  24. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

  25. Natarajan BK (1995) Sparse approximate solutions to linear systems. Siam J Sci Comput 24:227–234

  26. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58:267–288

  27. Yuan GX, Ho CH, Lin CJ (2012) Recent advances of large-scale linear classification. Proc IEEE 100(9):2584–2603

  28. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Methodol) (Stat Methodol) 67:301–320

  29. Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, Heidelberg

  30. Donoho DL (2006) For most large underdetermined systems of linear equations the minimal \(l_1\)-norm solution is also the sparsest solution. Commun Pure Appl Math 59(6):797–829

  31. Xu ZB, Zhang H, Wang Y, Chang XY (2010) \(L_{1/2}\) regularizer. Sci China Ser F Inf Sci 52:1–9

  32. Igel C, Hüsken M (2003) Empirical evaluation of the improved Rprop learning algorithms. Neurocomputing 50:105–123

  33. Rumelhart DE, McClelland JL, PDP Research Group (1986) Parallel distributed processing: explorations in the microstructure of cognition: psychological and biological models. MIT Press, Cambridge

  34. Vogl TP, Mangis JK, Rigler AK, Zink WT, Alkon DL (1988) Accelerating the convergence of the back-propagation method. Biol Cybern 59:257–263

  35. Shao H, Zheng G (2011) Convergence analysis of a back-propagation algorithm with adaptive momentum. Neurocomputing 74:749–752

  36. Sun W, Yuan Y (2006) Optimization theory and methods nonlinear programming. Springer, New York

  37. Livieris IE, Pintelas P (2013) A new conjugate gradient algorithm for training neural networks based on a modified secant equation. Appl Math Comput 221:491–502

  38. Jiang M, Gielen G, Zhang B, Luo Z (2003) Fast learning algorithms for feedforward neural networks. Appl Intell 18:37–54

  39. Battiti R (1989) Accelerated backpropagation learning: two optimization methods. Complex Syst 3(4):331–342

  40. Battiti R (1992) First- and second-order methods for learning: between steepest descent and Newton’s method. Neural Comput 4(2):141–166

  41. Johansson EM, Dowla FU, Goodman DM (1991) Backpropagation learning for multilayer feed-forward neural networks using the conjugate gradient method. Int J Neural Syst 2(4):291–301

  42. Charalambous C (1992) Conjugate gradient algorithm for efficient training of artificial neural networks. Circuits Device Syst IEE Proc G 139(3):301–310

  43. Adeli H, Hung SL (1994) An adaptive conjugate gradient learning algorithm for efficient training of neural networks. Appl Math Comput 62:81–102

  44. Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6:525–533

  45. Kostopoulos AE, Grapsa TN (2009) Self-scaled conjugate gradient training algorithms. Neurocomputing 72:3000–3019

  46. Barzilai J, Borwein JM (1988) Two point step size gradient methods. IMA J Numer Anal 8:141–148

  47. Wang J, Wu W, Zurada JM (2011) Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing 74:2368–2376

  48. Dai YH, Yuan Y (1999) A nonlinear conjugate gradient method with a strong global convergence property. Siam J Optim 10:177–182

  49. Xu C, Zhang J (2001) A survey of quasi-Newton equations and quasi-Newton methods for optimization. Ann Oper Res 103:213–234

  50. Yabe H, Sakaiwa N (2005) A new nonlinear conjugate gradient method for unconstrained optimization. J Oper Res Soc Jpn Keiei Kagaku 48:284–296

  51. Li WY, Wu W (2015) A parameter conjugate gradient method based on secant equation for unconstrained optimization. J Inf Comput Sci 12(16):5865–5871

  52. Fan Q, Zurada JM, Wu W (2014) Convergence of online gradient method for feedforward neural networks with smoothing \(L_{1/2}\) regularization penalty. Neurocomputing 131:208–216

  53. Liu Y, Wu W, Fan Q, Yang D, Wang J (2014) A modified gradient learning algorithm with smoothing \(L_{1/2}\) regularization for Takagi–Sugeno fuzzy models. Neurocomputing 138:229–237

  54. Dai YH, Yuan Y (2001) An efficient hybrid conjugate gradient method for unconstrained optimization. Ann Oper Res 103:33–47

  55. http://www.cs.toronto.edu/~hinton

  56. Blake C, Merz C (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLReposito

Acknowledgements

The authors thank the editor and the anonymous reviewers for their careful reading and thoughtful comments. This research is supported by the National Natural Science Foundation of China (Projects 11171367, 91230103, 61473059 and 61403056), and the Fundamental Research Funds for the Central Universities of China.

Author information

Correspondence to Wei Wu.

Appendices

Appendix

Lemma 1

Suppose that Assumptions (A1) and (A2) hold. Then \(\nabla E(\mathbf {W})\) is Lipschitz continuous in a neighborhood \(S\) of the level set \(\varGamma \), i.e., there exists a constant \(L\ge 0\) such that

$$\begin{aligned} ||\nabla E(\mathbf{W^1})-\nabla E(\mathbf{W^2})||\le L||\mathbf{W^1}-\mathbf{W^2}||, \forall ~ {\mathbf{W^1}, \mathbf{W^2}}\in S. \end{aligned}$$
(31)

Proof

Note that \(\mathbf{G}(\mathbf{U}{\varvec{\zeta }}^{j})=(\mathbf{G}({ \mathbf{w}^{T}_\mathbf{1}}{\varvec{\zeta }}^{j}),\mathbf{G}({ \mathbf{w}^{T}_\mathbf{2}}{\varvec{\zeta }}^{j}),\ldots ,\mathbf{G}({ \mathbf{w}^{T}_\mathbf{q}}{\varvec{\zeta }}^{j}))^T. \) By the Lagrange mean value theorem and the Cauchy–Buniakowsky–Schwarz inequality, we have

$$\begin{aligned} \begin{array}{llll} \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert &{}=\left( \sum _{i=1}^{q}\left( \mathbf{G}\left( {\mathbf{w^1_i}}^T{\varvec{\zeta }^j}\right) -\mathbf{G}\left( {\mathbf{w^2_i}}^T{\varvec{\zeta }^j}\right) \right) ^2\right) ^\frac{1}{2}\\ &{}\le \left( \sum _{i=1}^{q}(g'(\xi _i))^2\left( {\mathbf{w^1_i}}^T{\varvec{\zeta }^j}-{\mathbf{w^2_i}}^T{\varvec{\zeta }^j}\right) ^2\right) ^\frac{1}{2}\\ &{}\le M\Vert {\varvec{\zeta }^j}\Vert \left( \sum _{i=1}^{q}\Vert \mathbf{w^1_i}-\mathbf{w^2_i}\Vert ^2\right) ^\frac{1}{2}\\ &{}\le M\Vert {\varvec{\zeta }^j}\Vert \sum _{i=1}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| , \end{array} \end{aligned}$$
(32)

where \(\xi _i\in ({\mathbf{w^1_i}}^T{\varvec{\zeta }^j}, {\mathbf{w^2_i}}^T{\varvec{\zeta }^j}), i=1, 2, \ldots , q.\)

From the properties of the activation functions, we know that \(\mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}), \mathbf{h'_j}(\mathbf{w_0}\cdot \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}))\) and \( g'(\mathbf{w_i}\cdot {\varvec{\zeta }^j})\) are bounded in \(S\), where \(j=1, 2, \ldots , J\) and \( i=1, 2, \ldots , q\). This means that there exists a constant \(M\) such that \(\Vert \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j})\Vert \le M, \Vert \mathbf{h'_j}(\mathbf{w_0}\cdot \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}))\Vert \le M ~\text{ and }~ |g'(\mathbf{w_i}\cdot {\varvec{\zeta }^j})|\le M.\) Under Assumptions (A1) and (A2), by (32), we have

$$\begin{aligned}&\left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right\| \nonumber \\&\quad \le \left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) -\mathbf{h'_j}(\mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j}))\right\| \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \nonumber \\&\qquad + \, \left| \mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \right| \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert \nonumber \\&\quad \le L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \left\| \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right\| \nonumber \\&\qquad + \, M\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert \nonumber \\&\quad \le L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert ^2\left\| \mathbf{w^1_0}-\mathbf{w^2_0}\right\| \nonumber \\&\qquad + \, \left( L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \left\| \mathbf{w^2_0}\right\| +M\right) \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert \nonumber \\&\quad \le L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert ^2\left\| \mathbf{w^1_0}-\mathbf{w^2_0}\right\| \nonumber \\&\qquad + \, MN_1\Vert {\varvec{\zeta }^j}\Vert \sum _{i=1}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| \nonumber \\&\quad \le \max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}))\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}\left( \left\| \mathbf{w^1_0}-\mathbf{w^2_0}\right\| +\sum _{i=1}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| \right) \nonumber \\&\quad \le \sqrt{q+1}\max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}))\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}\left( \sum _{i=0}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| ^2\right) ^\frac{1}{2}\nonumber \\&\quad \le L_3\Vert \mathbf{W^1}-\mathbf{W^2}\Vert , \end{aligned}$$
(33)

where \(N_1=L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \Vert \mathbf{w^2_0}\Vert +M, L_3=\sqrt{q+1}\max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}.\)

Using (33), we have

$$\begin{aligned} \begin{array}{ll} &{}\Vert \nabla \bar{E}(\mathbf{W^1})_{\mathbf{w_0}}-\nabla \bar{E}(\mathbf{W^2})_{\mathbf{w_0}}\Vert \\ &{}\quad \le \sum _{j=1}^{J}\left( \left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right\| \right) \\ &{}\quad \le JL_3\Vert \mathbf{W^1}-\mathbf{W^2}\Vert . \end{array} \end{aligned}$$
(34)

Similarly we have for some constant \(L_4 \) that

$$\begin{aligned} \begin{array}{llll} \Vert \nabla \bar{E}(\mathbf{W^1})_{\mathbf{w_i}}-\nabla \bar{E}(\mathbf{W^2})_{\mathbf{w_i}}\Vert &{}\le \sum _{j=1}^{J}\left( \left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) \mathbf{w^1_{i0}} \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}){\varvec{\zeta }^j}\right. \right. \\ &{}\quad \left. \left. - \, \mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \mathbf{w^2_{i0}} \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j}){\varvec{\zeta }^j}\right\| \right) \\ &{}\le JL_4\Vert \mathbf{W^1}-\mathbf{W^2}\Vert , \end{array} \end{aligned}$$
(35)

where \(i=1, 2, \ldots , q.\)

By (14) and the Lagrange mean value theorem, when the positive constant \(\nu \) is sufficiently small, we have for any \(\mathbf {x}\) and \(\mathbf {y}\) that

$$\begin{aligned} \begin{array}{llllll} \left| \frac{f'(\mathbf{x})}{2f(\mathbf{x})^{\frac{1}{2}}}-\frac{f'(\mathbf{y})}{2f(\mathbf{y})^{\frac{1}{2}}}\right| &{}\le \left| \frac{1}{2f(\mathbf{x})^{\frac{1}{2}}}-\frac{1}{2f(\mathbf{y})^{\frac{1}{2}}}\right| |f'(\mathbf{x})|+\left| \frac{1}{2f(\mathbf{y})^{\frac{1}{2}}}(f'(\mathbf{x})-f'(\mathbf{y}))\right| \\ &{}\le \left| \frac{1}{2f(\mathbf{x})^{\frac{1}{2}}}-\frac{1}{2f(\mathbf{y})^{\frac{1}{2}}}\right| +\frac{1}{2\frac{3}{8}\nu }\frac{3}{2\nu }|\mathbf{x}-\mathbf{y}|\\ &{}\le \left| \frac{|f(\mathbf{y})^{\frac{1}{2}}-f(\mathbf{x})^{\frac{1}{2}}|}{2|f(\mathbf{x})^{\frac{1}{2}}f(\mathbf{y})^{\frac{1}{2}}|}\right| +\frac{2}{\nu ^2}|\mathbf{x}-\mathbf{y}|\\ &{}=\frac{|f(\mathbf{y})-f(\mathbf{x})|}{2|(f(\mathbf{y})^{\frac{1}{2}}+f(\mathbf{x})^{\frac{1}{2}})f(\mathbf{x})^{\frac{1}{2}}f(\mathbf{y})^{\frac{1}{2}}|}+\frac{2}{\nu ^2}|\mathbf{x}-\mathbf{y}|\\ &{}\le \frac{1}{4\frac{3}{8}\nu \sqrt{\frac{3}{8}\nu }}|\mathbf{x}-\mathbf{y}|+\frac{2}{\nu ^2}|\mathbf{x}-\mathbf{y}|\\ &{}\le L_5|\mathbf{x}-\mathbf{y}|, \end{array} \end{aligned}$$
(36)

where \(L_5=(\frac{2}{\nu ^2}+\frac{4}{9\nu ^2}\sqrt{6\nu }).\)
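
For the reader's convenience, the value of \(L_5\) follows from the penultimate line of (36) by elementary arithmetic:

$$\begin{aligned} \frac{1}{4\cdot \frac{3}{8}\nu \sqrt{\frac{3}{8}\nu }}=\frac{1}{\frac{3}{2}\nu \cdot \frac{\sqrt{3\nu }}{2\sqrt{2}}}=\frac{4\sqrt{2}}{3\nu \sqrt{3\nu }}=\frac{4\sqrt{6}}{9\nu ^{3/2}}=\frac{4}{9\nu ^{2}}\sqrt{6\nu }, \end{aligned}$$

so that \(L_5=\frac{2}{\nu ^2}+\frac{4}{9\nu ^2}\sqrt{6\nu }\) as stated.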

Using (36), we have

$$\begin{aligned} \begin{array}{lll} \Vert \nabla F(\mathbf{{W^1}})-\nabla F(\mathbf {W^2})\Vert &{}=\left( \sum _{i=1}^{q}\sum _{k=0}^p \left( \frac{f'(\mathbf{w^1_{ik}})}{2f(\mathbf{w^1_{ik}})^{\frac{1}{2}}}-\frac{f'(\mathbf{w^2_{ik}})}{2f(\mathbf{w^2_{ik}})^{\frac{1}{2}}}\right) ^2\right) ^{\frac{1}{2}}\\ &{}\le q(1+p)L_5||\mathbf{W^1}-\mathbf{W^2}||. \end{array} \end{aligned}$$
(37)

From (34), (35) and (37), we have

$$\begin{aligned} \begin{array}{llll} ||\nabla E(\mathbf{W^1})-\nabla E(\mathbf{W^2})||&{}\le \Vert \nabla \bar{E}(\mathbf{W^1})_\mathbf{w_0}-\nabla \bar{E}(\mathbf{W^2})_\mathbf{w_0}\Vert \\ &{}\quad + \, \sum _{i=1}^q\Vert \nabla \bar{E}(\mathbf{W^1})_\mathbf{w_i}-\nabla \bar{E}(\mathbf{W^2})_\mathbf{w_i}\Vert \\ &{}\quad + \, \Vert \nabla F(\mathbf{W^1})-\nabla F(\mathbf{W^2})\Vert \\ &{}\le L\Vert \mathbf{W^1}-\mathbf{W^2}\Vert , \end{array} \end{aligned}$$
(38)

where the Lipschitz constant \(L=JL_3+qJL_4+q(1+p)L_5.\) This completes the proof. \(\square \)

The following lemma shows that our training algorithm always generates descent directions.

Lemma 2

Let the sequence \(\{W_k\}\) be generated by Algorithm MDY. Then the following estimate holds for all \(k\ge 0\):

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}<0 \end{aligned}$$
(39)

Proof

Let us prove the Lemma by an induction argument. It is obvious that \(\nabla E_k^T\mathbf{d_k}=-||\nabla E_k||^2<0\) holds for \(k=0\).

By Sub-algorithm MDY, there are three possible choices of \(\beta _k\): \(\beta _k=\beta _k^{MDY}\), \(\beta _k=\beta _k^{DY}\), and \(\beta _k=|\beta _k^{HS}|\).

For the first case, \(\beta _k=\beta _k^{MDY}\), assume that (39) holds for \(k-1\). By (24), (25), and the fact that \(-1\le \mu _k\le 1 \) in Sub-algorithm MDY, we have

$$\begin{aligned} \begin{array}{ll} \mu _k\nabla E^T_k\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}} &{}\ge \mu _k\sigma {\nabla E^T_{k-1}}{} \mathbf{d_{k-1}}-{\nabla E^T_{k-1}} \mathbf{d_{k-1}}\\ &{}={(\mu _k\sigma -1)}{\nabla E^T_{k-1}}{} \mathbf{d_{k-1}}>0. \end{array} \end{aligned}$$
(40)

From Sub-algorithm MDY, we can easily conclude that \(\mu _k l_{k-1}\ge 0\). Then, using (27), (28), (29) and (40), we have

$$\begin{aligned} \begin{array}{lllll} \nabla E_k^T\mathbf{d_k}&{}=-||\nabla E_k||^2+\frac{\mu _k ||\nabla E_k||^2}{\mu _k\varrho _{k-1}+(\mu _k-1)\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\nabla E_{k}}^T\mathbf{d_{k-1}}\\ &{}=-||\nabla E_k||^2+\frac{\mu _k ||\nabla E_k||^2}{\mu _k({\mathbf{y}^{T}_{\mathbf{k}-\mathbf{1}}}\mathbf{d_{k-1}}+\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\})+(\mu _k-1)\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\nabla E_{k}}^T\mathbf{d_{k-1}} \\ &{}= ||\nabla E_k||^2 \frac{-\mu _k {\mathbf{y}^{T}_{\mathbf{k}-\mathbf{1}}}\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}-\mu _k \nabla E^T_{k-1}{} \mathbf{d_{k-1}}+\nabla E^T_{k-1}{} \mathbf{d_{k-1}}+\mu _k \nabla E_k^T\mathbf{d_{k-1}}}{\mu _k {\mathbf{y}^{T}_{\mathbf{k}-\mathbf{1}}}\mathbf{d_{k-1}}+(\mu _k-1)\nabla E^T_{k-1}\mathbf{d_{k-1}}+\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}\\ &{}= ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}\mathbf{d_{k-1}}+\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}<0.\\ \end{array}\nonumber \\ \end{aligned}$$
(41)

For the second case, \(\beta _k=\beta _k^{DY}\), by (24), (27) and Sub-algorithm MDY, we have

$$\begin{aligned} \begin{array}{ll} \nabla E_k^T\mathbf{d_k} &{}= - \, \Vert \nabla E_k\Vert ^2+\frac{\Vert \nabla E_k\Vert ^2\nabla E_k^T\mathbf{d_{k-1}}}{\nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}}\\ &{}=\frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}}\Vert \nabla E_k\Vert ^2<0. \end{array} \end{aligned}$$

For the third case, \(\beta _k=|\beta _k^{HS}|\), by Sub-algorithm MDY we have by induction that

$$\begin{aligned} \left| \beta _k^{HS}\right| \le \beta _k^{DY}. \end{aligned}$$

Furthermore, we have by (40) that

$$\begin{aligned} \left| \beta _k^{HS}\right| \nabla E_k^T\mathbf{d_{k-1}}<\left| \beta _k^{HS}\right| \left( \nabla E_k^T\mathbf{d_{k-1}}-\nabla E_{k-1}^T\mathbf{d_{k-1}}\right) \le \Vert \nabla E_k\Vert ^2. \end{aligned}$$

i.e.

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}=- \, \Vert \nabla E_k\Vert ^2+\left| \beta _k^{HS}\right| \nabla E_k^T\mathbf{d_{k-1}}<0. \end{aligned}$$

This completes the proof. \(\square \)
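
The descent property of Lemma 2 is easy to check numerically. The sketch below is an illustration only, not part of the proof; it assumes the standard Dai–Yuan and Hestenes–Stiefel formulas \(\beta _k^{DY}=\Vert \nabla E_k\Vert ^2/(\mathbf{d_{k-1}}^T\mathbf{y_{k-1}})\) and \(\beta _k^{HS}=\nabla E_k^T\mathbf{y_{k-1}}/(\mathbf{d_{k-1}}^T\mathbf{y_{k-1}})\), samples gradient pairs satisfying the curvature part of the strong Wolfe condition (25), and verifies \(\nabla E_k^T\mathbf{d_k}<0\) for the \(\beta _k^{DY}\) and \(|\beta _k^{HS}|\) cases treated above.

```python
# Numerical illustration of Lemma 2 (a sanity check, not part of the proof),
# assuming the standard Dai-Yuan and Hestenes-Stiefel formulas.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.4                                    # curvature parameter in (25)
checked = 0
for _ in range(100_000):
    g_prev = rng.normal(size=5)                # gradient at step k-1
    d_prev = rng.normal(size=5)                # previous search direction
    if g_prev @ d_prev >= 0:                   # make d_{k-1} a descent direction
        d_prev = -d_prev
    g = rng.normal(size=5)                     # candidate gradient at step k
    # keep only samples satisfying the strong Wolfe curvature condition
    if abs(g @ d_prev) > -sigma * (g_prev @ d_prev):
        continue
    y = g - g_prev
    beta_dy = (g @ g) / (d_prev @ y)           # Dai-Yuan
    beta_hs = (g @ y) / (d_prev @ y)           # Hestenes-Stiefel
    betas = [beta_dy]
    if abs(beta_hs) <= beta_dy:                # the case in which |HS| is used above
        betas.append(abs(beta_hs))
    for beta in betas:
        d = -g + beta * d_prev
        assert g @ d < 0                       # descent property of Lemma 2
    checked += 1
print(f"descent property held on {checked} sampled cases")
```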

Lemma 3

Suppose that the sequence \(\{{\mathbf{W}_\mathbf{k}}\}\) is generated by Algorithm MDY, and \(\{\mu _k\}\) and \(\{l_{k-1}\}\) are generated by Sub-algorithm MDY. Then, for sufficiently small \(\sigma \), the sufficient descent condition

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}<- \, \kappa \Vert \nabla E_k\Vert ^2. \end{aligned}$$
(42)

holds for \(\kappa =\frac{1}{1+\sigma }.\)

Proof

From (41), we have

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}= ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}+\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}. \end{aligned}$$
(43)

Let us consider the following three cases (A), (B) and (C).

(A) The case \(\theta _{k-1}\le 0\) and \(|\tilde{\mu }_k|\le 1\).

(A-1) For the sub-case \(\theta _{k-1}\le 0\) and \(0\le \tilde{\mu }_k\le 1\), equality (43) reduces to

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}= ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}\mathbf{d_{k-1}}}. \end{aligned}$$
(44)

By (25), we have

$$\begin{aligned} \sigma \nabla E_k^T \mathbf{d_k}\le \nabla E(\mathbf{{W}}_k+\eta _k\mathbf{d_k})^T\mathbf{d_k} \le -\sigma \nabla E_k^T \mathbf{d_k}. \end{aligned}$$
(45)

Using Lemma 2, (44) and (45), we obtain

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}\le -\kappa \Vert \nabla E_k\Vert ^2, \end{aligned}$$
(46)

where \(\kappa =\frac{1}{1+\sigma }\).

(A-2) For the sub-case \(\theta _{k-1}\le 0\) and \(-1\le \tilde{\mu }_k<0\), (42) holds by a proof similar to that for (A-1).

(B) The case \(\theta _{k-1}> 0\) and \(|\tilde{\mu }_k|\le 1\).

(B-1) For the sub-case \(l_{k-1}=-\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\ge 0\) and \(0\le \tilde{\mu }_k\le 1\), by (41) we have

$$\begin{aligned} \nabla E_k^T\mathbf{d_k} = ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}+{3\mu _k\theta _{k-1}\nabla E_{k-1}^T\mathbf{d_{k-1}}}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}-{3\mu _k\theta _{k-1}\nabla E_{k-1}^T\mathbf{d_{k-1}}}}. \end{aligned}$$
(47)

Then, (42) holds by noticing (45) and (47).

(B-2) For the sub-case \(l_{k-1}=\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\le 0\) and \(-1\le \tilde{\mu }_k\le 0\), the result (42) holds by a proof similar to that in (B-1).

(B-3) For the sub-case \(l_{k-1}=0\) and \(|\tilde{\mu }_k|\le 1\), the proof is similar to that in (A).

(C) For the case \(|\tilde{\mu }_k|>1\), we have \(\mu _k=1, l_{k-1}=0\). Then \(\beta _k=\beta _k^{DY}\) or \(\beta _k=|\beta _k^{HS}|\). If \(\beta _k=\beta _k^{DY}\), the conclusion (42) holds by (25). If \(\beta _k=|\beta _k^{HS}|\), by (25), we have for sufficiently small \(\sigma \) that

$$\begin{aligned} \begin{array}{ll} \nabla E_k^T\mathbf{d_k}&{}=- \, \Vert \nabla E_k\Vert ^2+\left| \beta _k^{HS}\right| \nabla E_k^T\mathbf{d_{k-1}}\\ &{}\le - \, \Vert \nabla E_k\Vert ^2-\left| \beta _k^{HS}\right| \sigma \nabla E_{k-1}^T\mathbf{d_{k-1}}\\ &{}\le - \, \frac{1}{1+\sigma }\Vert \nabla E_k\Vert ^2. \end{array} \end{aligned}$$

This completes the proof. \(\square \)

The following Lemma 4 can be directly proved by using Lemmas 1 and 2. Actually, the proof is similar to that for Lemma 3.2 in [48].

Lemma 4

Suppose that Assumptions (A1) and (A2) hold, the sequence \(\{{\mathbf{W}_\mathbf{k}}\}\) is generated by (26) such that \(\mathbf{d_k}\) satisfies \(\nabla E_k^T\mathbf{d_k}<0\), and \(\eta _k\) is generated by the Wolfe conditions (24) and (25). Then the following Zoutendijk condition holds

$$\begin{aligned} \sum _{k=0}^{+\infty }\frac{{\left( \nabla E_k^T\mathbf{d_k}\right) }^2}{||\mathbf{d_k}||^2}<+ \, \infty . \end{aligned}$$
(48)

Lemma 5

Let the sequence \(\{W_k\}\) be generated by Algorithm MDY. Then

$$\begin{aligned} |\beta _k|\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
(49)

Proof

By (28) and (27), we have

$$\begin{aligned} {||\mathbf{d_k}||}^2={\left( \beta _k^{MDY}\right) }^2{||\mathbf{d_{k-1}}||}^2-2{\nabla E^T_k}{} \mathbf{d_k}-||\nabla E_k||^2. \end{aligned}$$
(50)
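
For readers checking the algebra, (50) follows by expanding the squared norm of the update direction; the two lines below assume, as in the rest of the appendix, that the direction update is \(\mathbf{d_k}=-\nabla E_k+\beta _k^{MDY}\mathbf{d_{k-1}}\):

$$\begin{aligned} ||\mathbf{d_k}||^2&=||-\nabla E_k+\beta _k^{MDY}\mathbf{d_{k-1}}||^2=||\nabla E_k||^2-2\beta _k^{MDY}\nabla E_k^T\mathbf{d_{k-1}}+{\left( \beta _k^{MDY}\right) }^2{||\mathbf{d_{k-1}}||}^2\\ &=||\nabla E_k||^2-2\left( \nabla E_k^T\mathbf{d_k}+||\nabla E_k||^2\right) +{\left( \beta _k^{MDY}\right) }^2{||\mathbf{d_{k-1}}||}^2, \end{aligned}$$

which is exactly (50), since the same update gives \(\beta _k^{MDY}\nabla E_k^T\mathbf{d_{k-1}}=\nabla E_k^T\mathbf{d_k}+||\nabla E_k||^2\).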

By (28), (29) and (40), we have

$$\begin{aligned} \begin{array}{ll} \beta _{k+1}^{MDY}&{}=\frac{\mu _k ||\nabla E_{k+1}||^2}{\mu _k\varrho _{k}+(\mu _k-1)\nabla E_{k}^T\mathbf {d_{k}}}\\ &{}=\frac{\mu _k ||\nabla E_{k+1}||^2}{\mu _k\left( {\mathbf{d}^{T}_\mathbf{k}}\mathbf{y_k}+\frac{3l_k}{\eta _k}\max \{\theta _k,0\}\right) +(\mu _k-1)\nabla E_{k}^T\mathbf {d_{k}}}\\ &{}=\frac{\mu _k ||\nabla E_{k+1}||^2}{\mu _k\nabla E^T_{k+1}\mathbf{d_{k}}-\nabla E^T_{k}\mathbf{d_{k}}+\mu _k\frac{3l_k}{\eta _k}\max \{\theta _k,0\}}.\\ \end{array} \end{aligned}$$
(51)

By (41) and (51), we have

$$\begin{aligned} \begin{array}{lll} \mu _k\nabla E_k^T\mathbf{d_k}&{}=\frac{\mu _k||\nabla E_k||^2\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}\right) }{\mu _k\varrho _{k-1}+(\mu _k-1)\nabla E_{k-1}^T\mathbf{d_{k-1}}}\\ &{}=\beta _k^{MDY}\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}\right) .\\ \end{array} \end{aligned}$$
(52)

Next, we consider the following three cases (i)–(iii) to prove (49).

  1. (i)

    For the case \(\theta _{k-1}\le 0\) and \(|\tilde{\mu }_k|\le 1\), equality (52) reduces to

    $$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}. \end{aligned}$$

    Furthermore, since \(|\tilde{\mu }_k|\le 1\) we have

    $$\begin{aligned} \left| \beta _k^{MDY}\right| =\left| \mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\right| \le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
    (53)
  2. (ii)

    For the case \(\theta _{k-1}> 0\) and \(|\tilde{\mu }_k|\le 1\), equality (52) reduces to

    $$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\theta _{k-1}\right) . \end{aligned}$$
    (54)

    Now, we consider the following three sub-cases. The first sub-case is \(l_{k-1}=-\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\ge 0\) and \(0\le \tilde{\mu }_k\le 1\). Then, from (54) we have

    $$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}(1+3\mu _k\theta _{k-1}). \end{aligned}$$

    Furthermore,

    $$\begin{aligned} \left| \beta _k^{MDY}\right| =\mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\frac{1}{1+3\mu _k\theta _{k-1}}\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
    (55)

    The second sub-case is \(l_{k-1}=\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\le 0\) and \(-1\le \tilde{\mu }_k\le 0\). From (54) we have

    $$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}(1-3\mu _k\theta _{k-1}), \end{aligned}$$

    and thus

    $$\begin{aligned} \left| \beta _k^{MDY}\right| =\mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\frac{1}{1-3\mu _k\theta _{k-1}}\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
    (56)

    The third sub-case is \(l_{k-1}=0\) and \(|\tilde{\mu }_k|\le 1\); the proof here is similar to that in (i).

  3. (iii)

    For the case \(|\tilde{\mu }_k|>1\), we have \(\mu _k=1\) and \( l_{k-1}=0\). Then, \(\beta _k=\beta _k^{DY}\) or \(\beta _k=|\beta _k^{HS}|\). For the case \(\beta _k=\beta _k^{DY}\), it is easy to prove that the equality

    $$\begin{aligned} \beta _k^{DY}=\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}} \end{aligned}$$
    (57)

    holds. For the case \(\beta _k=|\beta _k^{HS}|\), by Sub-algorithm MDY,

    $$\begin{aligned} \left| \beta _k^{HS}\right| \le \beta _k^{DY}=\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
    (58)

    Finally, (49) holds by combining (53), (55), (56), (57) and (58).

\(\square \)

Proof of Theorem 2

Proof

We will prove the theorem by contradiction. If the theorem is not true, then there exists a constant \(c>0\) such that

$$\begin{aligned} ||\nabla E_k||\ge c, ~k=0,1,2,\ldots \end{aligned}$$
(59)

By Lemma 5, we have

$$\begin{aligned} |\beta _k|^2\le \frac{\left( \nabla E_k^T\mathbf{d_k}\right) ^2}{\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}\right) ^2}. \end{aligned}$$
(60)

By (50), (60) and (51), we have

$$\begin{aligned} \frac{||\mathbf{d_k}||^2}{\left( {\nabla E^T_k}\mathbf{d_k}\right) ^2}\le & {} \frac{\left( \nabla E_k^T\mathbf{d_k}\right) ^2}{\left( \nabla E^T_{k-1}{} \mathbf{d_{k-1}}\right) ^2}\frac{{||\mathbf{d_{k-1}}||}^2}{{\left( {\nabla E^T_k}{} \mathbf{d_k}\right) }^2}-\frac{2}{{\nabla E^T_k}\mathbf{d_k}}-\frac{||\nabla E_k||^2}{{({\nabla E^T_k}{} \mathbf{d_k})}^2}\\= & {} {\frac{{||\mathbf{d_{k-1}}||}^2}{{\left( {\nabla E^T_{k-1}}\mathbf{d_{k-1}}\right) }^2}-{\left( \frac{1}{||\nabla E_k||}+\frac{||\nabla E_k||}{\left( {\nabla E^T_k}{} \mathbf{d_k}\right) }\right) }^2+\frac{1}{||\nabla E_k||^2}}\\\le & {} {\frac{{||\mathbf{d_{k-1}}||}^2}{{\left( {\nabla E^T_{k-1}}{} \mathbf{d_{k-1}}\right) }^2}+\frac{1}{||\nabla E_k||^2}}. \end{aligned}$$

It follows from

$$\begin{aligned} \frac{{||\mathbf {d_{0}}||}^2}{{\left( {\nabla E^T_{0}}\mathbf {d_{0}}\right) }^2}=\frac{1}{||\nabla E_0||^2} \end{aligned}$$

and (59) that

$$\begin{aligned} \frac{||\mathbf{d_k}||^2}{{\left( {\nabla E^T_k}\mathbf{d_k}\right) }^2}\le \sum _{j=0}^k\frac{1}{||\nabla E_j||^2}\le \frac{k+1}{c^2}. \end{aligned}$$

Furthermore,

$$\begin{aligned} \frac{{\left( {\nabla E^T_k}{} \mathbf{d_k}\right) }^2}{||\mathbf{d_k}||^2}\ge \frac{c^2}{k+1}, \end{aligned}$$

which indicates that

$$\begin{aligned} \sum _{k=0}^\infty \frac{{\left( \nabla E_k^T\mathbf{d_k}\right) }^2}{||\mathbf{d_k}||^2}=+\, \infty . \end{aligned}$$

This contradicts the Zoutendijk condition (48). Therefore, the conclusion (30) must hold. \(\square \)

Derivation of (19) for the adaptive parameter \(\tilde{\mu }_k\):

The conjugate gradient direction (4) at the current epoch is set equal to the quasi-Newton direction (17):

$$\begin{aligned} \mathbf{d_k}=-\,\widehat{B}_{k}^{-1}\nabla E_k=-\, \nabla E_k+\beta _k^{MDY} \mathbf{d_{k-1}}. \end{aligned}$$

When \(\theta _{k-1}\le 0\), using (10), we have

$$\begin{aligned} -\,\widehat{B}_{k}^{-1}\nabla E_k=-\nabla E_k+\frac{\mu _k\Vert \nabla E_k\Vert ^2}{\mu _k {\mathbf{d}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T}{} \mathbf{d_{k-1}}}{} \mathbf{d_{k-1}}, \end{aligned}$$

i.e.

$$\begin{aligned} \nabla E_k=\widehat{B}_{k}\nabla E_k-\frac{\mu _k\Vert \nabla E_k\Vert ^2\widehat{B}_{k} \mathbf{s_{k-1}}}{\mu _k { \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T} \mathbf{s_{k-1}}}. \end{aligned}$$

Taking the inner product of the above equality with \(\mathbf{s_{k-1}}\), we have

$$\begin{aligned} \nabla E_{k}^T \mathbf{s_{k-1}}=\nabla E_{k}^{T}\widehat{B}_k \mathbf{s_{k-1}}-\frac{\mu _k\Vert \nabla E_k\Vert ^2}{\mu _k { \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T}\mathbf{s_{k-1}}}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}, \end{aligned}$$

By the modified secant condition (6), we have

$$\begin{aligned} \nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-\nabla E_{k}^{T}\mathbf{s_{k-1}}=\frac{\mu _k\Vert \nabla E_k\Vert ^2}{\mu _k { \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T}\mathbf{s_{k-1}}}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}, \end{aligned}$$

The adaptive parameter \(\mu _k\) can be computed by

$$\begin{aligned} \mu _k=\frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{ \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}\mathbf{s_{k-1}}{\mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}}. \end{aligned}$$

When \(\theta _{k-1}>0\), similar to the case of \(\theta _{k-1}\le 0\), the adaptive parameter \(\mu _k\) can be computed by

$$\begin{aligned} \mu _k= \frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{ \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}{ \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}\mathbf{s_{k-1}}+\vartheta _k}, \end{aligned}$$

where \(\vartheta _k=3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{ \hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\).

In summary, the adaptive parameter \(\mu _k\) can be computed by the following formula:

$$\begin{aligned} \tilde{\mu }_k= {\left\{ \begin{array}{ll} \frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{\mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}{ \mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}}, &{}\text {if} \quad \theta _{k-1}\le 0,\\ \frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E^{T}_{k}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}\mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{ \mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}{ \mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}+\vartheta _k}, &{}\text {if} \quad \theta _{k-1}> 0, \end{array}\right. } \end{aligned}$$

where \(\vartheta _k=3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}\mathbf{s_{k-1}}\).
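
To make the formula above concrete, a direct transcription into code might look as follows. This is an illustration only: the variable names are assumptions, and \(\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}\), \(l_{k-1}\) and \(\theta _{k-1}\) denote the quantities defined in the body of the paper (not reproduced here).

```python
# Illustration only: computing the adaptive parameter mu_tilde_k of (19).
# grad_k, grad_km1 : the gradients grad E_k and grad E_{k-1}
# s_km1, y_hat_km1 : s_{k-1} and the modified difference y_hat_{k-1}
# l_km1, theta_km1 : the scalars l_{k-1} and theta_{k-1} from the paper
import numpy as np

def mu_tilde(grad_k, grad_km1, s_km1, y_hat_km1, l_km1, theta_km1):
    gk_yhat = grad_k @ y_hat_km1
    gk_s = grad_k @ s_km1
    gkm1_s = grad_km1 @ s_km1
    yhat_s = y_hat_km1 @ s_km1
    num = gk_yhat * gkm1_s - gk_s * gkm1_s
    den = gk_yhat * gk_s - gk_s * gk_s - (grad_k @ grad_k) * yhat_s
    if theta_km1 > 0:
        # extra term vartheta_k = 3 l_{k-1} theta_{k-1} (grad_k^T y_hat - grad_k^T s)
        den += 3.0 * l_km1 * theta_km1 * (gk_yhat - gk_s)
    return num / den

def mu_k(mu_tilde_k):
    # As used in the proofs above: when |mu_tilde_k| > 1, Sub-algorithm MDY
    # takes mu_k = 1 (and resets l_{k-1} to 0); otherwise mu_k = mu_tilde_k.
    return mu_tilde_k if abs(mu_tilde_k) <= 1.0 else 1.0
```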

Cite this article

Li, W., Liu, Y., Yang, J. et al. A New Conjugate Gradient Method with Smoothing \(L_{1/2} \) Regularization Based on a Modified Secant Equation for Training Neural Networks. Neural Process Lett 48, 955–978 (2018). https://doi.org/10.1007/s11063-017-9737-9
