A New Conjugate Gradient Method with Smoothing $$L_{1/2} $$ Regularization Based on a Modified Secant Equation for Training Neural Networks

Li, Wenyu; Liu, Yan; Yang, Jie; Wu, Wei

doi:10.1007/s11063-017-9737-9

A New Conjugate Gradient Method with Smoothing $L_{1/2} $ Regularization Based on a Modified Secant Equation for Training Neural Networks

Published: 21 November 2017

Volume 48, pages 955–978, (2018)
Cite this article

Neural Processing Letters Aims and scope Submit manuscript

Wenyu Li^1,2,
Yan Liu³,
Jie Yang¹ &
…
Wei Wu¹

392 Accesses
9 Citations
Explore all metrics

Abstract

Proposed in this paper is a new conjugate gradient method with smoothing $L_{1/2} $ regularization based on a modified secant equation for training neural networks, where a descent search direction is generated by selecting an adaptive learning rate based on the strong Wolfe conditions. Two adaptive parameters are introduced such that the new training method possesses both quasi-Newton property and sufficient descent property. As shown in the numerical experiments for five benchmark classification problems from UCI repository, compared with the other conjugate gradient training algorithms, the new training algorithm has roughly the same or even better learning capacity, but significantly better generalization capacity and network sparsity. Under mild assumptions, a global convergence result of the proposed training method is also proved.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Two Improved Nonlinear Conjugate Gradient Methods with the Strong Wolfe Line Search

Article 15 October 2021

A conjugate gradient method with sufficient descent property

Article 12 December 2014

An Improved Conjugate Gradient Neural Networks Based on a Generalized Armijo Search Method

References

Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford
MATH Google Scholar
Hmich A, Badri A, Sahel A (2011) Automatic speaker identification by using the neural network. In: Proceeding of 2011 IEEE international conference on multimedia computing and systems (ICMCS), pp 1–5
Zhou W, Zurada JM (2010) Competitive layer model of discrete-time recurrent neural networks with LT neurons. Neural Comput 22:2137–2160
Article MathSciNet Google Scholar
Li EY (1994) Artificial neural networks and their business applications. Inf Manag 27:303–313
Article Google Scholar
Svozil D, Kvasnicka V, Pospichal J (1997) Introduction to multi-layer feed-forward neural networks. Chemom Intell Lab 39(1):43–62
Article Google Scholar
Wu W (2003) Computation of neural networks. Higher Education Press, Beijing
Google Scholar
Xu ZB, Chang XY, Xu FM, Zhang H (2012) $L_{1/2}$ regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Netw Learn Syst 23(7):1013–1027
Article Google Scholar
Wu W, Fan QW, Zurada JM, Wang J, Yang DK, Liu Y (2014) Batch gradient method with smoothing $L_{1/2}$ regularization for training of feedforward neural networks. Neural Netw 50:72–78
Article Google Scholar
Reed R (1993) Pruning algorithms a survey. IEEE Trans Neural Netw 4:740–747
Article Google Scholar
Sakar A, Mammone RJ (1993) Growing and pruning neural tree networks. IEEE Trans Comput 42(3):291–299
Article Google Scholar
Arribas JI, Cid-Sueiro J (2005) A model selection algorithm for a posteriori probability estimation with neural networks. IEEE Trans Neural Netw 16(4):799–809
Article Google Scholar
Hinton G (1989) Connectionist learning procedures. Artif Intell 40:185–235
Article Google Scholar
Hertz J, Krogh A, Palmer R (1991) Introduction to the theory of neural computation. Addison Wesley, Redwood City
Google Scholar
Wang C, Venkatesh SS, Judd JS (1994) Optimal stopping and effective machine complexity in learning. Adv Neural Inf Process Syst 6:303–310
Google Scholar
Bishop CM (1995) Regularization and complexity control in feedforward networks. In: Proceedings of international conference on artificial neural networks ICANN’95. EC2 et Cie, pp 141–148
Nowlan SJ, Hinton GE (1992) Simplifying neural networks by soft weight-sharing. Neural Comput 4(4):473–493
Article Google Scholar
Fogel DB (1991) An information criterion for optimal neural network selection. IEEE Trans Neural Netw 2(2):490–7
Article Google Scholar
Seghouane AK, Amari SI (2007) The AIC criterion and symmetrizing the Kullback–Leibler divergence. IEEE Trans Neural Netw 18(1):97–106
Article Google Scholar
Ishikawa M (1996) Structural learning with forgetting. Neural Netw 9:509–521
Article Google Scholar
Mc Loone S, Irwin G (2001) Improving neural network training solutions using regularisation. Neurocomputing 37:71–90
Article Google Scholar
Saito K, Nakano R (2000) Second-order learning algorithm with squared penalty term. Neural Comput 12:709–729
Article Google Scholar
Wu W, Shao HM, Li ZX (2006) Convergence of batch BP algorithm with penalty for FNN training. In: King I, Wang J, Chan L, Wang DL (eds) Neural information processing. Springer, Berlin, pp 562–569
Chapter Google Scholar
Cid-Sueiro J, Arribas JI, Urbn-Muñoz S, Figueiras-Vidal AR (1999) Cost functions to estimate a posteriori probabilities in multiclass problems. IEEE Trans Neural Netw 10(3):645–656
Article Google Scholar
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
MathSciNet MATH Google Scholar
Natarajan BK (1995) Sparse approximate solutions to linear systems. Siam J Sci Comput 24:227–234
Article MathSciNet Google Scholar
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58:267–288
MathSciNet MATH Google Scholar
Yuan GX, Ho CH, Lin CJ (2012) Recent advances of large-scale linear classification. Proc IEEE 100(9):2584–2603
Article Google Scholar
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Methodol) (Stat Methodol) 67:301–320
Article MathSciNet Google Scholar
Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, Heidelberg
MATH Google Scholar
Donoho DL (2006) For most large underdetermined systems of linear equations the minimal $l_1$-norm solution is also the sparsest solution. Commun Pure Appl Math 59(6):797–829
Article MathSciNet Google Scholar
Xu ZB, Zhang H, Wang Y, Chang XY (2010) $L_{1/2}$ regularizer. Sci China Ser F Inf Sci 52:1–9
Google Scholar
Igel C, Hüsken M (2003) Empirical evaluation of the improved Rprop learning algorithms. Neurocomputing 50:105–123
Article Google Scholar
Rumelhart DE, McClelland JL, PDP Research Group (1986) Parallel distributed processing: explorations in the microstructure of cognition: psychological and biological models. MIT Press, Cambridge
Vogl TP, Mangis JK, Rigler AK, Zink WT, Alkon DL (1988) Accelerating the convergence of the back-propagation method. Biol Cybern 59:257–263
Article Google Scholar
Shao H, Zheng G (2011) Convergence analysis of a back-propagation algorithm with adaptive momentum. Neurocomputing 74:749–752
Article Google Scholar
Sun W, Yuan Y (2006) Optimization theory and methods nonlinear programming. Springer, New York
MATH Google Scholar
Livieris IE, Pintelas P (2013) A new conjugate gradient algorithm for training neural networks based on a modified secant equation. Appl Math Comput 221:491–502
MathSciNet MATH Google Scholar
Jiang M, Gielen G, Zhang B, Luo Z (2003) Fast learning algorithms for feedforward neural networks. Appl Intell 18:37–54
Article Google Scholar
Battiti R (1989) Accelerated backpropagation learning: two optimization methods. Complex Syst 3(4):331–342
MATH Google Scholar
Battiti R (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Comput 4(2):141–166
Article Google Scholar
Johansson EM, Dowla FU, Goodman DM (1991) Backpropagation learning for multilayer feed-forward neural networks using the conjugate gradient method. Int J Neural Syst 2(4):291–301
Article Google Scholar
Charalambous C (1992) Conjugate gradient algorithm for efficient training of artificial neural networks. Circuits Device Syst IEE Proc G 139(3):301–310
Article Google Scholar
Adeli H, Hung SL (1994) An adaptive conjugate gradient learning algorithm for efficient training of neural networks. Appl Math Comput 62:81–102
MATH Google Scholar
Moller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6:525–533
Article Google Scholar
Kostopoulos AE, Grapsa TN (2009) Self-scaled conjugate gradient training algorithms. Neurocomputing 72:3000–3019
Article Google Scholar
Barzilai J, Borwein JM (1988) Two point step size gradient methods. IMA J Numer Anal 8:141–148
Article MathSciNet Google Scholar
Wang J, Wu W, Zurada JM (2011) Deterministic convergence of conjugate gradient method for feedforward neural networks. Neurocomputing 74:2368–2376
Article Google Scholar
Dai YH, Yuan Y (1999) A nonlinear conjugate gradient method with a strong global convergence property. Siam J Optim 10:177–182
Article MathSciNet Google Scholar
Xu C, Zhang J (2001) A survey of quasi-Newton equations and quasi-Newton methods for optimization. Ann Oper Res 103:213–234
Article MathSciNet Google Scholar
Yabe H, Sakaiwa N (2005) A new nonlinear conjugate gradient method for unconstrained optimization. J Oper Res Soc Jpn Keiei Kagaku 48:284–296
Article MathSciNet Google Scholar
Li WY, Wu W (2015) A parameter conjugate gradient method based on secant equation for unconstrained optimization. J Inf Comput Sci 12(16):5865–5871
Article Google Scholar
Fan Q, Zurada JM, Wu W (2014) Convergence of online gradient method for feedforward neural networks with smoothing $L_{1/2}$ regularization penalty. Neurocomputing 131:208–216
Article Google Scholar
Liu Y, Wu W, Fan Q, Yang D, Wang J (2014) A modified gradient learning algorithm with smoothing $L_{1/2}$ regularization for Takagi–Sugeno fuzzy models. Neurocomputing 138:229–237
Article Google Scholar
Dai YH, Yuan Y (2001) An efficient hybrid conjugate gradient method for unconstrained optimization. Ann Oper Res 103:33–47
Article MathSciNet Google Scholar
http://www.cs.toronto.edu/~hinton
Blake C, Merz C (1998) UCI repository of machine leaning database. http://www.ics.uci.edu/mlearn/MLReposito

Download references

Acknowledgements

The authors thank the editor and the anonymous reviewers for their careful reading and thoughtful comments. This research is supported by the National Natural Science Foundation of China (Projects 11171367, 91230103, 61473059 and 61403056), and the Fundamental Research Funds for the Central Universities of China.

Author information

Authors and Affiliations

School of Mathematical Sciences, Dalian University of Technology, Dalian, 116024, China
Wenyu Li, Jie Yang & Wei Wu
School of Mathematics and Statistics, Beihua University, Jilin, 132013, China
Wenyu Li
School of Information Science and Engineering, Dalian Polytechnic University, Dalian, 116034, China
Yan Liu

Authors

Wenyu Li
View author publications
You can also search for this author in PubMed Google Scholar
Yan Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jie Yang
View author publications
You can also search for this author in PubMed Google Scholar
Wei Wu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wei Wu.

Appendices

Appendix

Lemma 1

Suppose that assumptions (A1) and (A2) hold, then $\nabla E(\mathbf {W})$ is Lipschitz continuous in a neighborhood S of level set $\varGamma ,$ i.e. there exists a constant $L\ge 0$ such that

$$\begin{aligned} ||\nabla E(\mathbf{W^1})-\nabla E(\mathbf{W^2})||\le L||\mathbf{W^1}-\mathbf{W^2}||, \forall ~ {\mathbf{W^1}, \mathbf{W^2}}\in S. \end{aligned}$$

(31)

Proof

Note that $\mathbf{G}(\mathbf{U}{\varvec{\zeta }}^{j})=(\mathbf{G}({ \mathbf{w}^{T}_\mathbf{1}}{\varvec{\zeta }}^{j}),\mathbf{G}({ \mathbf{w}^{T}_\mathbf{2}}{\varvec{\zeta }}^{j}),\ldots ,\mathbf{G}({ \mathbf{w}^{T}_\mathbf{q}}{\varvec{\zeta }}^{j}))^T. $ By Lagrange mean value theorem and Cauchy–Buniakowsky–Schwarz inequality, we have

$$\begin{aligned} \begin{array}{llll} \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert &{}=\left( \sum _{i=1}^{q}\left( \mathbf{G}\left( {\mathbf{w^1_i}}^T{\varvec{\zeta }^j}\right) -\mathbf{G}\left( {\mathbf{w^2_i}}^T{\varvec{\zeta }^j}\right) \right) ^2\right) ^\frac{1}{2}\\ &{}\le \left( \sum _{i=1}^{q}(g'(\xi _i))^2\left( {\mathbf{w^1_i}}^T{\varvec{\zeta }^j}-{\mathbf{w^2_i}}^T{\varvec{\zeta }^j}\right) ^2\right) ^\frac{1}{2}\\ &{}\le M\Vert {\varvec{\zeta }^j}\Vert \sum _{i=1}^{q}\left( \Vert \mathbf{w^1_i}-\mathbf{w^2_i}\Vert ^2\right) ^\frac{1}{2}\\ &{}\le M\Vert {\varvec{\zeta }^j}\Vert \sum _{i=1}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| , \end{array} \end{aligned}$$

(32)

where $\xi _i\in ({\mathbf{w^1_i}}^T{\varvec{\zeta }^j}, {\mathbf{w^2_i}}^T{\varvec{\zeta }^j}), i=1, 2, \ldots , q.$

From the property of the activation functions, we know that $\mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}), \mathbf{h'_j}(\mathbf{w_0}\cdot \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j}))$ and $ g'(\mathbf{w_i}\cdot {\varvec{\zeta }^j})$ are bounded in S, where $j=1, 2, \ldots , J$ and $ i=1, 2, \ldots , q$. This means that there exists a constant M such that $\Vert \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j})\Vert \le M, \Vert \mathbf{h'_j}(\mathbf{w_0}\cdot \mathbf{G}(\mathbf{U}{\varvec{\zeta }^j})\Vert \le M ~\text{ and }~ |g'(\mathbf{w_i}\cdot {\varvec{\zeta }^j})|\le M.$ Under the Assumptions (A1) and (A2), by (32), we have

$$\begin{aligned}&\left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right\| \nonumber \\&\quad \le \left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) -\mathbf{h'_j}(\mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j}))\right\| \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \nonumber \\&\qquad + \, \left| \mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \right| \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert \nonumber \\&\quad \le L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \left\| \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right\| \nonumber \\&\qquad + \, M\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert \nonumber \\&\quad \le L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert ^2\left\| \mathbf{w^1_0}-\mathbf{w^2_0}\right\| \nonumber \\&\qquad + \, \left( L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \left\| \mathbf{w^2_0}\right\| +M\right) \Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\Vert \nonumber \\&\quad \le L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert ^2\left\| \mathbf{w^1_0}-\mathbf{w^2_0}\right\| \nonumber \\&\qquad + \, MN_1\Vert {\varvec{\zeta }^j}\Vert \sum _{i=1}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| \nonumber \\&\quad \le \max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}))\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}\left( \left\| \mathbf{w^1_0}-\mathbf{w^2_0}\right\| +\sum _{i=1}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| \right) \nonumber \\&\quad \le \sqrt{q+1}\max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}))\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}\left( \sum _{i=0}^{q}\left\| \mathbf{w^1_i}-\mathbf{w^2_i}\right\| ^2\right) ^\frac{1}{2}\nonumber \\&\quad \le L_3\Vert \mathbf{W^1}-\mathbf{W^2}\Vert , \end{aligned}$$

(33)

where $N_1=L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\Vert \Vert \mathbf{w^2_0}\Vert +M, L_3=\sqrt{q+1}\max \{L_1\Vert \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}))\Vert ^2, MN_1\Vert {\varvec{\zeta }^j}\Vert \}.$

Using (33), we have

$$\begin{aligned} \begin{array}{ll} &{}\Vert \nabla \bar{E}(\mathbf{W^1})_{\mathbf{w_0}}-\nabla \bar{E}(\mathbf{W^2})_{\mathbf{w_0}}\Vert \\ &{}\quad =\sum _{j=1}^{J}\left( \left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})-\mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right\| \right) \\ &{}\quad \le JL_3\Vert \mathbf{W^1}-\mathbf{W^2}\Vert . \end{array} \end{aligned}$$

(34)

Similarly we have for some constant $L_4 $ that

$$\begin{aligned} \begin{array}{llll} \Vert \nabla \bar{E}(\mathbf{W^1})_{\mathbf{w_i}}-\nabla \bar{E}(\mathbf{W^2})_{\mathbf{w_i}}\Vert &{}=\sum _{j=1}^{J}\left( \left\| \mathbf{h'_j}\left( \mathbf{w^1_0}\cdot \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j})\right) \mathbf{w^1_{i0}} \mathbf{G}(\mathbf{U^1}{\varvec{\zeta }^j}){\varvec{\zeta }^j}\right. \right. \\ &{}\quad \left. \left. - \, \mathbf{h'_j}\left( \mathbf{w^2_0}\cdot \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j})\right) \mathbf{w^2_{i0}} \mathbf{G}(\mathbf{U^2}{\varvec{\zeta }^j}){\varvec{\zeta }^j}\right\| \right) \\ &{}\le JL_4\Vert \mathbf{W^1}-\mathbf{W^2}\Vert , \end{array} \end{aligned}$$

(35)

where $i=1, 2, \ldots , q.$

By (14) and Lagrange mean value theorem, when the positive constant $\nu $ is sufficiently small, we have for any $\mathbf {x}$ and $\mathbf {y}$ that

$$\begin{aligned} \begin{array}{llllll} \left| \frac{f'(\mathbf{x})}{2f(\mathbf{x})^{\frac{1}{2}}}-\frac{f'(\mathbf{y})}{2f(\mathbf{y})^{\frac{1}{2}}}\right| &{}=\left| \frac{1}{2f(\mathbf{x})^{\frac{1}{2}}}-\frac{1}{2f(\mathbf{y})^{\frac{1}{2}}}\right| |f'(\mathbf{x})|+\left| \frac{1}{2f(\mathbf{y})^{\frac{1}{2}}}(f'(\mathbf{x})-f'(\mathbf{y}))\right| \\ &{}\le \left| \frac{1}{2f(\mathbf{x})^{\frac{1}{2}}}-\frac{1}{2f(\mathbf{y})^{\frac{1}{2}}}\right| +\frac{1}{2\frac{3}{8}\nu }\frac{3}{2\nu }|\mathbf{x}-\mathbf{y}|\\ &{}\le \left| \frac{|f(\mathbf{y})^{\frac{1}{2}}-f(\mathbf{x})^{\frac{1}{2}}|}{2|f(\mathbf{x})^{\frac{1}{2}}f(\mathbf{y})^{\frac{1}{2}}|}\right| +\frac{2}{\nu ^2}|\mathbf{x}-\mathbf{y}|\\ &{}=\frac{|f(\mathbf{y})-f(\mathbf{x})|}{2|(f(\mathbf{y})^{\frac{1}{2}}+f(\mathbf{x})^{\frac{1}{2}})f(\mathbf{x})^{\frac{1}{2}}f(\mathbf{y})^{\frac{1}{2}}|}+\frac{2}{\nu ^2}|\mathbf{x}-\mathbf{y}|\\ &{}\le \frac{1}{4\frac{3}{8}\nu \sqrt{\frac{3}{8}\nu }}|\mathbf{x}-\mathbf{y}|+\frac{2}{\nu ^2}|\mathbf{x}-\mathbf{y}|\\ &{}\le L_5|\mathbf{x}-\mathbf{y}|, \end{array} \end{aligned}$$

(36)

where $L_5=(\frac{2}{\nu ^2}+\frac{4}{9\nu ^2}\sqrt{6\nu }).$

Using (36), we have

$$\begin{aligned} \begin{array}{lll} \Vert \nabla F(\mathbf{{W^1}})-\nabla F(\mathbf {W^2})\Vert &{}=\left( \sum _{i=1}^{q}\sum _{k=0}^p \left( \frac{f'(\mathbf{w^1_{ik}})}{2f(\mathbf{w^1_{ik}})^{\frac{1}{2}}}-\frac{f'(\mathbf{w^2_{ik}})}{2f(\mathbf{w^2_{ik}})^{\frac{1}{2}}}\right) ^2\right) ^{\frac{1}{2}}\\ &{}\le q(1+p)L_5||\mathbf{W^1}-\mathbf{W^2}||. \end{array} \end{aligned}$$

(37)

From (34), (35) and (37), we have

$$\begin{aligned} \begin{array}{llll} ||\nabla E(\mathbf{W^1})-\nabla E(\mathbf{W^2})||&{}\le \Vert \nabla \bar{E}(\mathbf{W^1})_\mathbf{w_0}-\nabla \bar{E}(\mathbf{W^2})_\mathbf{w_0}\Vert \\ &{}\quad + \, \sum _{i=1}^q\Vert \nabla \bar{E}(\mathbf{W^1})_\mathbf{w_i}-\nabla \bar{E}(\mathbf{W^2})_\mathbf{w_i}\Vert \\ &{}\quad + \, \Vert \nabla F(\mathbf{W^1})-\nabla F(\mathbf{W^2})\Vert \\ &{}\le L\Vert \mathbf{W^1}-\mathbf{W^2}\Vert , \end{array} \end{aligned}$$

(38)

where the Lipschitz constant $L=JL_3+qJL_4+q(1+p)L_5.$ This completes the proof. $\square $

The following lemma shows that our training algorithm always generates descent directions.

Lemma 2

Let the sequence $\{W_k\}$ be generated by Algorithm MDY. Then the following estimate holds for all $k\ge 0$:

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}<0 \end{aligned}$$

(39)

Proof

Let us prove the Lemma by an induction argument. It is obvious that $\nabla E_k^T\mathbf{d_k}=-||\nabla E_k||^2<0$ holds for $k=0$.

By Sub-algorithm MDY, there are three cases for the formula $\beta _k$: $\beta _k=\beta _k^{MDY}$, $\beta _k=\beta _k^{DY}$, and $\beta _k=|\beta _k^{HS}|$.

For the first case $\beta _k=\beta _k^{MDY}$. Assume that (39) holds for some $k-1$. By (24), (25), and $-1\le \mu _k\le 1 $ by Sub-algorithm MDY, we have

$$\begin{aligned} \begin{array}{ll} \mu _k\nabla E^T_k\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}} &{}\ge \mu _k\sigma {\nabla E^T_{k-1}}{} \mathbf{d_{k-1}}-{\nabla E^T_{k-1}} \mathbf{d_{k-1}}\\ &{}={(\mu _k\sigma -1)}{\nabla E^T_{k-1}}{} \mathbf{d_{k-1}}>0. \end{array} \end{aligned}$$

(40)

From Sub-algorithm MDY, we can easily conclude that the equality $\mu _k l_{k-1}\ge 0$. Then, using(27), (28), (29) and (40), we have

$$\begin{aligned} \begin{array}{lllll} \nabla E_k^T\mathbf{d_k}&{}=-||\nabla E_k||^2+\frac{\mu _k ||\nabla E_k||^2}{\mu _k\varrho _{k-1}+(\mu _k-1)\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\nabla E_{k}}^T\mathbf{d_{k-1}}\\ &{}=-||\nabla E_k||^2+\frac{\mu _k ||\nabla E_k||^2}{\mu _k({ \mathbf{y}^{T}_{\mathbf{k}-\mathbf{1}}}^T \mathbf{d_{k-1}}+\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\})+(\mu _k-1)\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\nabla E_{k}}^T\mathbf{d_{k-1}} \\ &{}= ||\nabla E_k||^2 \frac{-\mu _k {\mathbf{y}^{T}_{\mathbf{k}-\mathbf{1}}}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}-\mu _k \nabla E^T_{k-1}{} \mathbf{d_{k-1}}+\nabla E^T_{k-1}{} \mathbf{d_{k-1}}+\mu _k \nabla E_k^T\mathbf{d_{k-1}}}{\mu _k {\mathbf{y}^{T}_{\mathbf{k}-\mathbf{1}}}\mathbf{d_{k-1}}+(\mu _k-1)\nabla E^T_{k-1}\mathbf{d_{k-1}}+\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}\\ &{}= ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}\mathbf{d_{k-1}}+\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}<0.\\ \end{array}\nonumber \\ \end{aligned}$$

(41)

For the second case $\beta _k=\beta _k^{DY}$. By (24), (27) and Sub-algorithm MDY, we have

$$\begin{aligned} \begin{array}{ll} \nabla E_k^T\mathbf{d_k} &{}= - \, \Vert \nabla E_k\Vert ^2+\frac{\Vert \nabla E_k\Vert ^2\nabla E_k^T\mathbf{d_{k-1}}}{\nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}}\\ &{}=\frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}}\Vert \nabla E_k\Vert ^2<0. \end{array} \end{aligned}$$

For the third case $\beta _k=|\beta _k^{HS}|$. By Sub-algorithm MDY, we have by induction that

$$\begin{aligned} \left| \beta _k^{HS}\right| \le \beta _k^{DY}. \end{aligned}$$

Furthermore, we have by (40) that

$$\begin{aligned} \left| \beta _k^{HS}\right| \nabla E_k^T\mathbf{d_{k-1}}<\left| \beta _k^{HS}\right| \left( \nabla E_k^T\mathbf{d_{k-1}}-\nabla E_{k-1}^T\mathbf{d_{k-1}}\right) \le \Vert \nabla E_k\Vert ^2. \end{aligned}$$

i.e.

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}=- \, \Vert \nabla E_k\Vert ^2+\left| \beta _k^{HS}\right| \nabla E_k^T\mathbf{d_{k-1}}<0. \end{aligned}$$

This completes the proof. $\square $

Lemma 3

Suppose that the sequence $\{{\mathbf{W}_\mathbf{k}}\}$ is generated by Algorithm MDY, and $\{\mu _k\}$ and $\{l_{k-1}\}$ are generated by Sub-algorithm MDY. Then, for sufficiently small $\sigma $, the sufficient descent condition

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}<- \, \kappa \Vert \nabla E_k\Vert ^2. \end{aligned}$$

(42)

holds for $\kappa =\frac{1}{1+\sigma }.$

Proof

From (41)

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}= ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}+\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}}. \end{aligned}$$

(43)

Let us consider the following three cases (A), (B) and (C).

(A) The case $\theta _{k-1}\le 0$ and $|\tilde{\mu }_k|\le 1$.

(A-1) For the sub-case $\theta _{k-1}\le 0$ and $0\le \tilde{\mu }_k\le 1$, the Equality (43) reduces to

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}= ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}\mathbf{d_{k-1}}}. \end{aligned}$$

(44)

By (25), we have

$$\begin{aligned} \sigma \nabla E_k^T \mathbf{d_k}\le \nabla E(\mathbf{{W}}_k+\eta _k\mathbf{d_k})^T\mathbf{d_k} \le -\sigma \nabla E_k^T \mathbf{d_k}. \end{aligned}$$

(45)

Using Lemma 2, (44) and (45),

$$\begin{aligned} \nabla E_k^T\mathbf{d_k}\le -\kappa \Vert \nabla E_k\Vert ^2, \end{aligned}$$

(46)

where $\kappa =\frac{1}{1+\sigma }$.

(A-2) For the sub-case $\theta _{k-1}\le 0$ and $-1\le \tilde{\mu }_k<0$, (42) holds by a similar proof as that for (A-1).

(B) The case $\theta _{k-1}> 0$ and $|\tilde{\mu }_k|\le 1$.

(B-1) For the sub-case $l_{k-1}=-\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\ge 0$ and $0\le \tilde{\mu }_k\le 1$, by (41) we have

$$\begin{aligned} \nabla E_k^T\mathbf{d_k} = ||\nabla E_k||^2 \frac{\nabla E_{k-1}^T\mathbf{d_{k-1}}+{3\mu _k\theta _{k-1}\nabla E_{k-1}^T\mathbf{d_{k-1}}}}{\mu _k \nabla E_k^T\mathbf{d_{k-1}}-\nabla E^T_{k-1}{} \mathbf{d_{k-1}}-{3\mu _k\theta _{k-1}\nabla E_{k-1}^T\mathbf{d_{k-1}}}}. \end{aligned}$$

(47)

Then, (42) holds by noticing (45) and (47),.

(B-2) For the sub-case $l_{k-1}=\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\le 0$ and $-1\le \tilde{\mu }_k\le 0$, the result (42) holds by a similar proof as in (B-1).

(B-3) For the sub-case $l_{k-1}=0$ and $|\tilde{\mu }_k|\le 1$, the proof is similar as in (A).

(C) For the case $|\tilde{\mu }_k|>1$, we have $\mu _k=1, l_{k-1}=0$. Then $\beta _k=\beta _k^{DY}$ or $\beta _k=|\beta _k^{HS}|$. If $\beta _k=\beta _k^{DY}$, the conclusion (42) holds by (25). If $\beta _k=|\beta _k^{HS}|$, by (25), we have for sufficiently small $\sigma $ that

$$\begin{aligned} \begin{array}{ll} \nabla E_k^T\mathbf{d_k}&{}=- \, \Vert \nabla E_k\Vert ^2+\left| \beta _k^{HS}\right| \nabla E_k^T\mathbf{d_{k-1}}<0\\ &{}\le - \, \Vert \nabla E_k\Vert ^2-\left| \beta _k^{HS}\right| \sigma \nabla E_{k-1}^T\mathbf{d_{k-1}}\\ &{}\le - \, \frac{1}{1+\sigma }\Vert \nabla E_k\Vert ^2. \end{array} \end{aligned}$$

This completes the proof. $\square $

The following Lemma 4 can be directly proved by using Lemmas 1 and 2. Actually, the proof is similar to that for Lemma 3.2 in [48].

Lemma 4

Suppose that Assumptions (A1) and (A2) hold, the sequence $\{{\mathbf{W}_\mathbf{k}}\}$ is generated by (26) such that $\mathbf{d_k}$ satisfies $\nabla E_k^T\mathbf{d_k}<0$, and $\eta _k$ is generated by the Wolfe conditions (24) and (25). Then the following Zoutendijk condition holds

$$\begin{aligned} \sum _{k=0}^{+\infty }\frac{{\left( \nabla E_k^T\mathbf{d_k}\right) }^2}{||\mathbf{d_k}||^2}<+ \, \infty . \end{aligned}$$

(48)

Lemma 5

Let the sequence $\{W_k\}$ be generated by Algorithm MDY. Then

$$\begin{aligned} |\beta _k|\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$

(49)

Proof

By (28) and (27), we have

$$\begin{aligned} {||\mathbf{d_k}||}^2={\left( \beta _k^{MDY}\right) }^2{||\mathbf{d_{k-1}}||}^2-2{\nabla E^T_k}{} \mathbf{d_k}-||\nabla E_k||^2. \end{aligned}$$

(50)

By (28), (29) and (40), we have

$$\begin{aligned} \begin{array}{ll} \beta _{k+1}^{MDY}&{}=\frac{\mu _k ||\nabla E_{k+1}||^2}{\mu _k\varrho _{k}+(\mu _k-1)\nabla E_{k}^T\mathbf {d_{k}}}\\ &{}=\frac{\mu _k ||\nabla E_{k+1}||^2}{\mu _k\left( {\mathbf{d}^{T}_\mathbf{k}}\mathbf{y_k}+\frac{3l_k}{\eta _k}\max \{\theta _k,0\}\right) +(\mu _k-1)\nabla E_{k}^T\mathbf {d_{k}}}\\ &{}=\frac{\mu _k ||\nabla E_{k+1}||^2}{\mu _k\nabla E^T_k\mathbf{d_{k-1}}-\nabla E^T_{k-1}\mathbf{d_{k-1}}+\mu _k\frac{3l_k}{\eta _k}\max \{\theta _k,0\})}.\\ \end{array} \end{aligned}$$

(51)

By (41) and (51), we have

$$\begin{aligned} \begin{array}{lll} \mu _k\nabla E_k^T\mathbf{d_k}&{}=\frac{\mu _k||\nabla E_k||^2\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}\right) }{\mu _k\varrho _{k-1}+(\mu _k-1)\nabla E_{k-1}^T\mathbf{d_{k-1}}}\\ &{}=\beta _k^{MDY}\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_{k-1}}{\eta _{k-1}}\max \{\theta _{k-1},0\}\right) .\\ \end{array} \end{aligned}$$

(52)

Next, we consider the following three cases (i)–(iii) to prove (49).

(i)
For the case $\theta _{k-1}\le 0$ and $|\tilde{\mu }_k|\le 1$, the Equality (52) reduces to
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}. \end{aligned}$$
Furthermore, since $|\tilde{\mu }_k|\le 1$ we have
$$\begin{aligned} \left| \beta _k^{MDY}\right| =\left| \mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\right| \le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
(53)
(ii)
For the case $\theta _{k-1}> 0$ and $|\tilde{\mu }_k|\le 1$, the Equality (52) reduces to
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}-\mu _k\frac{3l_k}{\eta _k}\theta _{k-1}\right) , \end{aligned}$$
(54)
Now, we consider the following three sub-cases: The first sub-case $l_{k-1}=-\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\ge 0$ and $0\le \tilde{\mu }_k\le 1$. Then, from (54) we have
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}(1+3\mu _k\theta _{k-1}). \end{aligned}$$
Furthermore,
$$\begin{aligned} \left| \beta _k^{MDY}\right| =\mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\frac{1}{1+3\mu _k\theta _{k-1}}\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
(55)
The second sub-case $l_{k-1}=\eta _{k-1}\nabla E_{k-1}^T \mathbf{d_{k-1}}\le 0$ and $-1\le \tilde{\mu }_k\le 0$. From (54) we have
$$\begin{aligned} \mu _k\nabla E_k^T\mathbf{d_k}=\beta _k^{MDY}\nabla E_{k-1}^T\mathbf{d_{k-1}}(1-3\mu _k\theta _{k-1}), \end{aligned}$$
and thus
$$\begin{aligned} \left| \beta _k^{MDY}\right| =\mu _k\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}\frac{1}{1-3\mu _k\theta _{k-1}}\le \frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
(56)
The third sub-case $l_{k-1}=0$ and $|\tilde{\mu }_k|\le 1$. The proof here is similar as in (i).
(iii)
For the case $|\tilde{\mu }_k|>1$, we have $\mu _k=1$ and $ l_{k-1}=0$. Then, $\beta _k=\beta _k^{DY}$ or $\beta _k=|\beta _k^{HS}|$. For the case $\beta _k=\beta _k^{DY}$, it is easy to prove that the equality
$$\begin{aligned} \beta _k^{DY}=\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}} \end{aligned}$$
(57)
holds. For the case $\beta _k=|\beta _k^{HS}|$, by Sub-algorithm MDY,
$$\begin{aligned} \left| \beta _k^{HS}\right| \le \beta _k^{DY}=\frac{\nabla E_k^T\mathbf{d_k}}{\nabla E_{k-1}^T\mathbf{d_{k-1}}}. \end{aligned}$$
(58)
Finally, (49) holds by combing (53), (55), (56), (57) and (58).

$\square $

Proof of Theorem 2

Proof

We will prove the theorem by contradiction. If the theorem is not true, there exists a constant $c>0$ such that

$$\begin{aligned} ||\nabla E_k||\ge c, ~k=0,1,2,\ldots \end{aligned}$$

(59)

By Lemma 5, we have

$$\begin{aligned} |\beta _k|^2\le \frac{\left( \nabla E_k^T\mathbf{d_k}\right) ^2}{\left( \nabla E_{k-1}^T\mathbf{d_{k-1}}\right) ^2}. \end{aligned}$$

(60)

By (50), (60) and (51), we have

$$\begin{aligned} \frac{||\mathbf{d_k}||^2}{\left( {\nabla E^T_k}\mathbf{d_k}\right) ^2}\le & {} \frac{\left( \nabla E_k^T\mathbf{d_k}\right) ^2}{\left( \nabla E^T_{k-1}{} \mathbf{d_{k-1}}\right) ^2}\frac{{||\mathbf{d_{k-1}}||}^2}{{\left( {\nabla E^T_k}{} \mathbf{d_k}\right) }^2}-\frac{2}{{\nabla E^T_k}\mathbf{d_k}}-\frac{||\nabla E_k||^2}{{({\nabla E^T_k}{} \mathbf{d_k})}^2}\\= & {} {\frac{{||\mathbf{d_{k-1}}||}^2}{{\left( {\nabla E^T_{k-1}}\mathbf{d_{k-1}}\right) }^2}-{\left( \frac{1}{||\nabla E_k||}+\frac{||\nabla E_k||}{\left( {\nabla E^T_k}{} \mathbf{d_k}\right) }\right) }^2+\frac{1}{||\nabla E_k||^2}}\\\le & {} {\frac{{||\mathbf{d_{k-1}}||}^2}{{\left( {\nabla E^T_{k-1}}{} \mathbf{d_{k-1}}\right) }^2}+\frac{1}{||\nabla E_k||^2}}. \end{aligned}$$

It follows from

$$\begin{aligned} \frac{{||\mathbf {d_{0}}||}^2}{{\left( {\nabla E^T_{0}}\mathbf {d_{0}}\right) }^2}=\frac{1}{||\nabla E_0||^2} \end{aligned}$$

and (59) that

$$\begin{aligned} \frac{||\mathbf{d_k}||^2}{{\left( {\nabla E^T_k}\mathbf{d_k}\right) }^2}\le \sum _{j=0}^k\frac{1}{||\nabla E_j||^2}\le \frac{k+1}{c^2}. \end{aligned}$$

Furthermore,

$$\begin{aligned} \frac{{\left( {\nabla E^T_k}{} \mathbf{d_k}\right) }^2}{||\mathbf{d_k}||^2}\ge \frac{c^2}{k+1}, \end{aligned}$$

which indicates that

$$\begin{aligned} \sum _{k=0}^\infty \frac{{\left( \nabla E_k^T\mathbf{d_k}\right) }^2}{||\mathbf{d_k}||^2}=+\, \infty . \end{aligned}$$

This contradicts the Zoutendijk condition (48). Therefore, the conclusion (30) must hold. $\square $

The deduction of (19) for the adaptive parameter $\tilde{\mu }_k$:

The conjugate gradient direction (4) on the current epoch has been set to be equal to the quasi-Newton direction (17):

$$\begin{aligned} \mathbf{d_k}=\widehat{B}_{k}^{-1}\nabla E_k=-\, \nabla E_k+\beta _k^{MDY} \mathbf{d_{k-1}}. \end{aligned}$$

When $\theta _{k-1}\le 0$, using (10), we have

$$\begin{aligned} \widehat{B}_{k}^{-1}\nabla E_k=-\nabla E_k+\frac{\mu _k\Vert \nabla E_k\Vert ^2}{\mu _k {\mathbf{d}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T}{} \mathbf{d_{k-1}}}{} \mathbf{d_{k-1}}, \end{aligned}$$

i.e.

$$\begin{aligned} \nabla E_k=\widehat{B}_{k}\nabla E_k-\frac{\mu _k\Vert \nabla E_k\Vert ^2\widehat{B}_{k} \mathbf{s_{k-1}}}{\mu _k { \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T} \mathbf{s_{k-1}}}. \end{aligned}$$

Multiplying the above equality by $\mathbf{s_{k-1}}$, we have

$$\begin{aligned} \nabla E_{k}^T \mathbf{s_{k-1}}=\nabla E_{k}^{T}\widehat{B}_k \mathbf{s_{k-1}}-\frac{\mu _k\Vert \nabla E_k\Vert ^2}{\mu _k { \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T}\mathbf{s_{k-1}}}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}, \end{aligned}$$

By the modified secant condition (6), we have

$$\begin{aligned} \nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-\nabla E_{k}^{T}\mathbf{s_{k-1}}=\frac{\mu _k\Vert \nabla E_k\Vert ^2}{\mu _k { \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_k-\nabla E_{k-1}^{T}\mathbf{s_{k-1}}}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}, \end{aligned}$$

The adaptive parameter $\mu _k$ can be computed by

$$\begin{aligned} \mu _k=\frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{ \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}\mathbf{s_{k-1}}{\mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}}. \end{aligned}$$

When $\theta _{k-1}>0$, similar to the case of $\theta _{k-1}\le 0$, the adaptive parameter $\mu _k$ can be computed by

$$\begin{aligned} \mu _k= \frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{ \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}{ \mathbf{s}^{T}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}\mathbf{s_{k-1}}+\vartheta _k}, \end{aligned}$$

where $\vartheta _k=3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{ \hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{} \mathbf{s_{k-1}}$.

In summary, the adaptive parameter $\mu _k$ can be computed by the following formula:

$$\begin{aligned} \tilde{\mu }_k= {\left\{ \begin{array}{ll} \frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{\mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}{ \mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}}, &{}\text {if} \quad \theta _{k-1}\le 0,\\ \frac{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k-1}^{T}{} \mathbf{s_{k-1}}-\nabla E^{T}_{k}{} \mathbf{s_{k-1}}\nabla E_{k-1}^{T}\mathbf{s_{k-1}}}{\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}{ \mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\nabla E_{k}^{T}{} \mathbf{s_{k-1}}{ \mathbf{s}^{{T}}_{\mathbf{k}-\mathbf{1}}}\nabla E_{k}-\Vert \nabla E_k\Vert ^2{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}^{T}{} \mathbf{s_{k-1}}+\vartheta _k}, &{}\text {if} \quad \theta _{k-1}> 0, \end{array}\right. } \end{aligned}$$

where $\vartheta _k=3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}{\hat{\mathbf{y}}_{\mathbf{k}-\mathbf{1}}}-3l_{k-1}\theta _{k-1}\nabla E_{k}^{T}\mathbf{s_{k-1}}$.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Li, W., Liu, Y., Yang, J. et al. A New Conjugate Gradient Method with Smoothing $L_{1/2} $ Regularization Based on a Modified Secant Equation for Training Neural Networks. Neural Process Lett 48, 955–978 (2018). https://doi.org/10.1007/s11063-017-9737-9

Download citation

Published: 21 November 2017
Issue Date: October 2018
DOI: https://doi.org/10.1007/s11063-017-9737-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A New Conjugate Gradient Method with Smoothing \(L_{1/2} \) Regularization Based on a Modified Secant Equation for Training Neural Networks

Abstract

Access this article

Similar content being viewed by others

Two Improved Nonlinear Conjugate Gradient Methods with the Strong Wolfe Line Search

A conjugate gradient method with sufficient descent property

An Improved Conjugate Gradient Neural Networks Based on a Generalized Armijo Search Method

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix

Lemma 1

Proof

Lemma 2

Proof

Lemma 3

Proof

Lemma 4

Lemma 5

Proof

Proof of Theorem 2

Proof

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A New Conjugate Gradient Method with Smoothing \(L_{1/2} \) Regularization Based on a Modified Secant Equation for Training Neural Networks

Abstract

Access this article

Similar content being viewed by others

Two Improved Nonlinear Conjugate Gradient Methods with the Strong Wolfe Line Search

A conjugate gradient method with sufficient descent property

An Improved Conjugate Gradient Neural Networks Based on a Generalized Armijo Search Method

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix

Lemma 1

Proof

Lemma 2

Proof

Lemma 3

Proof

Lemma 4

Lemma 5

Proof

Proof of Theorem 2

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation