Abstract
We study a class of monotone inclusions, called “self-concordant inclusions,” which covers three fundamental convex optimization formulations as special cases. We develop a new generalized Newton-type framework to solve this inclusion. Our framework subsumes three schemes as specific instances: full-step, damped-step, and path-following methods, while allowing inexact computation of the generalized Newton directions. We prove the local quadratic convergence of both the full-step and the damped-step algorithms. Then, we propose a new two-phase inexact path-following scheme for solving this monotone inclusion which possesses an \({\mathcal {O}}(\sqrt{\nu }\log (1/\varepsilon ))\) worst-case iteration-complexity to achieve an \(\varepsilon \)-solution, where \(\nu \) is the barrier parameter and \(\varepsilon \) is a desired accuracy. As byproducts, we customize our scheme to solve three convex problems: the convex–concave saddle-point problem, the nonsmooth constrained convex program, and the nonsmooth convex program with linear constraints. We also provide three numerical examples to illustrate our theory and to compare with existing methods.

References
Auslender, A., Teboulle, M., Ben-Tiba, S.: A logarithmic-quadratic proximal method for variational inequalities. Comput. Optim. Appl. 12(1–3), 31–40 (1999)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, Berlin (2017)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Becker, S., Fadili, M.J.: A quasi-Newton proximal splitting method. In: Proceedings of Neural Information Processing Systems Foundation (NIPS) (2012)
Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Volume 3 of MPS/SIAM Series on Optimization. SIAM, Philadelphia (2001)
Bonnans, J.F.: Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Appl. Math. Optim. 29, 161–186 (1994)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Boyd, S., Vandenberghe, L.: Convex Optimization. University Press, Cambridge (2004)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, Berlin (2011)
Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward–backward splitting. Multiscale Model. Simul. 4, 1168–1200 (2005)
De Luca, T., Facchinei, F., Kanzow, C.: A semismooth equation approach to the solution of nonlinear complementarity problems. Math. Program. 75(3), 407–439 (1996)
Dontchev, A.L., Rockafellar, R.T.: Implicit Functions and Solution Mappings: A View from Variational Analysis. Springer, Berlin (2014)
Eckstein, J., Bertsekas, D.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Esser, J.E.: Primal-dual algorithm for convex models and applications to image restoration, registration and nonlocal inpainting. Ph.D. Thesis, University of California, Los Angeles (2010)
Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. 1-2. Springer, Berlin (2003)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Logist. Q. 3, 95–110 (1956)
Friedlander, M., Goh, G.: Efficient evaluation of scaled proximal operators. Electron. Trans. Numer. Anal. 46, 1–22 (2017)
Fukushima, M.: Equivalent differentiable optimization problems and descent methods for asymmetric variational inequality problems. Math. Program. 53, 99–110 (1992)
Goldstein, T., Li, M., Yuan, X., Esser, E., Baraniuk, R.: Adaptive primal-dual hybrid gradient methods for saddle-point problems, pp. 1–15 (2015). arXiv:1305.0546v2
Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Liberti, L., Maculan, N. (eds.) Global Optimization: From Theory to Implementation, Nonconvex Optimization and Its Applications, pp. 155–210. Springer, Berlin (2006)
Hajek, B., Wu, Y., Xu, J.: Achieving exact cluster recovery threshold via semidefinite programming. IEEE Trans. Inf. Theory 62, 2788–2797 (2016)
Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. In: JMLR W&CP, vol. 28, no. 1, pp. 427–435 (2013)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)
Korpelevic, G.M.: An extragradient method for finding saddle-points and for other problems. Èkon. Mat. Metody 12(4), 747–756 (1976)
Kummer, B.: Newton’s method for non-differentiable functions. Adv. Math. Optim. 45, 114–125 (1988)
Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference, Taipei, Taiwan (2004)
Monteiro, R.D.C., Svaiter, B.F.: Iteration-complexity of a Newton proximal extragradient method for monotone variational inequalities and inclusion problems. SIAM J. Optim. 22(3), 914–935 (2012)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nemirovski, A.: Prox-method with rate of convergence \({\cal{O}}(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Volume 87 of Applied Optimization. Kluwer Academic Publishers, Dordrecht (2004)
Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109(2–3), 319–344 (2007)
Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Math. Program. 110(2), 245–259 (2007)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. 140(1), 125–161 (2013)
Nesterov, Y., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, Philadelphia (1994)
Nesterov, Y., Todd, M.J.: Self-scaled barriers and interior-point methods for convex programming. Math. Oper. Res. 22(1), 1–42 (1997)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. Springer, New York (2006)
Pang, J.-S.: A B-differentiable equation-based, globally and locally quadratically convergent algorithm for nonlinear programs, complementarity and variational inequality problems. Math. Program. 51(1), 101–131 (1991)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 123–231 (2013)
Qi, L., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)
Ralph, D.: Global convergence of damped Newton’s method for nonsmooth equations via the path search. Math. Oper. Res. 19(2), 352–389 (1994)
Robinson, S.M.: Strongly regular generalized equations. Math. Oper. Res. 5(1), 43–62 (1980)
Robinson, S.M.: Newton’s method for a class of nonsmooth functions. Set Valued Var. Anal. 2, 291–305 (1994)
Rockafellar, R.T.: Convex Analysis. Volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton (1970)
Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, Berlin (1997)
Shefi, R., Teboulle, M.: Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optim. 24(1), 269–297 (2014)
Solodov, M.V., Svaiter, B.F.: A hybrid approximate extragradient-proximal point algorithm using the enlargement of a maximal monotone operator. Set Valued Var. Anal. 7(4), 323–345 (1999)
Sturm, J.F.: Using SeDuMi 1.02: a MATLAB toolbox for optimization over symmetric cones. Optim. Methods Softw. 11–12, 625–653 (1999)
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)
Toh, K.-C., Todd, M.J., Tütüncü, R.H.: On the implementation and usage of SDPT3—a Matlab software package for semidefinite-quadratic-linear programming. Technical Report 4, NUS Singapore (2010)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: An inexact proximal path-following algorithm for constrained convex minimization. SIAM J. Optim. 24(4), 1718–1745 (2014)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: A single phase proximal path-following framework. Math. Oper. Res. (2018) (accepted)
Tran-Dinh, Q., Necoara, I., Savorgnan, C., Diehl, M.: An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim. 23(1), 95–125 (2013)
Tseng, P.: Applications of splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim. 29, 119–138 (1991)
Tseng, P.: Alternating projection-proximal methods for convex programming and variational inequalities. SIAM J. Optim. 7(4), 951–965 (1997)
Wen, Z., Goldfarb, D., Yin, W.: Alternating direction augmented Lagrangian methods for semidefinite programming. Math. Program. Comput. 2, 203–230 (2010)
Wen, Z., Yin, W., Zhang, Y.: Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Math. Program. Comput. 4(4), 333–361 (2012)
Womersley, R.S., Sun, D., Qi, H.: A feasible semismooth asymptotically Newton method for mixed complementarity problems. Math. Program. 94(1), 167–187 (2002)
Wright, S.J.: Applying new optimization algorithms to model predictive control. In: Kantor J.C., Garcia C.E., Carnahan B. (eds) Fifth International Conference on Chemical Process Control—CPCV, pp. 147–155. American Institute of Chemical Engineers (1996)
Xiu, N., Zhang, J.: Some recent advances in projection-type methods for variational inequalities. J. Comput. Appl. Math. 152(1), 559–585 (2003)
Yamashita, H., Yabe, H., Harada, K.: A primal-dual interior point method for nonlinear semidefinite programming. Math. Program. 135, 89–121 (2012)
Yang, L., Sun, D., Toh, K.-C.: SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Math. Program. Comput. 7(3), 331–366 (2015)
Acknowledgements
This work was supported in part by the NSF Grant, USA, Award Number: 1619884.
Appendix: the proofs of technical results
This appendix provides the full proofs of all lemmas and theorems in the main text.
1.1 The proof of Lemma 1: the existence and uniqueness of the solution of (2)
Under Assumption A.1, the operator \(t\nabla {F}(\cdot ) + \mathcal {A}(\cdot )\) is maximally monotone for any \(t > 0\). We use [45, Theorem 12.51] to prove the solution existence of (2).
To this end, let \(\varvec{\omega }\ne \varvec{0}\) be chosen from the horizon cone of \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\). We need to find \(\mathbf {z}\in \mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\) with \(\mathbf {v}\in t \nabla {F}(\mathbf {z}) + \mathcal {A}(\mathbf {z})\) such that \(\langle \mathbf {v}, \varvec{\omega }\rangle > 0\). By assumption, there exists \(\hat{\mathbf {z}}\in \mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\) with \(\hat{\mathbf {a}}\in \mathcal {A}(\hat{\mathbf {z}}) \) such that \(\langle \hat{\mathbf {a}}, \varvec{\omega }\rangle >0\).
First, we show that \(\mathbf {z}_\tau = \hat{\mathbf {z}}+ \tau \varvec{\omega }\) belongs to \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\) for any \(\tau >0\). To see this, note that the assumption \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\ne \emptyset \) implies that \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {ri} \ \mathrm {dom}(\mathcal {A})\ne \emptyset \), which implies that the closure of \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\) is exactly \(\mathcal {Z}\cap \mathrm {cl}\!\left( \mathrm {dom}(\mathcal {A})\right) \). Choose \(\tau '>\tau \); by definition of the horizon cone, \(\mathbf {z}_{\tau '}\) belongs to the closure of \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\), so \(\mathbf {z}_{\tau '}\in \mathcal {Z}\) and \(\mathbf {z}_{\tau '}\in \mathrm {cl}\!\left( \mathrm {dom}(\mathcal {A})\right) \). Since \(\mathbf {z}_{\tau }\) is a convex combination of \(\hat{\mathbf {z}}\) and \(\mathbf {z}_{\tau '}\), it belongs to \(\mathrm {int}\left( \mathcal {Z}\right) \cap \mathrm {dom}(\mathcal {A})\), where we use the assumption that \(\mathrm {dom}(\mathcal {A})\) is either closed or open.
Next, for any \(\mathbf {a}_\tau \in \mathcal {A}(\mathbf {z}_\tau )\), we have
On the other hand, \(\langle t \nabla {F}(\mathbf {z}_\tau ), \varvec{\omega }\rangle = \langle t \nabla {F}(\mathbf {z}_\tau ), \tau ^{-1}(\mathbf {z}_\tau -\hat{\mathbf {z}})\rangle \ge -\,\tau ^{-1} t\nu \) by [31, Theorem 4.2.4]. Combining the above two inequalities, we can see that
as long as \(\tau ^{-1}t\nu < \langle \hat{\mathbf {a}},\varvec{\omega }\rangle \). We have thereby verified the condition in [45, Theorem 12.51], which guarantees that (2) has a nonempty (and bounded) solution set. Since \(\nabla F\) is strictly monotone, the solution of (2) is unique.
Since \(\mathbf {z}^{\star }_t\) is the solution of (2) and \(\mathbf {z}^{\star }_t \in \mathrm {int}\left( \mathcal {Z}\right) \), we have \(-\,t\nabla {F}(\mathbf {z}^{\star }_t) \in \mathcal {A}(\mathbf {z}^{\star }_t) = \mathcal {A}_{\mathcal {Z}}(\mathbf {z}^{\star }_t)\). Hence, \(\mathrm {dist}_{\mathbf {z}^{\star }_t}(\varvec{0}, \mathcal {A}_{\mathcal {Z}}(\mathbf {z}^{\star }_t)) \le t\left\| \nabla {F}(\mathbf {z}^{\star }_t)\right\| _{\mathbf {z}^{\star }_t}^{*} \le t\sqrt{\nu }\) due to the property of F [31]. Using Definition 4, we have the last conclusion. \(\square \)
1.2 The proof of Lemma 3: approximate solution
First, since \(\bar{\mathbf {z}}_{+}\) is a zero point of \(\widehat{\mathcal {A}}_t(\cdot ;z)\), i.e., \(0 \in \widehat{\mathcal {A}}_t(\bar{\mathbf {z}}_{+},z)\), we have \(-\,t\nabla {F}(\mathbf {z}) - t\nabla ^2{F}(\mathbf {z})(\bar{\mathbf {z}}_{+} - \mathbf {z}) \in \mathcal {A}(\bar{\mathbf {z}}_{+})\). Second, since \(\mathbf {z}_{+}\) is a \(\delta \)-solution to (23), there exists \(\mathbf {e}\) such that \(\mathbf {e}\in t\nabla {F}(\mathbf {z}) + t\nabla ^2{F}(\mathbf {z})(\mathbf {z}_{+} - \mathbf {z}) + \mathcal {A}(\mathbf {z}_{+})\) with \(\Vert \mathbf {e}\Vert _{\mathbf {z}}^{*} \le t\delta \) by Definition 5. Combining these expressions, and using the monotonicity of \(\mathcal {A}\) in Definition 1, we can show that \(\langle t[\nabla {F}(\mathbf {z}) + \nabla ^2{F}(\mathbf {z})(\mathbf {z}_{+} - \mathbf {z}) - \nabla {F}(\mathbf {z}) - \nabla ^2{F}(\mathbf {z})(\bar{\mathbf {z}}_{+} - \mathbf {z})] -\mathbf {e}, \bar{\mathbf {z}}_{+} - \mathbf {z}_{+} \rangle \ge 0\). This inequality leads to
which implies \(\Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} \le t^{-1}\Vert \mathbf {e}\Vert _{\mathbf {z}}^{*}\). Hence, \(\Vert \mathbf {e}\Vert _{\mathbf {z}}^{*}\le t\delta \) implies \(\Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} \le \delta \).
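In detail, the inequality used in this step is the elementary chain obtained from the monotonicity relation above together with the generalized Cauchy–Schwarz inequality in the local norms:
\[
t\Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}}^2
= \big \langle t\nabla ^2{F}(\mathbf {z})(\mathbf {z}_{+} - \bar{\mathbf {z}}_{+}),\, \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\big \rangle
\le \big \langle \mathbf {e},\, \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\big \rangle
\le \Vert \mathbf {e}\Vert _{\mathbf {z}}^{*}\,\Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}},
\]
and dividing by \(t\Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}}\) (when this quantity is nonzero) recovers the bound \(\Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} \le t^{-1}\Vert \mathbf {e}\Vert _{\mathbf {z}}^{*}\).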
Next, since \(\mathbf {z}_{+}\) is a \(\delta \)-approximate solution to (23) at \(t\) in the sense of Definition 5, there exists \(\mathbf {e}\in \mathbb {R}^p\) such that
In addition, we have \(\mathbf {z}_{+} \in \mathrm {int}\left( \mathcal {Z}\right) \) due to Theorem 1. Hence, we have \(\mathcal {A}_{\mathcal {Z}}(\mathbf {z}_{+}) = \mathcal {A}(\mathbf {z}_{+})\). Using this relation and the above inclusion, we can show that
Here, we have used \(\left\| \nabla {F}(\mathbf {z})\right\| _{\mathbf {z}}^{*} \le \sqrt{\nu }\), and \(\left\| \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\right\| _{\mathbf {z}} \le \delta \) by the first part of this lemma. Note that if \(\lambda _t(\mathbf {z}) + \delta < 1\), then \(\mathrm {dist}_{\mathbf {z}_{+}}(\varvec{0}, \mathcal {A}_{\mathcal {Z}}(\mathbf {z}_{+})) \le (1-\lambda _t(\mathbf {z}) -\delta )^{-1}\mathrm {dist}_{\mathbf {z}}(\varvec{0}, \mathcal {A}_{\mathcal {Z}}(\mathbf {z}_{+}))\). Combining this inequality and the last estimate, we obtain (25). Finally, if we choose \(t \le (1-\lambda _t(\mathbf {z})-\delta )\left( \sqrt{\nu } + \lambda _{t}(\mathbf {z}) + 2\delta \right) ^{-1}\varepsilon \), then \(\mathrm {dist}_{\mathbf {z}_{+}}(\varvec{0}, \mathcal {A}_{\mathcal {Z}}(\mathbf {z}_{+})) \le \varepsilon \). Hence, \(\mathbf {z}_{+}\) is an \(\varepsilon \)-solution to (1) in the sense of Definition 4. \(\square \)
1.3 The proof of Theorem 1: a key estimate of generalized Newton-type schemes
First, similar to [2], we can easily show that the following nonexpansiveness property holds:
Note that \(\Vert \mathbf {z}_{+} - \mathbf {z}\Vert _{\mathbf {z}} \le \Vert \bar{\mathbf {z}}_{+} - \mathbf {z}\Vert _{\mathbf {z}} + \Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} = \lambda _t(\mathbf {z}) + \delta (\mathbf {z}) < 1\) by our assumption. This shows that \(\mathbf {z}_{+}\in \mathrm {int}\left( \mathcal {Z}\right) \) due to [31, Theorem 4.1.5 (1)].
Next, we consider the generalized gradient mappings \(G_{\mathbf {z}}(\mathbf {z};t_{+})\) and \(G_{\mathbf {z}_{+}}(\mathbf {z}_{+}; t_{+})\) at \(\mathbf {z}\) and \(\mathbf {z}_{+}\), respectively defined by (20) as follows:
Let \(r_{\mathbf {z}}(\bar{\mathbf {z}}_{+}) := \nabla {F}(\mathbf {z}) + \nabla ^2{F}(\mathbf {z})(\bar{\mathbf {z}}_{+} - \mathbf {z})\). Then, by using \(\bar{\mathbf {z}}_{+} := \mathcal {P}_{\mathbf {z}}\big (\mathbf {z}- \nabla ^2{F}(\mathbf {z})^{-1}\nabla {F}(\mathbf {z}); t_{+}\big )\) from (26), we can show that
Clearly, we can rewrite (69) as \(\bar{\mathbf {z}}_{+} -\nabla ^2{F}(\mathbf {z}_{+})^{-1}r_{\mathbf {z}}(\bar{\mathbf {z}}_{+}) \in \bar{\mathbf {z}}_{+} + t_{+}^{-1}\nabla ^2{F}(\mathbf {z}_{+})^{-1}\mathcal {A}(\bar{\mathbf {z}}_{+})\). Then, using the definition (16) of \(\mathcal {P}_{\mathbf {z}_{+}}(\cdot ) := \left( \mathbb {I}+ t_{+}^{-1}\nabla ^2{F}(\mathbf {z}_{+})^{-1}\mathcal {A}\right) ^{-1}(\cdot )\), we can derive
Now, we can estimate \(\lambda _{t_{+}}(\mathbf {z}_{+})\) defined by (21) using (68), (70), (67), and (69) as follows:
Here, in the last equality of (71), we have used the fact that \(\Vert \mathbf {w}\Vert _{\mathbf {z}_{+}}^2 = \langle \nabla ^2{F}(\mathbf {z}_{+})\mathbf {w}, \mathbf {w}\rangle \le (1-\Vert \mathbf {z}_{+} - \mathbf {z}\Vert _{\mathbf {z}})^{-2}\langle \nabla ^2{F}(\mathbf {z})\mathbf {w}, \mathbf {w}\rangle = (1-\Vert \mathbf {z}_{+} - \mathbf {z}\Vert _{\mathbf {z}})^{-2}\Vert \mathbf {w}\Vert _{\mathbf {z}}^2\) for any \(\mathbf {w}\) and \(\mathbf {z}, \mathbf {z}_{+}\) such that \(\Vert \mathbf {z}_{+} - \mathbf {z}\Vert _{\mathbf {z}} < 1\), and the analogous fact for the dual norms. Both facts can be derived from [31, Theorem 4.1.6]. The condition \(\Vert \mathbf {z}_{+} - \mathbf {z}\Vert _{\mathbf {z}} < 1\) is guaranteed since \(\Vert \mathbf {z}_{+} - \mathbf {z}\Vert _{\mathbf {z}} \le \Vert \mathbf {z}- \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} + \Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} = \lambda _{t_{+}}(\mathbf {z}) + \delta (\mathbf {z}) < 1\) by our assumption.
Similar to the proof of [31, Theorem 4.1.14], we can show that
Next, we need to estimate \(B := \Vert (\nabla ^2{F}(\mathbf {z}_{+}) - \nabla ^2{F}(\mathbf {z}))(\bar{\mathbf {z}}_{+} - \mathbf {z}_{+})\Vert _{\mathbf {z}}^{*}\). We define
By [31, Theorem 4.1.6], we can show that
Using this inequality we can estimate B as
which implies
Substituting (72) and (73) into (71) we get
Finally, we note that \(\lambda _{t_{+}}(\mathbf {z}) := \Vert G_{\mathbf {z}}(\mathbf {z}; t_{+})\Vert _{\mathbf {z}}^{*} = \Vert \mathbf {z}- \mathcal {P}_{\mathbf {z}}\left( \mathbf {z}- \nabla ^2{F}(\mathbf {z})^{-1}\nabla {F}(\mathbf {z}); t_{+} \right) \Vert _{\mathbf {z}} = \Vert \mathbf {z}- \bar{\mathbf {z}}_{+} \Vert _{\mathbf {z}}\) due to (26). Using the triangle inequality we have \(\Vert \mathbf {z}_{+}-\mathbf {z}\Vert _{\mathbf {z}} \le \Vert \mathbf {z}- \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} + \Vert \mathbf {z}_{+} - \bar{\mathbf {z}}_{+}\Vert _{\mathbf {z}} = \lambda _{t_{+}}(\mathbf {z}) + \delta (\mathbf {z}) < 1\). Since the right-hand side of (74) is monotonically increasing with respect to \(\Vert \mathbf {z}_{+}-\mathbf {z}\Vert _{\mathbf {z}}\), using the last inequality into (74), we obtain (27). \(\square \)
1.4 The proof of Theorem 2: local quadratic convergence of FGN
We first prove (a). Given a fixed parameter \(t > 0\) sufficiently small, our objective is to find \(\beta \in (0, 1)\) such that if \(\lambda _{t}(\mathbf {z}^k) \le \beta \), then \(\lambda _{t}(\mathbf {z}^{k+1}) \le \beta \). Indeed, using the key estimate (27) with t instead of \(t_{+}\), we can see that to guarantee \(\lambda _{t}(\mathbf {z}^{k+1}) \le \beta \), we require
Since the left-hand side of this inequality is monotonically increasing when \(\lambda _{t}(\mathbf {z}^k)\) and \(\delta (\mathbf {z}^k)\) are increasing, we can overestimate it by
Using the identity \(\frac{\beta +\delta }{1-\beta -\delta } = \frac{\beta }{1-\beta } + \frac{\delta }{(1-\beta )(1-\beta -\delta )}\), we can write the last inequality as
Clearly, the left-hand side of (75) is positive if \(0< \delta < 1-\beta \). Hence, we need to choose \(\beta \in (0, 0.5(3-\sqrt{5}))\) such that the right-hand side of (75) is also positive. Now, we choose \(\delta \ge 0\) such that \(\delta \le \beta (1-\beta ) < 1-\beta \). Then, (75) can be overestimated once more by
which implies
This inequality suggests that we can choose \(\delta := \frac{\beta (1 - 3\beta + \beta ^2)(1-\beta )^4}{2\beta ^3 - 5\beta ^2 + 3\beta + 1} > 0\). In this case, we also have \(\delta (\mathbf {z}) + \lambda _t(\mathbf {z}) \le \delta + \beta < 1\), which guarantees the condition of Theorem 1. Hence, we can conclude that \(\lambda _t(\mathbf {z}^k) \le \beta \) implies \(\lambda _t(\mathbf {z}^{k+1}) \le \beta \). In other words, \(\left\{ \mathbf {z}^k\right\} \) belongs to \(\mathcal {Q}_{t}(\beta )\).
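For a concrete illustration (our own numerical instance, not taken from the main text): with \(\beta = 0.1\), the formula above gives
\[
\delta := \frac{0.1\,(1 - 0.3 + 0.01)\,(0.9)^4}{2(0.1)^3 - 5(0.1)^2 + 3(0.1) + 1} = \frac{0.1\cdot 0.71\cdot 0.6561}{1.252} \approx 0.0372,
\]
so, in this regime, each generalized Newton subproblem in FGN only needs to be solved up to an accuracy of about \(0.037\) in the local norm to keep the iterates inside \(\mathcal {Q}_{t}(\beta )\).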
(b) Next, to guarantee a quadratic convergence, we can choose \(\delta _k\) such that \(\delta (\mathbf {z}^k) \le \delta _k \le \bar{\delta }_k := \frac{\lambda _{t}(\mathbf {z}^k)^2}{1-\lambda _{t}(\mathbf {z}^k)}\). Substituting the upper bound \(\bar{\delta }_k\) of \(\delta (\mathbf {z}^k)\) into (27) we obtain
Let us consider the function \(s(r) := \frac{(2 - 4r + r^2)r^2}{(1 - 2r)^3}\) on [0, 1]. We can easily check that \(s(r) < 1\) for all \(r \in [0, 1]\). Hence, \(\lambda _{t}(\mathbf {z}^{k+1}) < 1\) as long as \(\lambda _{t}(\mathbf {z}^k) < 1\). This proves the estimate (30).
Now, let us choose some \(\beta \in (0, 1)\) such that \(\lambda _t(\mathbf {z}^k) \le \beta \). Then (30) leads to
where \(c := \frac{2-4\beta +\beta ^2}{(1-2\beta )^3} > 0\). We need to choose \(\beta \in (0, 1)\) such that \(c\lambda _t(\mathbf {z}^k) < 1\). Since \(\lambda _t(\mathbf {z}^k) \le \beta \), we choose \(\beta \) such that \(c\beta < 1\), which is equivalent to \(9\beta ^3 - 16\beta ^2 + 8\beta - 1 < 0\). If \(\beta \in (0, 0.18858]\), then \(9\beta ^3 - 16\beta ^2 + 8\beta - 1 < 0\). Therefore, the radius of the quadratic convergence region of \(\left\{ \lambda _t(\mathbf {z}^k)\right\} \) is \(r := 0.18858\).
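As a quick numerical check of this threshold (our own rounding): evaluating the cubic at \(\beta = 0.18858\) gives
\[
9(0.18858)^3 - 16(0.18858)^2 + 8(0.18858) - 1 \approx -1.4\times 10^{-6} < 0,
\]
so \(0.18858\) sits just below the smallest positive root of \(9\beta ^3 - 16\beta ^2 + 8\beta - 1 = 0\), and the quoted radius is sharp up to rounding.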
(c) Finally, for any \(\beta \in (0, 0.18858]\), we can write \(c\lambda _t(\mathbf {z}^{k+1}) \le (c\lambda _t(\mathbf {z}^k))^2\). By induction, \(c\lambda _t(\mathbf {z}^k) \le (c\lambda _t(\mathbf {z}^0))^{2^k} \le c^{2^k}\beta ^{2^k} < 1\). We obtain \(\lambda _t(\mathbf {z}^k) \le c^{2^k-1}\beta ^{2^k}\). Let us choose \(\delta _k := \frac{\lambda _t(\mathbf {z}^k)^2}{1-\lambda _t(\mathbf {z}^k)}\). For \(\epsilon \in (0, \beta )\), assume that \(c^{2^k-1}\beta ^{2^k}\le \epsilon \). From Lemma 3, we can choose \(t := (1-\epsilon )(\sqrt{\nu } + \epsilon + 2\epsilon ^2/(1-\epsilon ))^{-1}\varepsilon \). Then, \(\mathbf {z}^k\) is an \(\varepsilon \)-solution of (1). Finally, since \(c^{2^k-1}\beta ^{2^k}\le \epsilon \) is equivalent to \((c\beta )^{2^k} \le c\epsilon \), taking logarithms twice shows that this condition holds after at most \(k := {\mathcal {O}}\left( \ln \left( \ln (1/\epsilon )\right) \right) \) iterations. \(\square \)
1.5 The proof of Theorem 3: local quadratic convergence of DGN
(a) Given a fixed parameter \(t > 0\) sufficiently small, it follows from DGN and (70) that
Hence, using these notations and the same proof as (74) with t instead of \(t_{+}\), and assuming \(\Vert \mathbf {z}^{k+1} - \mathbf {z}^k\Vert _{\mathbf {z}^k}<1\), we can derive
Now, let us define \(\tilde{\lambda }_t(\mathbf {z}^k) := \Vert \tilde{\mathbf {z}}^{k+1} - \mathbf {z}^k\Vert _{\mathbf {z}^k}\) and \(\alpha _k := (1 + \tilde{\lambda }_t(\mathbf {z}^k))^{-1}\) as in DGN. From the update \(\mathbf {z}^{k+1} := (1-\alpha _k)\mathbf {z}^k + \alpha _k\tilde{\mathbf {z}}^{k+1}\) of DGN, we have
Substituting these expressions into (76) we get
Substituting \(\alpha _k := (1 + \tilde{\lambda }_{t}(\mathbf {z}^k))^{-1}\) into the last inequality and simplifying the result, we get
Next, by the triangle inequality, it follows from (68) and the definition of \(\lambda _t(\mathbf {z})\) and \(\tilde{\lambda }_t(\mathbf {z})\) that \(\tilde{\lambda }_t(\mathbf {z}^{k+1}) = \Vert \tilde{\mathbf {z}}^{k+2} - \mathbf {z}^{k+1}\Vert _{\mathbf {z}^{k+1}} \le \Vert \bar{\mathbf {z}}^{k+2} - \mathbf {z}^{k+1}\Vert _{\mathbf {z}^{k+1}} + \Vert \tilde{\mathbf {z}}^{k+2} - \bar{\mathbf {z}}^{k+2}\Vert _{\mathbf {z}^{k+1}} = \Vert \bar{\mathbf {z}}^{k+2} - \mathbf {z}^{k+1}\Vert _{\mathbf {z}^{k+1}} + \delta (\mathbf {z}^{k+1})\). Combining this estimate and the above inequality we get
If we choose \(\delta (\mathbf {z}^k) \le \delta _k \le \frac{\tilde{\lambda }_{t}(\mathbf {z}^k)^2}{1+ \tilde{\lambda }_{t}(\mathbf {z}^k)}\), then, by induction, \(\delta (\mathbf {z}^{k+1}) \le \delta _{k+1} \le \frac{\tilde{\lambda }_{t}(\mathbf {z}^{k+1})^2}{1+\tilde{\lambda }_{t}(\mathbf {z}^{k+1})}\). Substituting these bounds into the last inequality and simplifying the result, we obtain
which is indeed (31).
From (31), after a few elementary calculations, we can see that \(\tilde{\lambda }_{t}(\mathbf {z}^{k+1}) \le \tilde{\lambda }_t(\mathbf {z}^k)\) if \(\tilde{\lambda }_t(\mathbf {z}^k)(1 + \tilde{\lambda }_t(\mathbf {z}^k))( 2\tilde{\lambda }_t(\mathbf {z}^k)^2 + 4\tilde{\lambda }_t(\mathbf {z}^k) + 3) \le 1\). Note that the function \(s(\tau ) := \tau (1+\tau )(2\tau ^2 + 4\tau +3)\) is increasing on \([0, 0.5(3-\sqrt{5}))\). Solving the inequality \(s(\tau ) \le 1\) numerically, we can observe that if \(\tilde{\lambda }_t(\mathbf {z}^k) \in [0, 0.21027]\), then \(\tilde{\lambda }_{t}(\mathbf {z}^{k+1}) \le \tilde{\lambda }_t(\mathbf {z}^k)\). Hence, if \(\tilde{\lambda }_t(\mathbf {z}^k) \le \beta \) then \(\tilde{\lambda }_t(\mathbf {z}^{k+1}) \le \beta \). In other words, we can say that \(\left\{ \mathbf {z}^k\right\} \subset {\varOmega }_t(\beta )\).
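As a quick numerical check of this threshold (our own rounding; note that \(s(\tau ) \le 1\) expands to \(2\tau ^4 + 6\tau ^3 + 7\tau ^2 + 3\tau - 1 \le 0\), the same polynomial condition that reappears in part (b) below):
\[
s(0.21027) = 0.21027\cdot 1.21027\cdot \big (2(0.21027)^2 + 4(0.21027) + 3\big ) \approx 0.999995 \le 1 .
\]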
We now prove (b). Indeed, if we take any \(\beta \in (0, 0.21027]\), we can show from (31) that
where \(\bar{c} := \left( \frac{2\beta ^2 + 4\beta + 3}{1 - \beta ^2\left( 2\beta ^2 + 4\beta + 3\right) }\right) \in (0, +\,\infty )\). To guarantee \(\bar{c}\beta < 1\), we need to choose \(\beta > 0\) such that \(2\beta ^4 + 6\beta ^3 + 7\beta ^2 + 3\beta - 1 < 0\). This condition leads to \(\beta \in (0, 0.21027]\). Hence, for any \(0 < \beta \le 0.21027\), if \(\mathbf {z}^0\in \mathcal {Q}_{t}(\beta )\), then \(\tilde{\lambda }_{t}(\mathbf {z}^{k+1}) \le \bar{c}\tilde{\lambda }_{t}(\mathbf {z}^k)^2 < 1\) and, therefore, \(\big \{\tilde{\lambda }_{t}(\mathbf {z}^k)\big \}\) quadratically converges to zero.
(c) To prove the last conclusion in (c), from (66), we can show that
Since \(\tilde{\lambda }_t(\mathbf {z}^k) \le \bar{c}^{2^k-1}\lambda _t(\mathbf {z}^0)^{2^k} \le \bar{c}^{2^k-1}\beta ^{2^k}\), \(\delta _k \le \frac{\tilde{\lambda }_t(\mathbf {z}^k)^2}{1 + \tilde{\lambda }_t(\mathbf {z}^k)}\), and \(\alpha _k = \frac{\tilde{\lambda }_t(\mathbf {z}^k)}{1 + \tilde{\lambda }_t(\mathbf {z}^k)}\), we obtain the last conclusion as a consequence of Lemma 3 with the same proof as in Theorem 2. \(\square \)
1.6 The proof of Lemma 4: the update rule for the penalty parameter
Let us define \(\bar{\mathbf {u}}^k := \mathcal {P}_{\mathbf {z}^k}\left( \mathbf {z}^k - \nabla ^2{F}(\mathbf {z}^k)^{-1}\nabla {F}(\mathbf {z}^k); t_k\right) \). Then, \(\lambda _{t_k}(\mathbf {z}^k)\) defined by (21) becomes \(\lambda _{t_k}(\mathbf {z}^k) := \Vert G_{\mathbf {z}^k}(\mathbf {z}^k; t_k)\Vert _{\mathbf {z}^k}^{*} = \Vert \mathbf {z}^k - \mathcal {P}_{\mathbf {z}^k}\left( \mathbf {z}^k - \nabla ^2{F}(\mathbf {z}^k)^{-1}\nabla {F}(\mathbf {z}^k); t_k\right) \Vert _{\mathbf {z}^k} = \Vert \mathbf {z}^k - \bar{\mathbf {u}}^k \Vert _{\mathbf {z}^k}\). Note that \(\bar{\mathbf {u}}^k = \mathcal {P}_{\mathbf {z}^k}\left( \mathbf {z}^k - \nabla ^2{F}(\mathbf {z}^k)^{-1}\nabla {F}(\mathbf {z}^k); t_k\right) \) leads to
Combining this inclusion and (69) and using the monotonicity of \(\mathcal {A}\), we can derive
By rearranging this expression using \(t_{k+1} := (1-\sigma _{\beta })t_k\) from PFGN, we finally obtain
where the last inequality follows from the elementary Cauchy–Schwarz inequality. This inequality eventually leads to
Now, by the triangle inequality, we have \(\Vert \bar{\mathbf {z}}^{k+1} - \mathbf {z}^k\Vert _{\mathbf {z}^k} \le \Vert \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\Vert _{\mathbf {z}^k} + \Vert \bar{\mathbf {u}}^k - \mathbf {z}^k\Vert _{\mathbf {z}^k}\). This inequality is equivalent to \(\lambda _{t_{k+1}}(\mathbf {z}^k) \le \Vert \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\Vert _{\mathbf {z}^k} + \lambda _{t_k}(\mathbf {z}^k)\) due to the definitions \(\lambda _{t_{k+1}}(\mathbf {z}^k) = \Vert \bar{\mathbf {z}}^{k+1} - \mathbf {z}^k\Vert _{\mathbf {z}^k}\) and \(\lambda _{t_k}(\mathbf {z}^k) = \Vert \bar{\mathbf {u}}^k - \mathbf {z}^k\Vert _{\mathbf {z}^k}\). Using the last estimate in the above inequality we get
which is (32). The second inequality of (32) follows from the fact that \(\Vert \nabla {F}(\mathbf {z}^k)\Vert _{\mathbf {z}^k}^{*}\le \sqrt{\nu }\).
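For readability, here is a spelled-out sketch of the monotonicity and Cauchy–Schwarz computation behind this bound, written with the notation above and using the inclusions that characterize \(\bar{\mathbf {u}}^k\) and \(\bar{\mathbf {z}}^{k+1}\) together with \(t_k - t_{k+1} = \sigma _{\beta }t_k\):
\[
t_{k+1}\Vert \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\Vert _{\mathbf {z}^k}^2
\;\le\; (t_k - t_{k+1})\big \langle \nabla {F}(\mathbf {z}^k) + \nabla ^2{F}(\mathbf {z}^k)(\bar{\mathbf {u}}^k - \mathbf {z}^k),\, \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\big \rangle
\;\le\; \sigma _{\beta }t_k\big (\Vert \nabla {F}(\mathbf {z}^k)\Vert _{\mathbf {z}^k}^{*} + \lambda _{t_k}(\mathbf {z}^k)\big )\Vert \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\Vert _{\mathbf {z}^k},
\]
where we use \(\Vert \nabla ^2{F}(\mathbf {z}^k)(\bar{\mathbf {u}}^k - \mathbf {z}^k)\Vert _{\mathbf {z}^k}^{*} = \Vert \bar{\mathbf {u}}^k - \mathbf {z}^k\Vert _{\mathbf {z}^k} = \lambda _{t_k}(\mathbf {z}^k)\). Dividing by \(t_{k+1}\Vert \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\Vert _{\mathbf {z}^k}\) (when nonzero) and using \(t_{k+1} = (1-\sigma _{\beta })t_k\) gives \(\Vert \bar{\mathbf {z}}^{k+1} - \bar{\mathbf {u}}^k\Vert _{\mathbf {z}^k} \le \frac{\sigma _{\beta }}{1-\sigma _{\beta }}\big (\Vert \nabla {F}(\mathbf {z}^k)\Vert _{\mathbf {z}^k}^{*} + \lambda _{t_k}(\mathbf {z}^k)\big )\), which is exactly the quantity \(\gamma _k\) introduced below once \(\Vert \nabla {F}(\mathbf {z}^k)\Vert _{\mathbf {z}^k}^{*}\) is bounded by \(\sqrt{\nu }\).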
Let us denote by \(\gamma _k := \left( \frac{\sigma _{\beta }}{1-\sigma _{\beta }}\right) \left( \sqrt{\nu } + \lambda _{t_k}(\mathbf {z}^k)\right) \). For a given \(\beta \in (0, 1)\), we now assume that \(\lambda _{t_k}(\mathbf {z}^k) \le \beta \). Then, by using (32) in (27), and the monotonic increase of its right-hand side with respect to \(\lambda _{t_{k+1}}(\mathbf {z}^k)\), we can derive
as long as \(\beta + \vert \gamma _k\vert + \delta _k <1\). Let us denote \(\theta _k := \beta + \vert \gamma _k\vert \). By using the identity \(\frac{\beta + \vert \gamma _k\vert + \delta _k}{1 - \beta - \vert \gamma _k\vert -\delta _k} = \frac{\beta + \vert \gamma _k\vert }{1 - \beta - \vert \gamma _k\vert } + \frac{\delta _k}{(1-\theta _k)(1-\theta _k-\delta _k)}\), we can rewrite the last inequality as
If we choose \(\delta _k\) such that \(0\le \delta _k \le \theta _k(1-\theta _k)<1-\theta _k\), then the above inequality implies
Take any \(c\in (0, 1)\), e.g., \(c := 0.95\), and choose \(\delta _k\) such that \(0 \le \delta _k \le \frac{(1-c^2)}{c^2M_k}\left( \frac{\theta _k}{1-\theta _k}\right) ^2\). Hence, in order to guarantee \(\lambda _{t_{k+1}}(\mathbf {z}^{k+1}) \le \beta \), by using (77), we can impose the condition \(\left( \frac{\theta _k}{1 - \theta _k }\right) ^2 + M_k\delta _k \le \frac{1}{c^2}\left( \frac{\theta _k}{1-\theta _k}\right) ^2 \le \beta \), which is equivalent to \(\frac{\theta _k}{1 - \theta _k} \le c\sqrt{\beta }\). This condition leads to \(\theta _k \le \frac{c\sqrt{\beta }}{1+c\sqrt{\beta }} \), and therefore, \(\vert \gamma _k\vert \le \frac{c\sqrt{\beta }}{1+c\sqrt{\beta }} - \beta \). Since \(\vert \gamma _k\vert > 0\), we need to choose \(\beta \) such that the right-hand side of this bound is positive, i.e., \(0< \beta < 0.5(1 + 2c^2 - \sqrt{1 +4c^2})\).
Next, by the choice of \(\delta _k\), we require \(0 \le \delta _k \le \min \left\{ \frac{(1-c^2)}{c^2M_k}\left( \frac{\theta _k}{1-\theta _k}\right) ^2, \theta _k(1-\theta _k)\right\} \). Using the fact that \(M_k = \frac{2\theta _k(1-\theta _k)^2 + \theta _k(1-\theta _k) + 1}{(1-\theta _k)^6}\) from (77) and \(0 \le \theta _k \le \frac{c\sqrt{\beta }}{1+c\sqrt{\beta }}\), we can show that the condition on \(\delta _k\) holds if we choose
On the other hand, we have \(\vert \gamma _k\vert = \left| \left( \frac{\sigma _{\beta }}{1-\sigma _{\beta }}\right) \left( \sqrt{\nu } + \lambda _{t_k}(\mathbf {z}^k)\right) \right| \le \left( \frac{\sigma _{\beta }}{1-\sigma _{\beta }}\right) \left( \sqrt{\nu } + \beta \right) \). In order to guarantee that \(\vert \gamma _k\vert \le \frac{c\sqrt{\beta }}{1+c\sqrt{\beta }} - \beta \), we use the above estimate to impose a condition \(\left( \frac{\sigma _{\beta }}{1-\sigma _{\beta }}\right) \le \frac{1}{\sqrt{\nu } + \beta }\left( \frac{c\sqrt{\beta }}{1 + c\sqrt{\beta }} - \beta \right) \), which leads to
This estimate is exactly the right-hand side of (33). Finally, using (32) and the definition of \(\gamma _k\), we can easily show that \(\lambda _{t_{k+1}}(\mathbf {z}^k) \le \lambda _{t_k}(\mathbf {z}^k) + \left| \gamma _k\right| \le \beta + \left| \gamma _k\right| \equiv \theta _k \le \frac{c\sqrt{\beta }}{1+c\sqrt{\beta }}\).
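For a concrete sense of scale (our own numerical instance of the condition just imposed, not taken from the main text): with \(c = 0.95\), \(\beta = 0.05\), and \(\nu = 100\), the condition on \(\sigma _{\beta }\) reads
\[
\frac{\sigma _{\beta }}{1-\sigma _{\beta }} \;\le\; \frac{1}{\sqrt{100} + 0.05}\left( \frac{0.95\sqrt{0.05}}{1 + 0.95\sqrt{0.05}} - 0.05\right) \approx 0.0125,
\]
i.e., \(\sigma _{\beta }\) can be taken up to about \(0.0123\), so the penalty parameter \(t_k\) decreases by a factor of roughly \(0.988\) per iteration; this \({\mathcal {O}}(1/\sqrt{\nu })\) decrease rate is what yields the \({\mathcal {O}}(\sqrt{\nu }\log (1/\varepsilon ))\) complexity in Theorem 4.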
\(\square \)
1.7 The proof of Theorem 4: the worst-case iteration-complexity of PFGN
By Lemma 3 and \(\lambda _{t_{k+1}}(\mathbf {z}^k) \le \frac{c\sqrt{\beta }}{1 + c\sqrt{\beta }}\), we can see that \(\mathbf {z}^k\) is an \(\varepsilon \)-solution of (1) if \(t_k := M_0^{-1}\varepsilon \), where \(M_0 := \left( 1 - \frac{c\sqrt{\beta }}{1 + c\sqrt{\beta }}\right) ^{-1}\left( \sqrt{\nu } + \frac{c\sqrt{\beta }}{1 + c\sqrt{\beta }} + 2\bar{\delta }_t(\beta )\right) = {\mathcal {O}}(\sqrt{\nu })\).
On the other hand, by induction, it follows from the update rule \(t_{k+1} = (1-\sigma _{\beta })t_k\) of PFGN that \(t_k = (1-\sigma _{\beta })^kt_0\). Hence, \(\mathbf {z}^k\) is an \(\varepsilon \)-solution of (1) if we have \(t_k = (1-\sigma _{\beta })^k t_0 \le \frac{\varepsilon }{M_0}\). This condition holds as soon as \(k\ln (1-\sigma _{\beta }) \le \ln \left( \frac{\varepsilon }{M_0t_0}\right) \), i.e., \(k \ge \frac{\ln (\varepsilon /(M_0t_0))}{\ln (1-\sigma _{\beta })}\). Using the elementary inequality \(\ln (1-\sigma _{\beta }) \le -\sigma _{\beta }\), we can upper bound the number of iterations needed by \(\frac{1}{\sigma _{\beta }}\ln \left( \frac{M_0t_0}{\varepsilon }\right) \), up to rounding.
Consequently, the worst-case iteration-complexity of PFGN is \({\mathcal {O}}\left( \sqrt{\nu }\ln \left( \frac{\sqrt{\nu } t_0}{\varepsilon }\right) \right) \).
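In other words, here is a sketch of where each factor comes from, assuming \(\sigma _{\beta }\) is chosen proportional to the upper bound allowed by (33), so that \(1/\sigma _{\beta } = {\mathcal {O}}(\sqrt{\nu })\):
\[
k \;\le\; \underbrace{\frac{1}{\sigma _{\beta }}}_{={\mathcal {O}}(\sqrt{\nu })}\,\ln \Big ( \underbrace{M_0}_{={\mathcal {O}}(\sqrt{\nu })}\cdot \frac{t_0}{\varepsilon }\Big )
\;=\; {\mathcal {O}}\Big ( \sqrt{\nu }\,\ln \Big ( \frac{\sqrt{\nu }\,t_0}{\varepsilon }\Big )\Big ).
\]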
\(\square \)
1.8 The proof of Theorem 5: finding an initial point for PFGN
From (35), if we define \(\nabla {\hat{F}}(\hat{\mathbf {z}}^j) := \nabla {F}(\hat{\mathbf {z}}^j) - t_0^{-1}\tau _{j+1}\hat{\zeta }^0\), then we still have \(\nabla ^2{\hat{F}}(\hat{\mathbf {z}}^j) = \nabla ^2{F}(\hat{\mathbf {z}}^j)\). Hence, the estimate (27) still holds for \(\hat{\lambda }_{\tau }(\hat{\mathbf {z}}^j)\).
Next, if we define \(\bar{\mathbf {v}}^j := \mathcal {P}_{\hat{\mathbf {z}}^j}\left( \hat{\mathbf {z}}^j - \nabla ^2{F}(\hat{\mathbf {z}}^j)^{-1}\left( \nabla {F}(\hat{\mathbf {z}}^j)- \tau _jt_0^{-1}\hat{\zeta }^0\right) ; t_0\right) \), then, by the definition of \(\mathcal {P}_{\hat{\mathbf {z}}^j}\), we have
Similarly, since \(\bar{\hat{\mathbf {z}}}^{j+1} := \mathcal {P}_{\hat{\mathbf {z}}^j}\left( \hat{\mathbf {z}}^j - \nabla ^2{F}(\hat{\mathbf {z}}^j)^{-1}\left( \nabla {F}(\hat{\mathbf {z}}^j)- \tau _{j+1}t_0^{-1}\hat{\zeta }^0\right) ; t_0\right) \), we have
Using (78), (79), and the monotonicity of \(\mathcal {A}\), we have
Using \(\tau _{j+1} := \tau _j - {\varDelta }_j\) and the Cauchy–Schwarz inequality, the last inequality leads to
Now, similar to the proof of Lemma 4, using (80), we can derive
By the same argument as the proof of (33), we can show that with \(\hat{\gamma }_j := \frac{{\varDelta }_j}{t_0}\Vert \hat{\zeta }^0\Vert _{\hat{\mathbf {z}}^j}^{*}\), we have \(\left| \hat{\gamma }_j\right| \le \frac{c\sqrt{\eta }}{1+c\sqrt{\eta }} - \eta \). This shows that \({\varDelta }_j \le \frac{t_0}{\Vert \hat{\zeta }^0\Vert _{\hat{\mathbf {z}}^j}^{*}}\left( \frac{c\sqrt{\eta }}{1+c\sqrt{\eta }} - \eta \right) \), which is the first estimate of (37). The second estimate of (37) can be derived as in Lemma 4 using \(\eta \) instead of \(\beta \).
We prove (38). From (21) and (36), using the triangle inequality, we can upper bound
which proves the first inequality of (38).
By [31, Corollary 4.2.1], we have \(\Vert \hat{\zeta }^0\Vert _{\hat{\mathbf {z}}^j}^{*} \le \kappa \Vert \hat{\zeta }^0\Vert _{\bar{\mathbf {z}}_F^{\star }}^{*}\), where \(\bar{\mathbf {z}}_F^{\star }\) and \(\kappa \) are given by (15) and below (15), respectively. Hence, \(\bar{{\varDelta }}_{\eta } := \frac{\mu _{\eta }}{\kappa \Vert \hat{\zeta }^0\Vert _{\bar{\mathbf {z}}_F^{\star }}^{*}} \le \bar{{\varDelta }}_{j}\). The second estimate of (38) follows from \(\tau _j := \tau - \sum _{l=0}^{j-1}{\varDelta }_l \le 1 - j\bar{{\varDelta }}_{\eta }\) due to the update rule (35) with \({\varDelta }_j := \bar{{\varDelta }}_{j} \ge \bar{{\varDelta }}_{\eta }\). In order to guarantee \(\lambda _{t_0}(\mathbf {z}^0) \le \beta \), it follows from (38) and the update rule of \(\tau _j\) that
Finally, substituting \(\bar{{\varDelta }}_{\eta } = \frac{t_0}{\kappa \Vert \hat{\zeta }_0\Vert _{\bar{\mathbf {z}}^{\star }_F}^{*}}\left( \frac{c\sqrt{\eta }}{1+c\sqrt{\eta }} - \eta \right) \) into this estimate and after simplifying the result, we obtain the remaining conclusion of Theorem 5. \(\square \)
1.9 The proof of Theorem 7: primal recovery for (4) in Algorithm 2
By the definition of \(\varphi \), we have \(\varphi (\mathbf {y}) := f^{*}(\mathbf {c}- L^{*}\mathbf {y}) = f^{*}(t^{-1}(\mathbf {c}- L^{*}\mathbf {y})) - \nu \ln (t)\) due to the self-concordant logarithmic homogeneity of f. Using the property of the Legendre transformation \(f^{*}\) of f, we can express this function as
We show that the point \(\mathbf {x}^{k+1}\) given by (56) solves this maximization problem. Its optimality condition can be written as
which leads to \(\nabla {f}(\mathbf {x}^{k+1}) = t_{k+1}^{-1}(\mathbf {c}- L^{*}\mathbf {y}^{k+1})\). On the other hand, by the well-known property of f [31], we have \(\mathbf {x}^{k+1} = \nabla {f^{*}}(\nabla {f}(\mathbf {x}^{k+1})) = \nabla {f^{*}}\left( t_{k+1}^{-1}(\mathbf {c}- L^{*}\mathbf {y}^{k+1})\right) \in \mathrm {int}\left( \mathcal {K}\right) \).
Now, we prove (57). Note that \(\mathbf {c}- L^{*}\mathbf {y}^{k+1} - t_{k+1}\nabla {f}(\mathbf {x}^{k+1}) = 0\) and \(\Vert \nabla {f}(\mathbf {x})\Vert _{\mathbf {x}}^{*}\le \sqrt{\nu }\), which leads to
Since \(t_{k+1} \le \varepsilon \), this estimate leads to the first inequality of (57).
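Written out, the estimate used here reads (in the dual local norm at \(\mathbf {x}^{k+1}\)):
\[
\Vert \mathbf {c}- L^{*}\mathbf {y}^{k+1}\Vert _{\mathbf {x}^{k+1}}^{*}
= t_{k+1}\Vert \nabla {f}(\mathbf {x}^{k+1})\Vert _{\mathbf {x}^{k+1}}^{*}
\le t_{k+1}\sqrt{\nu }
\le \sqrt{\nu }\,\varepsilon .
\]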
From (24), there exists \(\mathbf {e}^k\in \mathbb {R}^p\) such that \(\mathbf {e}^k \in \nabla {\varphi }(\mathbf {y}^k) + \nabla ^2{\varphi }(\mathbf {y}^k)(\mathbf {y}^{k+1} - \mathbf {y}^k) + t_{k+1}^{-1}\partial {\psi }(\mathbf {y}^{k+1})\) and \(\Vert \mathbf {e}^k \Vert _{\mathbf {y}^{k}}^{*} \le \delta _k\). This condition leads to
Therefore, we have
To estimate the right-hand side of this inequality, we define \(M_k := \Vert \nabla {\varphi }(\mathbf {y}^{k+1}) - \nabla {\varphi }(\mathbf {y}^k) - \nabla ^2{\varphi }(\mathbf {y}^k)(\mathbf {y}^{k+1} - \mathbf {y}^k)\Vert _{\mathbf {y}^{k+1}}^{*}\). With the same proof as [31, Theorem 4.1.14], we can show that
Here, we use \(\Vert \mathbf {y}^{k+1} - \mathbf {y}^k\Vert _{\mathbf {y}^k} \le \Vert \mathbf {y}^{k+1} - \bar{\mathbf {y}}^{k+1}\Vert _{\mathbf {y}^k} + \Vert \bar{\mathbf {y}}^{k+1} - \mathbf {y}^k\Vert _{\mathbf {y}^k} = \delta (\mathbf {y}^k) + \lambda _{t_{k+1}}(\mathbf {y}^k)\) by the definitions of \(\lambda _{t_{+}}(\mathbf {y})\) in (21) and of \(\delta (\mathbf {y})\) above (27). Substituting (83) into (82) we get
Next, it remains to estimate \(\Vert \mathbf {e}^k\Vert _{\mathbf {y}^{k+1}}^{*}\). Indeed, we have
Using this estimate into (84) and \(\lambda _{t_{k+1}}(\mathbf {y}^k) \le c\sqrt{\beta }(1+c\sqrt{\beta })^{-1}\) from Lemma 4, we obtain
Substituting an upper bound \(\delta _t := \frac{(1-c^2)\beta }{(1+c\sqrt{\beta })^3\left[ 3c\sqrt{\beta } + c^2\beta + (1+c\sqrt{\beta })^3\right] }\) of \(\delta _k\) from Lemma 4 into the last estimate and simplifying the result, we get
where \(\theta (c,\beta )\) is defined as
Using the fact that \(c \in (0, 1)\) and \(0 \le \beta < 0.5(1 + 2c^2 - \sqrt{1 + 4c^2})\), we have \(\theta (c,\beta ) \le 1\). Since \(\nabla {\varphi }(\cdot ) = -L\nabla {f^{*}}( \mathbf {c}-L^{*}(\cdot ) ) = -t_{k+1}^{-1}L\nabla {f^{*}}(t_{k+1}^{-1}(\mathbf {c}-L^{*}(\cdot )))\) due to (48), using (56) we can show that \(\nabla {\varphi }(\mathbf {y}^{k+1}) = t_{k+1}^{-1}L\mathbf {x}^{k+1}\). Plugging this expression into (85) and noting that \(\partial {\psi }(\cdot ) = \partial {g}^{*}(\cdot ) + \mathbf {b}\), we obtain
Let \(\mathbf {s}^{k+1} = \pi _{\partial {g^{*}}(\mathbf {y}^{k+1})}(L\mathbf {x}^{k+1} - \mathbf {b})\) be the projection of \(L\mathbf {x}^{k+1} - \mathbf {b}\) onto \(\partial {g^{*}}(\mathbf {y}^{k+1})\). Then, \(\mathbf {s}^{k+1} \in \partial {g^{*}}(\mathbf {y}^{k+1})\), and hence, \(\mathbf {y}^{k+1} \in \partial {g}(\mathbf {s}^{k+1})\), which shows the second term of (57). Using this relation in the last inequality and the definition of \(\mathbf {s}^{k+1}\), we obtain \(\Vert L\mathbf {x}^{k+1} - \mathbf {b}- \mathbf {s}^{k+1}\Vert _{\mathbf {y}^{k+1}}^{*} \le t_{k+1}\theta (c,\beta )\), which is the third term of (57). Finally, since \(\theta (c,\beta ) \le 1\), we have \(\max \left\{ \sqrt{\nu }, \theta (c,\beta )\right\} = \sqrt{\nu }\). Using (57), we can conclude that \((\mathbf {x}^k, \mathbf {s}^k)\) is an \(\varepsilon \)-solution of (3) if \(\sqrt{\nu }t_k \le \varepsilon \). \(\square \)
Cite this article
Tran-Dinh, Q., Sun, T. & Lu, S. Self-concordant inclusions: a unified framework for path-following generalized Newton-type algorithms. Math. Program. 177, 173–223 (2019). https://doi.org/10.1007/s10107-018-1264-6
Keywords
- Self-concordant inclusion
- Generalized Newton-type methods
- Path-following schemes
- Monotone inclusion
- Constrained convex programming
- Saddle-point problems