1 Introduction

First-order optimization methods, whether stochastic, proximal, accelerated or otherwise, are a popular class of algorithms, especially for the large-scale optimization models that arise in modern machine learning applications. The ease of implementation on distributed architectures and the ability to obtain a reasonably accurate solution quickly are the main reasons for the dominance of first-order methods in machine learning applications. In the last few years, second-order methods based on variants of the Newton method have also been proposed. Second-order methods, such as the Newton method, offer the potential of quadratic convergence rates and scale invariance. Both of these features are highly desirable in optimization algorithms and are not present in first-order methods.

Scale invariance is crucial because it means that the algorithm is not sensitive to the scaling of the input data (see [30] for a thorough discussion of the consequences of scale invariance in machine learning applications). Unfortunately, the conventional Newton method has huge storage and computational demands and does not scale to applications that have both large and dense Hessian matrices. To improve the convergence rates and robustness of the optimization algorithms used in machine learning applications, many authors have recently proposed modifications of the classical Newton method. We refer the interested reader to the recent survey in [4] for a thorough review.

Despite the recent interest and developments, existing approaches for applying second-order methods to machine learning suffer from one or both of the following shortfalls, which we address in this paper.

I: Lack of scale-invariant convergence analysis without restrictive assumptions. The Newton algorithm can be analyzed using the elegant theory of self-concordant functions. Convergence proofs using the theory of self-concordant functions enable the derivation of convergence rates that are independent of the Lipschitz constants and strong convexity parameters of the model [27]. In machine learning applications these constants are related to the input data of the problem (e.g., the dictionary in supervised learning applications), so having a theory that is not affected by the scaling of the data is quite important both in practice (e.g., for the choice of step-sizes) and in theory (rates derived using this approach do not depend on these constants).

II: Lack of global convergence guarantees accompanied by a local super-linear convergence rate without ad-hoc assumptions regarding the spectral properties of the input data. The second major feature of second-order methods is their global theory and their extremely fast convergence rates near the minimizer. Indeed, the analysis of second-order methods should exhibit this feature in order to justify their higher per-iteration and storage costs when compared to first-order methods.

The above shortcomings have significant implications regarding the practical performance of second-order methods in machine learning applications (see [1, 2] for additional discussion). To address the two shortcomings we propose a Newton-based multilevel algorithm that can scale to realistic convex models that arise in large-scale optimization. The method is general in the sense that it does not assume that the objective function is a sum of functions, and we make no assumptions regarding the data regime. Our theoretical analysis is based on the theory of self-concordant functions and we are able to prove a local super-linear convergence rate for the algorithm. Specifically, the super-linear convergence rate is achieved both when the search directions are computed in a deterministic manner and when they are computed in a randomized manner; for the latter case, we prove convergence in probability. We obtain our theoretical results by drawing parallels between the second-order methods used in machine learning and the so-called Galerkin model from the multigrid optimization literature.

1.1 Related Work

In this section we discuss the methods most related to our approach and discuss the theoretical and practical limitations of the current state-of-the-art. We begin with multigrid optimization methods (also known as multilevel methods). The philosophy behind multigrid methods constitutes the basis of our approach.

Multilevel optimization algorithms have been shown to be very efficient for large-scale problems and in many cases outperform other state-of-the-art methods, see for instance [16, 19, 25, 34]. Nash in [25] introduced the MG/Opt method for solving unconstrained convex optimization problems; however, no convergence rates were given in [25]. Furthermore, a “smoothing” step is required when switching between levels, which is inefficient for large-scale applications. In [16], the authors extend the work in [25] to trust-region methods, but the expensive smoothing step is still required. The work in [34] proposes a line-search method with a global linear convergence rate. Additionally, instead of taking the smoothing step, the authors introduce new conditions which produce effective search directions and yield faster iterations compared to [25, 34]. But even the improved conditions in [34] can still be inefficient in practice. The multilevel method most closely related to ours is proposed in [19]. There, the authors show a sub-linear rate that is followed by a local composite (linear-quadratic) convergence rate for strongly convex functions. However, their analysis yields a slow rate in the first phase and they do not prove a super-linear rate.

Similar to multilevel methods, subspace Newton methods have also been proposed to overcome the limitations of the standard Newton method [15, 17, 31]. SDNA [31] randomly selects principal sub-matrices of a matrix \(\textbf{M}\) which is, by assumption, an upper bound on the Hessian matrix. However, the sketch matrices are not applied directly to the Hessian. An extension that overcomes this limitation was proposed in the Randomized Subspace Newton method (RSN) [15], which, as in this work, allows the user to select from a larger set of sketch matrices. Similarly, a stochastic subspace variant of the Cubic Newton method was proposed in [17]. When the regularization parameter of the cubic term is set to zero, the method reduces to our approach and to RSN. Nevertheless, the convergence rates of the above methods are linear.

Several variants of the Newton method based on sub-sampling or sketching have been proposed [2, 3, 6, 7, 10, 22, 30, 32]. Sub-sampled Newton methods are not directly comparable with our approach because they are not reduced-order (or subspace) methods. Nevertheless, we discuss them below due to their well-established theory and fast convergence rates. The authors in [6, 7] introduced sub-sampling techniques for the Hessian matrix of the Newton method; however, they do not provide convergence rates. The work in [10] proves a local composite rate with high probability; however, the objective function must be written as a sum of m functions and, further, the method is efficient only for \(m \gg n\), where n is the problem dimension. The authors in [22] show a local super-linear convergence rate of the sub-sampled Newton method, and convergence is attained even when the batch size is equal to one, but the objective function still needs to be written as a sum of functions. The method proposed in [3] sub-samples not only the Hessian matrix but also the gradient vector. The method enjoys a local super-linear convergence rate and the results are given in expectation. The authors in [32] extended the results in [3] and provide a local super-linear convergence rate with high probability. The method with theoretical results most closely related to ours is Newton Sketch [30]. The authors assume self-concordant functions and thus are able to provide a global and scale-invariant analysis with a local super-linear convergence rate, as we do in this paper. However, the square root of the Hessian matrix must be available. Furthermore, the method has expensive iterations, as it requires matrix multiplications to form the sketched Hessian, which is much slower than randomly selecting data points. In summary, all the above methods fail to address one or both of the shortcomings discussed in the previous section, and when they do address them, further assumptions are required, such as the objective function being a sum of functions or knowledge of the square root of the Hessian matrix.

1.2 Contributions

In view of the related work, in this paper we propose the Self-concordant Iterative-minimization Galerkin-based Multilevel Algorithm (SIGMA) for unconstrained high-dimensional convex programs, and we provide convergence analysis that addresses both shortcomings discussed earlier in this section. Below we discuss in detail the main contributions of this paper. We begin with the main theoretical properties of SIGMA.

The proposed approach is based on the multilevel framework of [19], but instead we assume that the objective function is strictly convex and self-concordant. Similar to [19], we employ the Galerkin model as the target coarse-grained model. We show that the Galerkin model is first- and second-order coherent and we provide connections with the classical Newton method. Such connections are important as they enhance our intuition regarding the subspace methods discussed in the previous section. Moreover, in [19, 34], the authors make use of conditions that yield effective coarse directions; in this work we instead propose alternative conditions that are tailored to self-concordant functions. In Lemma 3.9, we show that, when the Galerkin model is constructed in a deterministic manner, SIGMA enjoys a globally convergent first phase (damped phase) with a linear rate followed by a local composite rate. This result shows that the proposed method can achieve a local super-linear or quadratic convergence rate. The complete convergence behavior, which accounts for the case where the method alternates between SIGMA and the Newton method, is presented in Theorem 3.2. In Lemma 3.5 and Lemma 3.6, we prove a local quadratic convergence rate in the subspace (lower level). A proof of the Armijo-rule criterion is also provided.

We also study the Galerkin model generated by the Nyström method (sampling according to some distribution function) and we show that the coarse direction can be seen as a randomized approximation of the Newton direction. In Theorem 3.3, we prove the local super-linear convergence rate of SIGMA with probability \(1-\rho \). More importantly, the theorem indicates how to construct the random coarse directions (i.e., how to select the sampling distribution) in order to achieve convergence guarantees with high probability and also with probability one. In particular, we consider the uniform distribution without replacement for constructing the coarse direction (conventional SIGMA) and we discuss cases for which \(\rho \) may be small. Moreover, we introduce a new adaptive sampling strategy in order to ensure convergence with probability one. We also introduce a mixed strategy that interpolates between the uniform and adaptive strategies and offers better results than those of the conventional SIGMA method. The idea of introducing new sampling strategies in place of the uniform sampling scheme has been studied in the context of coordinate descent methods [11, 29]. However, to the best of our knowledge, this is the first paper that shows the benefits of alternative sampling schemes for multilevel or subspace methods. Furthermore, by constructing coarse directions using the Nyström method we significantly improve the computational complexity of the algorithm, since the expensive matrix multiplications can be avoided. Under this regime, we identify instances in which the expensive conditions that ensure an effective coarse direction at each iteration are no longer necessary. We note that this is not possible for other multilevel methods such as [19, 34]. Lastly, in Theorem 3.4 we provide a simple convergence result in expectation which completes the picture of the convergence behavior of SIGMA in this regime.

The method is suitable for large-scale optimization. For instance, consider the standard setting in machine learning applications: \(f: \textbf{x} \rightarrow \textbf{Ax}\), where \(\textbf{A} \in \mathbb {R}^{m \times N}\) is the dataset matrix and \(m, N \gg 1\). If we assume that \(n \le N\) (or even \(n \ll N\)) is the dimension of the model at the coarse level, then SIGMA requires \(\mathcal {O}(mn^2)\) operations to form the Hessian, and therefore solving the corresponding system of linear equations to obtain search directions can be carried out efficiently. If the objective function cannot be written as a sum of functions, we show that the per-iteration complexity of SIGMA using the Nyström method is \(\mathcal {O}(n^3 + nN)\), which is a significant improvement over the Newton method that requires \(\mathcal {O}(N^3 + N^2)\) operations. Note also that standard sub-sampled Newton methods require approximately \(\mathcal {O}(N^3)\) operations for solving the system of linear equations, and thus they have the same limitations as the conventional Newton method. Hence, the proposed method significantly improves the applicability of second-order methods in high-dimensional settings. We illustrate this with numerical experiments for solving the problem of maximum likelihood estimation in large-scale machine learning. In particular, numerical experiments on standard benchmark problems against other state-of-the-art second-order methods suggest that the conventional SIGMA compares favorably with the state-of-the-art. For specific problem structures, such as problems with nearly low-rank Hessian matrices, SIGMA is typically several times faster and more robust. In contrast with the sub-sampled Newton methods, which are suitable only when \(m \gg n\), we show that SIGMA is suitable for a wide range of problems and for both regimes \(m<N\) and \(m>N\). We also compare the conventional SIGMA with the variant that employs the new sampling strategies. Numerical results show substantial improvements in the convergence rate of SIGMA using the new sampling strategies (up to four times faster). These numerical results are in line with our improved theoretical results. Hence, the alternative sampling strategies can considerably accelerate the convergence of multilevel or subspace methods, such as [15, 17, 19, 34]. Finally, numerical experiments clearly show that SIGMA achieves super-linear and quadratic convergence rates. In this case too, the practice is in agreement with the theory.

1.3 Notation and Preliminaries

Throughout this paper, all vectors are denoted with bold lowercase letters, i.e., \(\textbf{x} \in \mathbb {R}^n\), and all matrices with bold uppercase letters, i.e., \(\textbf{A} \in \mathbb {R}^{m \times n}\). The function \(\Vert \textbf{x} \Vert _2 = \langle \textbf{x}, \textbf{x} \rangle ^{1/2} \) is the \(\ell _2\)- or Euclidean norm of \(\textbf{x}\). The spectral norm of \(\textbf{A}\) is the norm induced by the \(\ell _2\)-norm on \(\mathbb {R}^n\) and is defined as \(\Vert \textbf{A} \Vert _2:= \max _{\Vert \textbf{x} \Vert _2 = 1} \Vert \textbf{A} \textbf{x} \Vert _2\). It can be shown that \(\Vert \textbf{A} \Vert _2 = \sigma _1(\textbf{A})\), where \(\sigma _1(\textbf{A})\) (or simply \(\sigma _1\)) is the largest singular value of \(\textbf{A}\), see [20, Section 5.6]. For two symmetric matrices \(\textbf{A}\) and \(\textbf{B}\), we write \(\textbf{A} \succeq \textbf{B} \) when \(\textbf{x}^T (\textbf{A} - \textbf{B}) \textbf{x} \ge 0\) for all \(\textbf{x} \in \mathbb {R}^n\), that is, when the matrix \(\textbf{A} - \textbf{B}\) is positive semi-definite. Below we present the main properties and inequalities for self-concordant functions; for a complete analysis see [5, 26]. A univariate convex function \(\phi : \mathbb {R} \rightarrow \mathbb {R}\) is called self-concordant with constant \(M_{\phi } \ge 0\) if and only if

$$\begin{aligned} | \phi '''(x) | \le M_{\phi } \phi ''(x)^{3/2}. \end{aligned}$$
(1)

Examples of self-concordant functions include, but are not limited to, linear and convex quadratic functions, the negative logarithm and the negative log-determinant. Based on the above definition, a multivariate convex function \(f: \mathbb {R}^n \rightarrow \mathbb {R} \) is called self-concordant if \(\phi (t):= f(\textbf{x} + t\textbf{u})\) satisfies Eq. (1) for all \(\textbf{x} \in {\text {dom}} f\), \(\textbf{u} \in \mathbb {R}^n\) and \(t \in \mathbb {R}\) such that \(\textbf{x} +t \textbf{u} \in {\text {dom}} f \). Furthermore, self-concordance is preserved under composition with any affine function. In addition, for any convex self-concordant function \(\tilde{\phi }\) with constant \(M_{\tilde{\phi }} \ge 0\), it can be shown that \(\phi (x):= ({M_{\tilde{\phi }}^2}/{4})\, \tilde{\phi }(x)\) is self-concordant with constant \(M_\phi = 2\). Next, given \(\textbf{x} \in {\text {dom}}f\) and assuming that \(\nabla ^2 f(\textbf{x})\) is positive definite, we can define the following norms

$$\begin{aligned} \Vert \textbf{u} \Vert _\textbf{x}:= \langle \nabla ^2 f(\textbf{x}) \textbf{u}, \textbf{u} \rangle ^{1/2} \ \ \text {and} \ \ \Vert \textbf{v} \Vert _\textbf{x}^*:= \langle [\nabla ^2 f(\textbf{x})]^{-1} \textbf{v}, \textbf{v} \rangle ^{1/2}, \end{aligned}$$
(2)

for which it holds that \( | \langle \textbf{u}, \textbf{v} \rangle | \le \Vert \textbf{u} \Vert _\textbf{x}^* \Vert \textbf{v} \Vert _\textbf{x}\). Therefore, the Newton decrement can be written as

$$\begin{aligned} \lambda _f(\textbf{x}):= \Vert \nabla f(\textbf{x}) \Vert _\textbf{x}^* = \Vert [\nabla ^2 f(\textbf{x})]^{-1/2} \nabla f(\textbf{x}) \Vert _2. \end{aligned}$$
(3)
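For illustration, the decrement in Eq. (3) can be evaluated with a single Cholesky factorization instead of forming \([\nabla ^2 f(\textbf{x})]^{-1/2}\) explicitly. The short NumPy sketch below is ours (the function name is not from the paper) and assumes the Hessian supplied to it is positive definite.

```python
import numpy as np

def newton_decrement(grad, hess):
    """lambda_f(x) = sqrt(grad^T [hess]^{-1} grad), cf. Eq. (3).

    With hess = L L^T (Cholesky), grad^T hess^{-1} grad = ||L^{-1} grad||_2^2,
    so a single triangular solve suffices; hess must be positive definite.
    """
    L = np.linalg.cholesky(hess)
    return np.linalg.norm(np.linalg.solve(L, grad))
```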

In addition, we take into consideration two auxiliary functions, both introduced in [26]. Define the functions \(\omega \) and \(\omega _*\) such that

$$\begin{aligned} \omega (x):= x - \log (1+x) \quad \text {and} \quad \omega _*(x):= -x - \log (1-x), \end{aligned}$$
(4)

where \( {\text {dom}} \omega = \{ x \in \mathbb {R}: x\ge 0 \}\) and \({\text {dom}}\omega _* = \{ x \in \mathbb {R}: 0 \le x < 1 \}\), respectively, and \(\log (x)\) denotes the natural logarithm of x. Moreover, note that both functions are convex and their range is the set of non-negative real numbers. Further, from the definition in Eq. (1), for \(M_{\phi } = 2\), we have that

$$\begin{aligned} \left| \frac{d}{dt} \left( \phi ''(t)^{-1/2} \right) \right| \le 1, \end{aligned}$$

from which, after integration, we obtain the following bounds

$$\begin{aligned} \frac{\phi ''(0)}{(1 + t \phi ''(0)^{1/2})^2} \le \phi ''(t) \le \frac{\phi ''(0)}{(1 - t \phi ''(0)^{1/2})^2}, \end{aligned}$$
(5)

where the lower and upper bounds hold for \(t\ge 0\) and \(t \in [0, \phi ''(0)^{-1/2})\), with \(t \in {\text {dom}}\phi \), respectively (see also [5]). Consider now self-concordant functions on \(\mathbb {R}^n\) and let \( S(\textbf{x}) := \left\{ \textbf{y} \in \mathbb {R}^n: \Vert \textbf{y} - \textbf{x} \Vert _\textbf{x} < 1 \right\} \). For any \(\textbf{x} \in {\text {dom}}f\) and \(\textbf{y} \in S(\textbf{x})\), we have that (see [26])

$$\begin{aligned} (1 - \Vert \textbf{y} - \textbf{x} \Vert _\textbf{x})^2 \nabla ^2 f(\textbf{x}) \preceq \nabla ^2 f(\textbf{y}) \preceq \frac{1}{(1 - \Vert \textbf{y} - \textbf{x} \Vert _\textbf{x})^2} \nabla ^2 f(\textbf{x}). \end{aligned}$$
(6)

Throughout this paper, we refer to notions such as super-linear and quadratic convergence rates, which we now define. Denote by \(\textbf{x}_k\) the iterate generated by an iterative process at the \(k^{\text {th}}\) iteration. The sub-optimality gap of the Newton method for self-concordant functions satisfies the bound \(f(\textbf{x}_k) - f(\textbf{x}^*) \le \lambda _f(\textbf{x}_k)^2\), which holds for \(\lambda _f(\textbf{x}_k) \le 0.68\) [5], and thus one can estimate the convergence rate in terms of the local norm of the gradient. It is known that the Newton method achieves a local quadratic convergence rate. In this setting, we say that a process converges quadratically if \(\lambda _f(\textbf{x}_{k+1}) \le R \lambda _f(\textbf{x}_{k})^2\), for some \(R > 0\). Furthermore, for fixed \(0< R_1 < 1\) and \( R_2 > 0\), a process achieves a composite convergence rate if \(\lambda _f(\textbf{x}_{k+1}) \le R_1 \lambda _f(\textbf{x}_{k}) + R_2 \lambda _f(\textbf{x}_{k})^2\). In addition to quadratic and composite convergence rates, we say that a process converges with a super-linear rate if \(\lambda _f(\textbf{x}_{k+1}) / \lambda _f(\textbf{x}_{k}) \le R(k)\), \(\lambda _f(\textbf{x}_{k}) \ne 0\), for some \(R(k) \downarrow 0\) as \(k \rightarrow \infty \). In particular, we say that the sequence \((\lambda _f(\textbf{x}_{k}))_{k \in \mathbb {N}}\) converges with a Q-super-linear rate and, if \(f(\textbf{x}_k) - f(\textbf{x}^*) \le \lambda _f(\textbf{x}_k)^2\), then the sequence \((f(\textbf{x}_k))_{k \in \mathbb {N}}\) converges with an R-super-linear rate; for more details see [28].

2 Multilevel Models for Unconstrained Optimization

In this section we summarize results for multilevel methods in optimization. Sects. 2.1–2.4 constitute background material and can also be found in [19, 34]. In Sect. 2.5, we describe the Nyström method, which will be used to construct the coarse direction, and we introduce the new sampling strategies. In Sect. 2.6, we introduce the conditions that guarantee effective coarse directions and provide general results for self-concordant functions.

2.1 Problem Framework and Settings

Let \(\textbf{x}_h^*\) be defined as the solution of the following optimization problem,

$$\begin{aligned} \textbf{x}_h^* = \underset{\textbf{x}_h \in \mathbb {R}^N}{{\text {arg \ min}}} f_h(\textbf{x}_h), \end{aligned}$$

where \(f_h: \mathbb {R}^N \rightarrow \mathbb {R}\) is a continuous, differentiable and strictly convex self-concordant function. Further, we suppose that it is bounded below so that a minimizer \(\textbf{x}_h^*\) exists. Below we state our main assumption formally.

Assumption 1

The function \(f_h\) is strictly convex and self-concordant with constant \(M_f = 2\).

Since this work constitutes an extension of the results in [19] to self-concordant functions, we adopt similar notation. We clarify that the subscript h of \(f_h\) denotes the fine or original model we wish to minimize. In this paper, unlike multigrid methods, where a hierarchy of several discretized problems is constructed, we consider only two levels. The model at the lower level (lower dimension) is called the coarse model. Thus, the idea is to use information from the coarse model to solve the fine model. As with h, we use the subscript H to refer to the coarse level and \(f_H\) to refer to the coarse model. Moreover, the dimensions related to the fine and coarse models are denoted by N and n, respectively, that is, \({\text {dom}} f_H = \{ \textbf{x}_H \in \mathbb {R}^n \}\) and \({\text {dom}} f_h = \{ \textbf{x}_h \in \mathbb {R}^N \}\), where \(n \le N\). In traditional multigrid methods the coarse model is typically derived by varying a discretization parameter. In machine learning applications, a coarse model can be derived by varying the number of pixels in image processing applications [13], or by varying the dictionary size and fidelity in statistical pattern recognition (see e.g. [21] for examples in face recognition and background extraction from video).

To “transfer” information from coarse to fine model and vice versa we define \(\textbf{P}\) and \(\textbf{R}\) to be the prolongation and restriction operators, respectively. The matrix \(\textbf{P} \in \mathbb {R}^{N \times n}\) defines a mapping from coarse to fine level and matrix \(\textbf{R} \in \mathbb {R}^{n \times N}\) from fine to coarse. The following assumption on the aforementioned operators is typical for multilevel methods, see for instance [19, 34].

Assumption 2

The restriction and prolongation operators \(\textbf{R}\) and \(\textbf{P}\) are connected via the following relation

$$\begin{aligned} \textbf{P} = \sigma \textbf{R}^T, \end{aligned}$$

where \(\sigma >0\), and \(\textbf{P}\) has full column rank, i.e., \( {\text {rank}}(\textbf{P}) = n.\)

For simplicity and without loss of generality we assume that \(\sigma = 1\). Using the above operators we construct the coarse model as follows. First, let \(\textbf{x}_{h,k}\) be the \(k^{\text {th}}\) iterate with associated gradient \(\nabla f_h (\textbf{x}_{h,k})\). We initialize the coarse model with the initial point \(\textbf{x}_{H,0}:= \textbf{R} \textbf{x}_{h,k}\). Then, the optimization problem at the coarse level, at iteration k, takes the following form

$$\begin{aligned} \underset{\textbf{x}_H \in \mathbb {R}^n}{{\text {min}}} \psi _H(\textbf{x}_H):= f_H(\textbf{x}_H) + \langle \textbf{u}_H, \textbf{x}_H - \textbf{x}_{H,0} \rangle , \end{aligned}$$
(7)

where \(\textbf{u}_H:= \textbf{R} \nabla f_h(\textbf{x}_{h,k}) - \nabla f_H (\textbf{x}_{H,0})\) and \(f_H: \mathbb {R}^n \rightarrow \mathbb {R} \). Note that the objective function above is not simply \(f_H(\textbf{x}_H)\): the term \(\langle \textbf{u}_H, \textbf{x}_H - \textbf{x}_{H,0} \rangle \) is added so that the coarse model is first-order coherent, i.e., it ensures that

$$\begin{aligned} \nabla \psi _H (\textbf{x}_{H,0}) = \textbf{R} \nabla f_h (\textbf{x}_{h,k}). \end{aligned}$$

In addition to the first-order coherency condition, we also assume that the coarse model is second-order coherent, i.e., \( \nabla ^2 \psi _H (\textbf{x}_{H,0}) = \textbf{R} \nabla ^2 f_h (\textbf{x}_{h,k}) \textbf{P}\). Later we discuss how the so-called Galerkin model satisfies both first- and second-order coherency conditions.
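As a small numerical illustration of Eq. (7), the sketch below (our own; the quadratic \(f_H\) is only a stand-in) builds the correction term \(\textbf{u}_H\) and checks that the first-order coherency condition \(\nabla \psi _H (\textbf{x}_{H,0}) = \textbf{R} \nabla f_h (\textbf{x}_{h,k})\) holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 8, 3
R = rng.standard_normal((n, N))            # restriction operator (fine -> coarse)
grad_fh_xk = rng.standard_normal(N)        # stands in for grad f_h(x_{h,k})
xH0 = R @ rng.standard_normal(N)           # x_{H,0} := R x_{h,k}

# Any smooth coarse objective f_H works; here a generic quadratic stand-in.
A_H, b_H = rng.standard_normal((n, n)), rng.standard_normal(n)
grad_fH = lambda xH: A_H @ xH + b_H

# Coarse model of Eq. (7): psi_H(x_H) = f_H(x_H) + <u_H, x_H - x_{H,0}>
u_H = R @ grad_fh_xk - grad_fH(xH0)
grad_psiH = lambda xH: grad_fH(xH) + u_H

# First-order coherence: grad psi_H(x_{H,0}) = R grad f_h(x_{h,k})
assert np.allclose(grad_psiH(xH0), R @ grad_fh_xk)
```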

2.2 The Multilevel Iterative Scheme

The philosophy behind multilevel algorithms is to make use of the coarse model Eq. (7) to provide the search directions. Such a direction is called a coarse direction. If the search direction is computed using the original model then it will be called a fine direction.

To obtain the coarse direction we first compute

$$\begin{aligned} \hat{\textbf{d}}_{H,k}:= \textbf{x}_H^* - \textbf{x}_{H,0}, \end{aligned}$$
(8)

where \(\textbf{x}_H^*\) is the minimizer of Eq. (7), and then we move to the fine level by applying the prolongation operator

$$\begin{aligned} \hat{\textbf{d}}_{h,k}:= \textbf{P} (\textbf{x}_H^* - \textbf{x}_{H,0}). \end{aligned}$$
(9)

Note that the difference in the subscripts in the above definitions is because \(\hat{\textbf{d}}_{H,k} \in \mathbb {R}^n\) while \(\hat{\textbf{d}}_{h,k} \in \mathbb {R}^N\). We also clarify that \(\textbf{d}_{h,k}\), i.e., with the “hat” omitted, refers to the fine direction. The update rule of the multilevel scheme is

$$\begin{aligned} \textbf{x}_{h,k+1} = \textbf{x}_{h,k} + t_{h,k} \hat{\textbf{d}}_{h,k}, \end{aligned}$$
(10)

where \(t_{h,k}>0\) is the step-size parameter.

It has been shown in [34] that the coarse direction is a descent direction. However, this result does not suffice for \(\hat{\textbf{d}}_{h,k}\) to always yield a reduction in the value of the objective function. By the first-order coherency condition, it is easy to see that when \(\nabla f_h (\textbf{x}_{h,k}) \ne 0\) and \(\nabla f_h (\textbf{x}_{h,k}) \in {\text {null}}(\textbf{R})\) (i.e., \(\textbf{R} \nabla f_h (\textbf{x}_{h,k}) = 0\)) we have \(\textbf{x}_H^* - \textbf{x}_{H,0} = 0\), and thus \(\hat{\textbf{d}}_{h,k} = 0\), which implies no progress for the multilevel scheme Eq. (10). To overcome this issue we may replace the coarse direction \(\hat{\textbf{d}}_{h,k}\) with the fine direction \(\textbf{d}_{h,k}\) when the former is ineffective. Examples of fine directions include search directions arising from the Newton, quasi-Newton and gradient descent methods. This approach is very common in the multigrid literature for solving PDEs [16, 23, 34]. The following conditions, proposed in [34], determine whether or not the fine direction should be employed, i.e., we use \(\textbf{d}_{h,k}\) when

$$\begin{aligned} \left\| \textbf{R} \nabla f_h (\textbf{x}_{h,k}) \right\| _2 \le \mu \left\| \nabla f_h (\textbf{x}_{h,k}) \right\| _2 \ \ \text {or} \ \ \left\| \textbf{R} \nabla f_h (\textbf{x}_{h,k}) \right\| _2 \le \epsilon , \end{aligned}$$
(11)

where \(\mu \in (0, {\text {min}}(1, \Vert \textbf{R} \Vert _2))\) and \(\epsilon \in (0, 1)\) is a tolerance. These conditions need to be checked at each iteration; they prevent the use of the coarse direction when \(\textbf{R} \nabla f_h (\textbf{x}_{h,k}) = 0\) while \(\nabla f_h (\textbf{x}_{h,k}) \ne 0\), or when \(\textbf{x}_{H,0}\) is already sufficiently close to the solution \(\textbf{x}_{H}^*\). In PDE problems, alternating between the coarse and the fine direction is necessary for obtaining the optimal solution. In Sect. 3.5, we show that it is possible to select the prolongation operator such that the fine direction is never taken, yet the algorithm still achieves a super-linear rate with probability one.

2.3 Coarse Model and Variable Metric Methods

In this section we discuss connections between the multilevel and variable metric methods, see also [19]. The descent direction of a standard variable metric method is given by,

$$\begin{aligned} \begin{aligned} \textbf{d}_{h,k}&= \underset{\textbf{d} \in \mathbb {R}^N }{{\text {arg \ min}}} \left\{ \frac{1}{2} \Vert \textbf{Q}^{1/2} \textbf{d} \Vert _2^2 + \langle \nabla f_h (\textbf{x}_{h,k}), \textbf{d} \rangle \right\} \\&= - \textbf{Q}^{-1} \nabla f_h (\textbf{x}_{h,k}), \end{aligned} \end{aligned}$$
(12)

where \(\textbf{Q} \in \mathbb {R}^{N\times N}\) is a positive definite matrix. If, for instance, \(\textbf{Q} = \nabla ^2 f_h (\textbf{x}_{h,k})\) we obtain the Newton method. If \(\textbf{Q}\) is chosen as the identity matrix we obtain the steepest descent method. Based on the construction of the coarse model in Eq. (8), we define \(f_H\) to be a quadratic model as follows

$$\begin{aligned} f_H(\textbf{x}_H):= \frac{1}{2} \left\| \textbf{Q}_H^{1/2} (\textbf{x}_H - \textbf{x}_{H,0}) \right\| _2^2, \end{aligned}$$

where \(\textbf{x}_{H,0} = \textbf{R} \textbf{x}_{h,k} \) and \(\textbf{Q}_H \in \mathbb {R}^{n \times n}\) is a positive definite matrix. Then, the coarse model Eq. (7) takes the following form

$$\begin{aligned} \underset{\textbf{x}_H \in \mathbb {R}^n }{{\text {min}}} \psi _H (\textbf{x}_H) = \underset{\textbf{x}_H \in \mathbb {R}^n }{{\text {min}}} \left\{ \frac{1}{2} \left\| \textbf{Q}_H^{1/2} (\textbf{x}_H - \textbf{x}_{H,0}) \right\| _2^2 + \langle \textbf{R} \nabla f_h(\textbf{x}_{h,k}), \textbf{x}_H - \textbf{x}_{H,0} \rangle \right\} .\nonumber \\ \end{aligned}$$
(13)

By Eq. (8), and since Eq. (13) has a closed-form solution of the same form as Eq. (12), we obtain

$$\begin{aligned} \begin{aligned} \hat{\textbf{d}}_{H,k}&= \underset{\textbf{d}_H \in \mathbb {R}^n }{{\text {arg \ min}}} \left\{ \frac{1}{2} \Vert \textbf{Q}_H^{1/2} \textbf{d}_{H} \Vert _2^2 + \langle \textbf{R} \nabla f_h (\textbf{x}_{h,k}), \textbf{d}_{H} \rangle \right\} \\&= - \textbf{Q}_H^{-1} \textbf{R} \nabla f_h (\textbf{x}_{h,k}). \end{aligned} \end{aligned}$$

Further, applying the prolongation operator as in Eq. (9), we obtain the coarse direction

$$\begin{aligned} \hat{\textbf{d}}_{h,k} = - \textbf{P}\textbf{Q}_H^{-1} \textbf{R} \nabla f_h (\textbf{x}_{h,k}). \end{aligned}$$
(14)

Note that if we naively set \(n=N\) and \(\textbf{P} = \textbf{I}_{N \times N}\) we recover exactly Eq. (12).

2.4 The Galerkin Model

Here we present the Galerkin model, which will be used to provide improved convergence results. The Galerkin model was introduced in multigrid algorithms for optimization in [34], where it was tested experimentally and found to compare favorably with other approaches for constructing coarse-grained models. The Galerkin model can be considered a special case of the coarse model Eq. (13) under a specific choice of the matrix \(\textbf{Q}_H\). In particular, let \(\textbf{Q}_H\) be as follows

$$\begin{aligned} \textbf{Q}_H(\textbf{x}_{h,k}):= \textbf{R} \nabla ^2 f_h(\textbf{x}_{h,k}) \textbf{P}. \end{aligned}$$
(15)

Below we present the Galerkin model and show links with the (randomized) Newton method (see also [19, 34]). We start by showing the positive definiteness of \(\textbf{Q}_H\).

Lemma 2.1

Let \(f_h: \mathbb {R}^N \rightarrow \mathbb {R}\) satisfy Assumption 1 and suppose that Assumption 2 also holds. Then, the matrix \(\textbf{Q}_H(\textbf{x}_{h, k})\) is positive definite.

Proof

This is a direct result of Assumption 2 and \(\nabla ^2 f_h(\textbf{x}_{h}) \succ 0\). \(\square \)

Using the definition Eq. (15) in the coarse model Eq. (13) one can obtain the Galerkin model

$$\begin{aligned} \underset{\textbf{x}_H \in \mathbb {R}^n }{{\text {min}}} \psi _H (\textbf{x}_H) := \frac{1}{2} \left\| [\textbf{Q}_H(\textbf{x}_{h,k})]^{1/2} (\textbf{x}_H - \textbf{x}_{H,0}) \right\| _2^2 + \langle \textbf{R} \nabla f_h(\textbf{x}_{h,k}), \textbf{x}_H - \textbf{x}_{H,0} \rangle , \end{aligned}$$
(16)

where, by Lemma 2.1, the Galerkin model satisfies Assumption 1, and, since \(\textbf{Q}_H(\textbf{x}_{h,k})\) is invertible, Eq. (16) has a closed-form solution. That is, we can derive \(\hat{\textbf{d}}_{H,k}\) and \(\hat{\textbf{d}}_{h,k}\) as shown below

$$\begin{aligned} \hat{\textbf{d}}_{H,k} = - [\textbf{R} \nabla ^2 f_h(\textbf{x}_{h,k}) \textbf{P}]^{-1} \textbf{R} \nabla f_h (\textbf{x}_{h,k}), \end{aligned}$$
(17)

and then we prolongate \(\hat{\textbf{d}}_{H,k}\) to obtain the coarse direction

$$\begin{aligned} \hat{\textbf{d}}_{h,k}:= \textbf{P} \hat{\textbf{d}}_{H,k} = - \textbf{P} [\textbf{R} \nabla ^2 f_h(\textbf{x}_{h,k}) \textbf{P}]^{-1} \textbf{R} \nabla f_h (\textbf{x}_{h,k}). \end{aligned}$$
(18)

Observe also that Eq. (17) is equivalent to solving the following linear system

$$\begin{aligned} \textbf{Q}_H(\textbf{x}_{h,k}) \textbf{d}_{H} = - \textbf{R} \nabla f_h (\textbf{x}_{h,k}), \end{aligned}$$
(19)

which, by the positive-definiteness of \(\textbf{Q}_H(\textbf{x}_{h,k})\), has a unique solution. At first glance it may seem that computing \(\hat{\textbf{d}}_{h,k}\) requires forming the full Hessian matrix. However, we discuss in Remark 2.1 how the computation of \(\textbf{Q}_H(\textbf{x}_{h,k})\) can be carried out in \(\mathcal {O}(nN)\) operations, or \(\mathcal {O}(N)\) operations in parallel, and thus computing \(\hat{\textbf{d}}_{h,k}\) requires a further \(\mathcal {O}(n^3)\) operations. Using the Galerkin model in Eq. (16) one ensures that

$$\begin{aligned} \nabla ^2 \psi _H (\textbf{x}_{H,0}) = \textbf{R} \nabla ^2 f_h (\textbf{x}_{h,k}) \textbf{P}, \end{aligned}$$

i.e., the second-order coherency condition is satisfied. Similar to Eq. (14), for \(\textbf{P} = \textbf{I}_{N \times N}\) we obtain the Newton direction. If \(\textbf{P}\) is a random matrix we obtain a randomized Newton method.

2.5 The Nyström Method

In this section we discuss the Nyström method for the low-rank approximation of a positive-definite matrix. We show connections with the Galerkin model, and how to construct the prolongation and restriction operators. The Nyström method builds a rank-n approximation of a positive definite matrix \(\textbf{A} \in \mathbb {R}^{N \times N}\) as follows (see [9] for an introduction and [19] for a similar analysis)

$$\begin{aligned} \textbf{A} \approx \textbf{A}_n = \textbf{A} \textbf{Y} \left( \textbf{Y}^T \textbf{A} \textbf{Y} \right) ^{-1} \textbf{Y}^T \textbf{A}, \end{aligned}$$
(20)

where \(\textbf{Y} \in \mathbb {R}^{N \times n}\) and \({\text {rank}}(\textbf{Y}) = n < N\). To see the connection between the Galerkin model and the Nyström method, set \(\textbf{A} = \nabla ^2 f(\textbf{x}_{h,k})\) and \(\textbf{Y} = \textbf{P}\) in Eq. (20) and multiply from the left and from the right by \( \left[ \nabla ^2 f(\textbf{x}_{h,k}) \right] ^{-1}\). Then,

$$\begin{aligned} \left[ \nabla ^2 f(\textbf{x}_{h,k}) \right] ^{-1} \approx \textbf{P} \left( \textbf{R} \nabla ^2 f(\textbf{x}_{h,k}) \textbf{P} \right) ^{-1} \textbf{R}. \end{aligned}$$

Thus, when \(\textbf{P} \left( \textbf{R} \nabla ^2 f(\textbf{x}_{h,k}) \textbf{P} \right) ^{-1} \textbf{R}\) is a good approximation of \(\left[ \nabla ^2 f(\textbf{x}_{h,k}) \right] ^{-1}\), we expect \(\textbf{d}_{h,k} \approx \varvec{\hat{d}}_{h,k}\), i.e., the coarse direction based on the Galerkin model generated through the Nyström method is a good approximation of the Newton direction. For random sampling techniques dedicated to the choice of \(\textbf{P}\), see [14, 33, 35]. In this work we are interested in cases where samples are drawn according to some discrete probability distribution \(\textbf{p}\). Formally, we construct the prolongation and restriction operators as described below.

Definition 2.1

Let \(S_N = \left\{ 1,2, \ldots , N \right\} \) and let \(S_n \subset S_N\) be a set of \(n < N\) elements randomly selected from \(S_N\) without replacement according to \(\textbf{p}\). Further, let \(s_i\) denote the \(i^{\text {th}}\) element of \(S_n\). Then the prolongation operator \(\textbf{P}\) is generated as follows: the \(i^{\text {th}}\) column of \(\textbf{P}\) is the \(s_i^{\text {th}}\) column of \(\textbf{I}_{N \times N}\) and, further, it holds that \(\textbf{R} = \textbf{P}^T\).

The above definition of \(\textbf{P}\) and \(\textbf{R}\) clearly satisfies Assumption 2. If we consider iteration-dependent operators \(\textbf{R}_k\), then, effectively, the above definition indicates that the scheme in Eq. (10) will update n out of the N entries of \(\textbf{x}_{h,k}\) at each iteration. The selection is decided according to the assigned probability distribution. The simplest way to construct \(\textbf{P}\) is by uniform sampling, in which case all entries of \(\textbf{x}_{h,k}\) have the same probability of being selected. However, although the uniform distribution is preferred in most randomized methods (see for instance [6, 10, 15, 17, 19]), in this work we also consider alternative sampling techniques. Let \(p_{i, k}\), \(i = 1, \ldots , N\), be the probability of selecting the \(i^{\text {th}}\) column of \(\textbf{I}_{N \times N}\) at iteration k, such that \(\sum _{i=1}^{N} p_{i,k} = 1\). Below we specify the three different sampling strategies that we consider in this paper.

  • Uniform distribution: \(p_{i, k} = \frac{1}{N}\), which ensures that all entries of \(\textbf{x}_{h,k}\) will be updated equally. In this case the Nyström method is commonly referred to as naive.

  • Adaptive distribution: \(p_{i, k} = \frac{|g_i|}{\sum _{i=1}^N |g_i|}\), where \(g_i = \frac{\partial {f(\textbf{x}_{h,k})}}{\partial {x_i}}\). Using the adaptive distribution we assign larger probability to coordinates whose partial derivatives are large. Thus, the entries of \(\textbf{x}_{h,k}\) for which \(g_i = 0\) will never be selected.

  • Mixed distribution: \(p_{i, k} = (1-\tau ) \frac{1}{N} + \tau \frac{|g_i|}{\sum _{i=1}^N |g_i|}\), where \(\tau \in (0,1)\). The mixed strategy effectively interpolates between the uniform and adaptive schemes.

The idea of adaptive sampling strategies has been proposed previously and shown to yield significant improvements in the convergence rates of coordinate descent methods [11, 29]. Here, we will use the adaptive and mixed strategies to significantly improve the theoretical and numerical results of multilevel and subspace methods that rely on a uniform sampling strategy.
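A minimal sketch of the three sampling schemes is given below (the function and parameter names are ours); the returned index set plays the role of \(S_n\) in Definition 2.1, and the adaptive scheme assumes that at least n partial derivatives are nonzero.

```python
import numpy as np

def sample_coarse_indices(grad, n, strategy="uniform", tau=0.5, rng=None):
    """Draw n of N coordinates without replacement according to p_{i,k}."""
    rng = np.random.default_rng() if rng is None else rng
    N = grad.size
    uniform = np.full(N, 1.0 / N)
    adaptive = np.abs(grad) / np.sum(np.abs(grad))    # p_{i,k} proportional to |g_i|
    if strategy == "uniform":
        p = uniform
    elif strategy == "adaptive":
        p = adaptive
    else:                                             # "mixed": interpolate the two
        p = (1.0 - tau) * uniform + tau * adaptive
    return rng.choice(N, size=n, replace=False, p=p)
```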

Moreover, when \(\textbf{P}\) is chosen randomly, the direction \(\hat{\textbf{d}}_{h,k}\) computed in Eq. (18) is also random. Thus, the randomness of \(\textbf{P}\) implies randomness in the step-size \(t_{h,k}\) and in the iterates \(\textbf{x}_{h,k}\) of Eq. (10). If at iteration k the prolongation matrix \(\textbf{P}_k\) is used, then the algorithm generates a sequence of random variables \(M_k(\omega ) = (\textbf{x}_{h,k}(\omega ), t_{h,k}(\omega ), \hat{\textbf{d}}_{h,k}(\omega ), \textbf{P}_k(\omega ))\). We use \(\mathcal {F}_k:= \sigma (M_0, M_1, \ldots , M_k)\) to denote the \(\sigma \)-algebra generated by Eq. (10) up to iteration k. Below we describe how to efficiently compute the reduced Hessian matrix in Eq. (15) given the above definition.

Remark 2.1

We note that it is expensive to first compute the full Hessian matrix and then form the reduced Hessian matrix \(\textbf{Q}_H(\textbf{x}_{h,k})\), as this requires \(\mathcal {O}(N^2)\) operations. Instead, using the definition of \(\textbf{P}\), one may first compute the product \(\nabla f(\textbf{x})^T \textbf{P} \) by sampling n of the N entries of the gradient vector. Then, for \(\textbf{P} = [\textbf{p}_1 \ \textbf{p}_2 \cdots \textbf{p}_n]\), computing the gradient of \(\nabla f(\textbf{x})^T \textbf{p}_i\), \(i = 1, \ldots , n\), requires \(\mathcal {O}(N)\) operations each, and thus, in total, \(\mathcal {O}(nN)\) operations are required to compute the product \( \tilde{\textbf{P}}:= \nabla ^2 f(\textbf{x}) \textbf{P} \). We then form the \(n \times n\) reduced Hessian matrix in Eq. (15) by sampling rows of the matrix \(\tilde{\textbf{P}}\) according to Definition 2.1. Therefore, the total per-iteration cost of forming the reduced Hessian matrix is \(\mathcal {O}(nN)\). In addition, solving the linear system in Eq. (19) to obtain the coarse direction requires \(\mathcal {O}(n^3)\) operations. Further, note that for \( i = 1, \ldots , n \) the computations of \( \nabla ^2 f(\textbf{x}) \textbf{p}_i\) can be carried out independently, resulting in \(\mathcal {O}(N)\) operations in parallel, which is the same as the complexity of forming the gradient. As a result, the total per-iteration cost of the proposed method using the naive Nyström method is \(\mathcal {O}(n^3 + (1+n)N) \approx \mathcal {O}(n^3 +nN)\), or \(\mathcal {O}(n^3 + 2N)\) when performing the computations in parallel. The technique described above assumes the use of an automatic differentiation routine to compute the Hessian-vector products; it is explained in detail in [8].
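Following Remark 2.1, and assuming access to a Hessian-vector product oracle hvp(v) that returns \(\nabla ^2 f(\textbf{x})\textbf{v}\) (e.g., via automatic differentiation), the coarse step can be computed without ever forming the full Hessian. The sketch below is our own: since each column of \(\textbf{P}\) is a column of the identity, \(\nabla ^2 f(\textbf{x})\textbf{p}_i\) is simply the \(s_i^{\text {th}}\) column of the Hessian, and the reduced Hessian is obtained by keeping only the sampled rows of \(\tilde{\textbf{P}}\).

```python
import numpy as np

def coarse_direction_nystrom(grad, hvp, idx):
    """Coarse direction of Eq. (18) when P samples columns of the identity.

    grad : gradient of f_h at the current iterate, shape (N,)
    hvp  : callable v -> Hessian(x) @ v (Hessian-vector product oracle)
    idx  : array with the n sampled coordinates (the set S_n of Definition 2.1)
    """
    N, n = grad.size, len(idx)
    P_tilde = np.empty((N, n))
    for j, s in enumerate(idx):              # n Hessian-vector products: O(nN) total
        e = np.zeros(N)
        e[s] = 1.0
        P_tilde[:, j] = hvp(e)               # column s of the Hessian
    QH = P_tilde[idx, :]                     # reduced Hessian of Eq. (15): R (H P)
    dH = np.linalg.solve(QH, -grad[idx])     # n x n system of Eq. (19): O(n^3)
    d_hat = np.zeros(N)
    d_hat[idx] = dH                          # prolongation: scatter back into R^N
    return d_hat
```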

Building a rank-n approximation of the Hessian matrix using the Nyström method provides an inexpensive way of constructing the coarse direction in Eq. (18), since sampling without replacement significantly reduces the total computational complexity (recall that the total per-iteration cost of the Newton method is \(\mathcal {O}(N^2 + N^3)\)). In addition, in terms of complexity, the proposed method has two main advantages over sub-sampled Newton methods. First, sub-sampled Newton methods require approximately \(\mathcal {O}(N^3)\) operations for computing the search direction, which means that, when N is large, sub-sampled methods have the same limitations as the conventional Newton method. Second, they offer an improved per-iteration complexity only when the objective function is written as a sum of functions. On the other hand, SIGMA is able to overcome the limitations associated with the Hessian matrix, since the computations are performed at the coarse level, and, in addition, it offers an improved per-iteration complexity without requiring the objective to be written as a sum of functions. These advantages significantly increase the applicability of multilevel methods in comparison to the sub-sampled or sketched Newton methods in machine learning and large-scale optimization problems.

The convergence analysis with \(\textbf{P}\) constructed as in Definition 2.1 and the three sampling regimes is given in Sect. 3.5. The above definition of the prolongation operator will be used for the numerical experiments in Sect. 4.

2.6 Fine Search Direction

As discussed in the previous section, the conditions in Eq. (11) guarantee the progress of the multilevel method by using the fine direction \(\textbf{d}_{h,k}\) in place of the coarse direction \(\hat{\textbf{d}}_{h,k}\) when the latter appears to be ineffective. Since in this work we consider self-concordant functions, we propose alternative conditions to Eq. (11). In particular, we replace the standard Euclidean norm with the norms defined by the matrices \(\textbf{Q}_H(\textbf{x}_{h,k})\) and \(\nabla ^2 f_h(\textbf{x}_{h,k})\). We begin by defining the approximate decrement, a quantity analogous to the Newton decrement in Eq. (3):

$$\begin{aligned} \begin{aligned} \hat{\lambda }_{f_{h}}(\textbf{x}_{h,k})&:= \left[ (\textbf{R} \nabla f_h (\textbf{x}_{h,k}))^T [\textbf{Q}_H(\textbf{x}_{h,k})]^{-1} \textbf{R} \nabla f_h (\textbf{x}_{h,k}) \right] ^{1/2}.\\ \end{aligned} \end{aligned}$$
(21)

We clarify that, for the rest of this paper, unless specified otherwise, the fine direction is taken to be the Newton direction, \(\textbf{d}_{h,k} = - [\nabla ^2 f_h(\textbf{x}_{h,k})]^{-1} \nabla f_h (\textbf{x}_{h,k}) \), and, in addition, for simplicity, we omit the subscript \(f_h\) from both the approximate and the Newton decrement. Note also that the approximate and Newton decrements can be rewritten as

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k}):= \left\| \textbf{R} \nabla f_h (\textbf{x}_{h,k}) \right\| _{[\textbf{Q}_H(\textbf{x}_{h,k})]^{-1}} \quad \text {and} \quad \lambda (\textbf{x}_{h,k}):= \left\| \nabla f_h (\textbf{x}_{h,k}) \right\| _{[\nabla ^2 f_h(\textbf{x}_{h,k})]^{-1}}, \end{aligned}$$
(22)

respectively, where, by the positive-definiteness of \(\textbf{Q}_H(\textbf{x}_{h,k})\) and \(\nabla ^2 f_h(\textbf{x}_{h,k})\), both norms are well-defined, and they serve the same purpose as \(\left\| \textbf{R} \nabla f_h (\textbf{x}_{h,k}) \right\| _2\) and \(\left\| \nabla f_h (\textbf{x}_{h,k}) \right\| _2\), respectively. The new conditions are presented below; they are useful when minimizing self-concordant functions.

Condition 1

Let \(\mu \in (0,1)\) and \(\nu \in (0,1)\). The iterative scheme Eq. (10) employs the coarse direction Eq. (18) if

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k})> \mu \lambda (\textbf{x}_{h,k}) \quad \ \ \text {and} \quad \ \ \hat{\lambda }(\textbf{x}_{h,k}) > \nu . \end{aligned}$$

The above condition is analogous to the original conditions in Eq. (11) and thus it prevents the use of the coarse direction when \(\hat{\lambda }(\textbf{x}_{h,k}) = 0\) while \(\lambda (\textbf{x}_{h,k}) \ne 0\), and also when \(\textbf{x}_{H,0} = \textbf{x}_{H}^*\). In multilevel methods, \(\mu \) is a user-defined parameter that determines whether the algorithm selects the coarse or the fine step. The following lemma gives insight into how to select \(\mu \) such that the coarse direction is always taken as long as \(\hat{\lambda }(\textbf{x}_{h,k})\) is positive.

Lemma 2.2

Suppose \(\hat{\lambda }(\textbf{x}_{h,k}) > \nu \) for some \(\nu \in (0,1)\). Then, for any \(\mu \in (0, \min \{ 1, \frac{\nu }{\lambda (\textbf{x}_{h,0})} \} )\) we have that

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k}) > \mu \lambda (\textbf{x}_{h,k}), \end{aligned}$$

for any \(k \in \mathbb {N}\).

The above result verifies our intuition that coarse steps are more likely to be taken when \(\mu \) is selected sufficiently small. Note that \(\mu < \frac{\nu }{\lambda (\textbf{x}_{h,0})}\) is a sufficient condition for the coarse step to always be taken; nevertheless, it does not identify all the values of \(\mu \) for which Condition 1 holds, since there might exist \(\mu \in (\frac{\nu }{\lambda (\textbf{x}_{h,0})}, 1) \) such that \( \hat{\lambda }(\textbf{x}_{h,k}) > \mu \lambda (\textbf{x}_{h,k})\) remains true. In addition, as a corollary of Lemma 2.2, one can select some \(r \in \mathbb {N}\) and \( \mu < \frac{\nu }{\lambda (\textbf{x}_{h,r})}\), which ensures that the coarse step is always taken for all \(k \ge r\).

Lemma 2.3

For any \(k \in \mathbb {N}\) the approximate decrement in Eq. (21) is bounded as follows

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k}) \le \lambda (\textbf{x}_{h,k}), \end{aligned}$$

where \(\lambda (\textbf{x}_{h,k})\) is the Newton decrement in Eq. (3).

The above result shows that \(\hat{\lambda }(\textbf{x}_{h,k})\) can be at most as large as \(\lambda (\textbf{x}_{h,k})\). Therefore, if the user-defined parameter \(\mu \) is selected larger than one, then SIGMA will perform fine steps only.
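For concreteness, the sketch below implements the approximate decrement of Eq. (21) and the test of Condition 1 under Definition 2.1, where \(\textbf{R}\nabla f_h(\textbf{x}_{h,k})\) reduces to the sampled gradient entries; the reduced Hessian QH can be formed as in the sketch of Sect. 2.5, lam stands for the Newton decrement \(\lambda (\textbf{x}_{h,k})\), and the default values of \(\mu \) and \(\nu \) are purely illustrative.

```python
import numpy as np

def approximate_decrement(grad, QH, idx):
    """hat-lambda of Eq. (21): || R grad f_h ||_{[Q_H]^{-1}}, with R grad f_h = grad[idx]."""
    rg = grad[idx]
    return np.sqrt(rg @ np.linalg.solve(QH, rg))

def take_coarse_step(lam_hat, lam, mu=0.1, nu=1e-3):
    """Condition 1: employ the coarse direction iff both inequalities hold."""
    return lam_hat > mu * lam and lam_hat > nu
```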

3 SIGMA: Convergence Analysis

In this section we provide the convergence analysis of SIGMA for strictly convex self-concordant functions. Unless mentioned otherwise, throughout this section we assume that Assumption 1 and Assumption 2 always hold, and that the method alternates between coarse and fine steps according to Condition 1. Furthermore, we provide two theoretical results that hold (i) for any \(\textbf{P}\) that satisfies Assumption 2 (see Sect. 3.3), and (ii) when \(\textbf{P}\) is selected randomly at each iteration as in Definition 2.1. In the latter scenario, we show probabilistic convergence results as well as convergence in expectation (see Sect. 3.5). In both cases, we prove that SIGMA achieves a local super-linear convergence rate. The idea of the proof is similar to that of the classical Newton method, where convergence is split into two phases. The full algorithm, including the step-size strategy, is specified in Algorithm 1. Technical proofs are relegated to the appendix.

Remark 3.1

We make an important remark regarding the practical implementation of Algorithm 1. Notice that checking the condition \(\hat{\lambda }(\textbf{x}_{h,k}) > \mu \lambda (\textbf{x}_{h,k})\) at each iteration is inefficient, as it requires computing the expensive Newton decrement. One can instead compare gradient norms in the Euclidean norm, as in Eq. (11), since these serve the same purpose and are cheap to compute. Furthermore, Lemma 2.2 suggests that when \(\mu \) is selected sufficiently small it is likely that only coarse steps will be taken. However, note that we make no assumptions about the coarse model beyond what has already been discussed. For example, the fine model dimension N could be very large, while n could be as small as one. In such an extreme case, performing only coarse steps yields slow progress of the multilevel algorithm and hence, to avoid slow convergence, a larger value of \(\mu \) is required. In Sect. 3.5 we show that, when \(\textbf{P}_k\) is as in Definition 2.1 and for specific problem structures, SIGMA can reach highly accurate solutions without ever using fine correction steps. Finally, for our theoretical results we take the Newton direction as the fine direction; nevertheless, in practice, \(\textbf{d}_{h,k}\) can be a direction arising from variable metric methods (see Sect. 2.3).

3.1 Globally Convergent First-Phase

Algorithm 1: SIGMA
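Since the listing of Algorithm 1 is given as a figure, the sketch below is our own reconstruction of the main loop from the surrounding text: it alternates between the coarse direction of Eq. (18) and the Newton direction according to Condition 1, uses a backtracking Armijo line search with parameters \(\alpha \in (0, 1/2)\) and \(\beta \in (0,1)\), and exits via the test discussed after Lemma 3.3. For clarity it uses dense linear algebra and the exact decrements (see Remark 2.1 and Remark 3.1 for the practical shortcuts), and all default parameter values are illustrative rather than the authors' specification.

```python
import numpy as np

def sigma(f, grad_f, hess_f, x0, sample_idx, mu=0.1, nu=1e-3,
          eps=1e-8, alpha=0.25, beta=0.5, max_iters=100):
    """Hedged reconstruction of the SIGMA loop (Algorithm 1); not the original listing."""
    x = x0.copy()
    for _ in range(max_iters):
        g, H = grad_f(x), hess_f(x)
        idx = sample_idx(g)                               # S_n as in Definition 2.1
        QH = H[np.ix_(idx, idx)]                          # reduced Hessian, Eq. (15)
        rg = g[idx]
        lam_hat = np.sqrt(rg @ np.linalg.solve(QH, rg))   # approximate decrement, Eq. (21)
        lam = np.sqrt(g @ np.linalg.solve(H, g))          # Newton decrement, Eq. (3)
        if lam_hat > mu * lam and lam_hat > nu:           # Condition 1: coarse step
            d = np.zeros_like(x)
            d[idx] = np.linalg.solve(QH, -rg)             # Eqs. (18)-(19)
            dec = lam_hat
        else:                                             # fine (Newton) step
            d = np.linalg.solve(H, -g)
            dec = lam
        if dec ** 2 < eps:                                # exit test, cf. Lemma 3.3
            break
        t = 1.0                                           # backtracking Armijo rule
        while f(x + t * d) > f(x) - alpha * t * dec ** 2:
            t *= beta
        x = x + t * d
    return x
```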

We begin by showing the reduction in the value of the objective function achieved by Algorithm 1 when the backtracking line search (Armijo rule) is satisfied. We emphasize that this result is global. The ideas of the proofs in the following lemmas parallel those in [5, 26].

Lemma 3.1

For any \( \eta > 0\) there exists \(\gamma > 0\) such that for any \(k \in \mathbb {N}\) with \(\hat{\lambda }(\textbf{x}_{h,k}) > \eta \ \) the coarse direction \(\varvec{\hat{d}}_{h,k}\) will yield the following reduction in the value of the objective function

$$\begin{aligned} f_h(\textbf{x}_{h,k} + t_{h,k} \varvec{\hat{d}}_{h,k}) - f_h(\textbf{x}_{h,k}) \le -\gamma . \end{aligned}$$

We proceed by estimating the sub-optimality gap. In particular, we show it can be bounded in terms of the approximate decrement.

Lemma 3.2

Let \(\hat{\lambda }(\textbf{x}_{h,k}) < 1\). Then,

$$\begin{aligned} \omega (\hat{\lambda }(\textbf{x}_{h,k})) \le f_h(\textbf{x}_{h,k}) - f_h(\textbf{x}_h^*) \le \omega _*(\hat{\lambda }(\textbf{x}_{h,k})), \end{aligned}$$

where the mappings \(\omega \) and \(\omega _*\) are defined as in Eq. (4).

The above result is similar to [26, Theorem 4.1.11] but with \( \hat{\lambda }(\textbf{x}_{h,k})\) in place of \(\lambda (\textbf{x}_{h,k})\). Alternatively, similar to the analysis in [5], the sub-optimality gap can be given as follows.

Lemma 3.3

If \(\lambda (\textbf{x}_{h,k}) \le 0.68\), then

$$\begin{aligned} f_h(\textbf{x}_{h,k}) - f_h(\textbf{x}_h^*) \le \lambda (\textbf{x}_{h,k})^2. \end{aligned}$$

As a result, \(\lambda (\textbf{x}_{h,k})^2\) can be used as an exit condition for Algorithm 1. In addition, with similar arguments, one can show that for any \(\hat{\lambda }(\textbf{x}_{h,k}) \le 0.68\) it holds that \(f_h(\textbf{x}_{h,k}) - f_h(\textbf{x}_h^*) \le \hat{\lambda }(\textbf{x}_{h,k})^2\). To this end, we use \(\hat{\lambda }(\textbf{x}_{h,k})^2 < \epsilon \) whenever the coarse direction is taken, and \(\lambda (\textbf{x}_{h,k})^2 < \epsilon \) otherwise, to guarantee that on exit \(f_h(\textbf{x}_{h,k}) - f_h(\textbf{x}_h^*) \le \epsilon \), for some tolerance \(\epsilon \in (0, 0.68^2)\).

3.2 Quadratic Convergence Rate on the Coarse Subspace

In this section we show that the coarse model achieves a local quadratic convergence rate. We start with the next lemma, in which we examine the condition required for Algorithm 1 to accept the unit step.

Lemma 3.4

Suppose that the coarse direction, \(\hat{\textbf{d}}_{h,k}\), is employed. If

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k}) \le \frac{1}{2} (1 - 2 \alpha ), \end{aligned}$$

where \(\alpha \in (0, 1/2)\), then Algorithm 1 accepts the unit step, \(t_{h,k} = 1\).

Using the lemma above, we now prove quadratic convergence of the coarse model on the subspace spanned by the rows of \(\textbf{R}\). The next result shows quadratic convergence when coarse steps are always taken.

Lemma 3.5

Let \(\lambda (\textbf{x}_{h,k})<1\). Suppose that the sequence \((\textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1 and \(t_{h,k} = 1\). Suppose also that the coarse direction, \(\varvec{\hat{d}}_{h,k}\), is employed. Then,

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1}) \le \left( \frac{\hat{\lambda }(\textbf{x}_{h,k})}{1 - \hat{\lambda }(\textbf{x}_{h,k})} \right) ^2. \end{aligned}$$

According to Lemma 3.5, we can infer the following about the convergence rate of the coarse model: first, note that the root of \(\lambda /(1-\lambda )^2=1\) in (0, 1) is \(\lambda = \frac{3 - \sqrt{5}}{2} \) (since \(\lambda = (1-\lambda )^2\) is equivalent to \(\lambda ^2 - 3\lambda + 1 = 0\)). Hence, we obtain an explicit expression for the region of quadratic convergence: when \(\hat{\lambda }(\textbf{x}_{h,k})< \frac{3 - \sqrt{5}}{2} \), we can guarantee that \(\hat{\lambda }(\textbf{x}_{h,k+1})<\hat{\lambda }(\textbf{x}_{h,k})\), and specifically this process converges quadratically with

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1}) \le \frac{\delta }{(1-\delta )^2}\hat{\lambda }(\textbf{x}_{h,k}), \end{aligned}$$

for some \(\delta \in (0, \lambda )\). Similar to Lemma 3.5, below we show a quadratic convergence rate of the coarse model when only fine steps are taken.

Lemma 3.6

Let \(\lambda (\textbf{x}_{h,k})<1\). Suppose that the sequence \((\textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1 and \(t_{h,k} = 1\). Suppose also that the fine direction, \(\textbf{d}_{h,k}\), is employed. Then,

$$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1}) \le \frac{\lambda (\textbf{x}_{h,0})}{\nu (1 - \lambda (\textbf{x}_{h,k}))^2} \hat{\lambda }(\textbf{x}_{h,k})^2, \end{aligned}$$

where \(\nu \) is defined in Condition 1.

The following theorem summarizes the results of Lemma 3.5 and Lemma 3.6.

Theorem 3.1

Let \(\lambda (\textbf{x}_{h,k})<1\). Suppose that the sequence \((\textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1 with \(t_{h,k} = 1\). Then we obtain reduction of the approximate decrement as follows:

  1. (i)

    if the coarse step is taken then

    $$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1}) \le \left( \frac{\hat{\lambda }(\textbf{x}_{h,k})}{1 - \hat{\lambda }(\textbf{x}_{h,k})} \right) ^2, \end{aligned}$$
  2. (ii)

    if the fine step is taken then

    $$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1}) \le \frac{\lambda (\textbf{x}_{h,0})}{\nu (1 - \lambda (\textbf{x}_{h,k}))^2} \hat{\lambda }(\textbf{x}_{h,k})^2, \end{aligned}$$

    where \(\nu \) is defined in Condition 1.

Proof

The proof of the theorem follows directly from Lemma 3.5 and Lemma 3.6. \(\square \)

Theorem 3.1 provides a description of the convergence of \(\Vert \nabla f_h(\textbf{x}_{h,k}) \Vert \) on the subspace spanned by the rows of \(\textbf{R}\). As such, in the next section we examine the convergence of \(\Vert \nabla f_h(\textbf{x}_{h,k}) \Vert \) on the entire space \(\mathbb {R}^N\).

3.3 Composite Convergence Rate on \(\mathbb {R}^N\)

In this section we study the convergence of SIGMA on \(\mathbb {R}^N\) and, specifically, we establish its composite and super-linear convergence rates for \(\hat{\lambda }(\textbf{x}_{h,k}) \le \eta \), for some \(\eta >0\). We start with the following auxiliary lemma, which will be useful in the discussion of the convergence results that follow in this section.

Lemma 3.7

For any \(k \in \mathbb {N}\) it holds that

$$\begin{aligned} \left\| \ [\nabla ^2 f_h ({\textbf{x}_{h,k}})]^{1/2} \left( \varvec{\hat{d}}_{h,k} - \textbf{d}_{h,k} \right) \right\| _2 = \sqrt{\lambda (\textbf{x}_{h,k})^2 - \hat{\lambda }(\textbf{x}_{h,k})^2}, \end{aligned}$$

where \(\varvec{\hat{d}}_{h,k}\) as in Eq. (18) and \(\textbf{d}_{h,k}\) is the Newton direction.

The next lemma constitutes the core of our theorem.

Lemma 3.8

Let \(\lambda (\textbf{x}_{h,k})<1\). Suppose that the coarse direction, \(\varvec{\hat{d}}_{h,k}\), is employed and, in addition, that the line search selects \(t_{h,k} = 1\). Then,

$$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{ \sqrt{\lambda (\textbf{x}_{h,k})^2 - \hat{\lambda }(\textbf{x}_{h,k})^2}}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} + \frac{\hat{\lambda }(\textbf{x}_{h,k})}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}). \end{aligned}$$

We now use the above result to obtain the two phases of the convergence of SIGMA when coarse steps are always taken. More precisely, the region of super-linear convergence is governed by \(\eta := \frac{3 - \sqrt{5 + 4\varepsilon }}{2}\), where \(\varepsilon := \sqrt{1 - \mu ^2}\).

Lemma 3.9

Suppose that the sequence \((\textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1 and that the coarse direction, \(\varvec{\hat{d}}_{h,k}\), is employed. For any \(\mu \in (0,1)\), there exist constants \(\gamma >0\) and \(\eta \in (0, \frac{3 - \sqrt{5}}{2})\) such that

  1. (i)

    if \(\hat{\lambda }(\textbf{x}_{h,k}) > \eta \), then

    $$\begin{aligned} f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le -\gamma , \end{aligned}$$
  2. (ii)

    if \(\hat{\lambda }(\textbf{x}_{h,k}) \le \eta \), then Algorithm 1 selects the unit step and

    $$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1})&< \left( \frac{\hat{\lambda }(\textbf{x}_{h,k})}{1 - \hat{\lambda }(\textbf{x}_{h,k})} \right) ^2 < \hat{\lambda }(\textbf{x}_{h,k}). \end{aligned}$$
    (23)

    We further have

    $$\begin{aligned} \lambda (\textbf{x}_{h,k+1})< \frac{\varepsilon + \hat{\lambda }(\textbf{x}_{h,k})}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}) < \lambda (\textbf{x}_{h,k}). \end{aligned}$$
    (24)

Proof

The result of the first phase, (i), is already proved in Lemma 3.1 and, in particular, it holds with \(\gamma = \alpha \beta \frac{\eta ^2}{1 + \eta }\). Further, for phase (ii) of the algorithm, by Lemma 3.4 and for some \(\eta \in (0, \frac{3 - \sqrt{5}}{2})\), Algorithm 1 selects the unit step. Additionally, Lemma 3.5 guarantees a reduction in the approximate decrement, as required by inequality Eq. (23); specifically, this process converges quadratically. In addition, since the coarse direction is taken, \(\hat{\lambda }(\textbf{x}_{h,k}) > \mu \lambda (\textbf{x}_{h,k})\). Applying this inequality to the result of Lemma 3.8 yields

$$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{\varepsilon + \hat{\lambda }(\textbf{x}_{h,k})}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}). \end{aligned}$$

Thus, convergence is achieved if the fraction in the above inequality is less than one. By assumption, \(\hat{\lambda }(\textbf{x}_{h,k}) \le \eta \) and since \(x \rightarrow \frac{\varepsilon + x}{(1-x)^2}\) is monotone increasing we obtain

$$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{\varepsilon + \eta }{\left( 1 - \eta \right) ^2} \lambda (\textbf{x}_{h,k}). \end{aligned}$$

Setting \(\eta = \frac{3 - \sqrt{5 + 4\varepsilon }}{2}\) we see that \((\varepsilon + \eta )/\left( 1 - \eta \right) ^2 < 1\). Finally, \(\mu \in (0,1)\) implies \(\varepsilon \in (0,1)\), and thus inequality Eq. (24) holds for \(\eta \in (0, \frac{3 - \sqrt{5}}{2})\), which concludes the proof of the lemma.

\(\square \)

According to Lemma 3.9, for some \(\eta \in (0, \frac{3 - \sqrt{5}}{2})\), we can infer the following about the convergence of Algorithm 1: In the first phase, for \(\hat{\lambda }(\textbf{x}_{h,k}) > \eta \), the objective function is reduced as \( f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le -\gamma \) and thus the number of steps of this phase is bounded by \(\frac{1}{\gamma } [f_h(\textbf{x}_{h, 0}) - f_h(\textbf{x}_{h}^*)]\). In the second phase, for \(\hat{\lambda }(\textbf{x}_{h,k}) \le \eta \) the reduction is given in Eq. (24). It is easy to show that Algorithm 1 obtains a local composite convergence rate: to see this, we combine Lemma 2.3 and Eq. (24) to get

$$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{\varepsilon + \hat{\lambda }(\textbf{x}_{h,k})}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}) \le \frac{\varepsilon + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}). \end{aligned}$$

The above result shows convergence of Algorithm 1 when the coarse step is always taken. When Algorithm 1 alternates between coarse and fine (Newton) steps, a combined convergence behavior is expected. The complete convergence behavior of Algorithm 1 is summarized in the following theorem. Note that in this case the local region of the fast convergence rate is given in terms of the magnitude of the Newton decrement.

Theorem 3.2

Suppose that the sequence \((\textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1. There exist constants \(\gamma _{\text {S}}, \gamma _\text {N} >0\) and \(\bar{\eta } \in (0, \frac{3 - \sqrt{5}}{2})\) such that

  1. (i)

    If \(\lambda (\textbf{x}_{h,k}) > \bar{\eta }\) and the coarse step is taken then \( f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le -\gamma _\text {S}\)

  2. (ii)

    If \(\lambda (\textbf{x}_{h,k}) > \bar{\eta }\) and the fine step is taken then \( f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le -\gamma _\text {N}\)

  3. (iii)

    If \(\lambda (\textbf{x}_{h,k}) \le \bar{\eta }\) and the coarse step is taken, then Algorithm 1 selects the unit step and

    $$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{\varepsilon + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k})<\lambda (\textbf{x}_{h,k}), \end{aligned}$$
  4. (iv)

    If \(\lambda (\textbf{x}_{h,k}) \le \bar{\eta }\) and the fine step is taken, then Algorithm 1 selects the unit step and

    $$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \left( \frac{\lambda (\textbf{x}_{h,k})}{ 1 - \lambda (\textbf{x}_{h,k}) } \right) ^2<\lambda (\textbf{x}_{h,k}). \end{aligned}$$

Proof

If \(\lambda (\textbf{x}_{h,k})<1\) and the coarse direction is taken, we have that

$$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{\varepsilon + \hat{\lambda }(\textbf{x}_{h,k})}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}) \le \frac{\varepsilon + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}), \end{aligned}$$

where the last inequality follows since \(\hat{\lambda }(\textbf{x}_{h,k}) \le \lambda (\textbf{x}_{h,k})\). By Lemma 3.9, we obtain reduction in the Newton decrement and thus a composite convergence rate if \(\lambda (\textbf{x}_{h,k}) \le \eta \in (0, \frac{3 - \sqrt{5}}{2})\). On the other hand, the Newton method enters its local quadratic region of convergence if \(\lambda (\textbf{x}_{h,k}) \le \eta _\text {N} \in (0, \frac{3 - \sqrt{5}}{2})\) [26]. Setting \(\bar{\eta }:= \min \{ \eta , \eta _\text {N} \}\) and using \(\lambda (\textbf{x}_{h,k}) \le \bar{\eta }\) proves items (iii) and (iv).

Further, in Item (i), since the coarse direction is taken it holds that \(\hat{\lambda }(\textbf{x}_{h,k}) \ge \mu \lambda (\textbf{x}_{h,k}) > \mu \eta \). Combining this inequality with the result of Lemma 3.1, the result holds for \(\gamma _\text {S} = \alpha \beta \frac{\mu ^2 \eta ^2}{1 + \mu \eta }\). Lastly, Item (ii) holds for \(\gamma _\text {N} = \alpha \beta \frac{ \eta _\text {N}^2}{1 + \eta _\text {N}}\) (see [5]). \(\square \)

3.4 Discussion on the Convergence Results

Notice that the two phases of Lemma 3.9 depend on the user-defined parameter \(\mu \). That is, the region of composite convergence is proportional to the value of \(\mu \) that satisfies Condition 1. Specifically, as \(\mu \rightarrow 1\), we see that \(\varepsilon \rightarrow 0\) and \(\eta \rightarrow \frac{3 - \sqrt{5}}{2}\), and thus SIGMA approaches the fast convergence of the full Newton method. On the other hand, as \(\mu \rightarrow 0\), Lemma 3.9 indicates a restricted region for the second phase and thus slower convergence. Therefore, as a consequence of Lemma 3.9, the user is able to select a priori the desired region of the fast convergence rate through \(\mu \). However, recall that larger values of the user-defined parameter \(\mu \) may yield more expensive iterations (fine steps). Therefore, there is a trade-off between the number of coarse steps and the choice of \(\mu \). Note also that the theorem shows that it is possible for the method to achieve a super-linear convergence rate. This will be attained if we select iteration-dependent \(\textbf{R}_{k}\) such that \(\varepsilon _k\) converges to zero as \(k \rightarrow \infty \). We examine this case in the next section. Lemma 3.10 below offers further insights on the composite convergence rate.
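Before that, a short numerical illustration (ours, not part of the paper's experiments) of how the super-linear region \(\eta = (3 - \sqrt{5 + 4\varepsilon })/2\) of Lemma 3.9 shrinks as \(\mu \) decreases, with \(\varepsilon = \sqrt{1 - \mu ^2}\):

```python
# Illustration (ours): evaluate epsilon = sqrt(1 - mu^2) and the super-linear
# region eta = (3 - sqrt(5 + 4*epsilon)) / 2 of Lemma 3.9 for several mu.
import math

for mu in (0.99, 0.9, 0.7, 0.5, 0.3):
    eps = math.sqrt(1 - mu ** 2)
    eta = (3 - math.sqrt(5 + 4 * eps)) / 2
    print(f"mu={mu:.2f}  eps={eps:.3f}  eta={eta:.4f}")
```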

Lemma 3.10

For all \(k \in \mathbb {N}\) it holds that

$$\begin{aligned} \left\| [\nabla ^2 f_h ({\textbf{x}_{h,k}})]^{1/2} \left( \varvec{\hat{d}}_{h,k} - \textbf{d}_{h,k} \right) \right\| _2 \le \lambda (\textbf{x}_{h,k}). \end{aligned}$$

The above result shows that, without using Condition 1, the quantity of interest in Lemma 3.8 can, in the worst case, be as large as the Newton decrement. However, if \(\hat{\lambda }(\textbf{x}_{h,k}) \le \eta \), then to obtain a reduction in the Newton decrement we must have \(\left\| [\nabla ^2 f_h ({\textbf{x}_{h,k}})]^{1/2} \left( \varvec{\hat{d}}_{h,k} - \textbf{d}_{h,k} \right) \right\| _2 \le \varepsilon \lambda (\textbf{x}_{h,k})\), where \(0 \le \varepsilon < 1\). Analyzing this inequality at the boundary case \(\varepsilon =1\), we identify the following cases:

  1. (i)

    \( \varvec{\hat{d}}_{h,k} = c \textbf{d}_{h,k} \), where \(\varepsilon = 1\) holds for \(c=0\) or \(c=2\). To see this, by Lemma 3.10 we take,

    $$\begin{aligned} \left\| [\nabla ^2 f_h ({\textbf{x}_{h,k}})]^{1/2} \left( c-1 \right) \textbf{d}_{h,k} \right\| _2 = \left| c-1 \right| \lambda (\textbf{x}_{h,k}) \le \lambda (\textbf{x}_{h,k}), \end{aligned}$$

    and thus \(\varepsilon =1\) holds for \(c=0\) or \(c=2\). The case \(c=0\) indicates that \(\varvec{\hat{d}}_{h,k} = 0\) while \(\textbf{d}_{h,k} \ne 0\), as also discussed in Sect. 2.2. Alternatively, \(c=0\) implies \(\varvec{\hat{d}}_{h,k} = \textbf{d}_{h,k} = 0\), which is attained only in the limit. Further, \(c=2\) also implies \( \left\| [\nabla ^2 f_h ({\textbf{x}_{h,k}})]^{1/2} \left( \varvec{\hat{d}}_{h,k} - \textbf{d}_{h,k} \right) \right\| _2 \le \lambda (\textbf{x}_{h,k})\); however, we note that this is an extreme case that rarely holds in practice.

  2. (ii)

    Similarly, we have that \(\hat{\lambda }(\textbf{x}_{h,k}) = 0\) while \(\lambda (\textbf{x}_{h,k}) \ne 0\) or \(\hat{\lambda }(\textbf{x}_{h,k}) = \lambda (\textbf{x}_{h,k}) = 0\). This can be derived directly from Lemma 3.7 and it is exactly the case (i) for \(c=0\).

When either of the above cases holds, the multilevel algorithm will not progress. Notice that the composite convergence rate holds for any choice of the prolongation operator; since we make no further assumptions, this result verifies our intuition that there exists a choice of \(\textbf{P}\) that may lead to an ineffective coarse step. To prevent this, we make use of Condition 1. Nevertheless, the above cases rarely hold in practice, and thus we should expect the multilevel algorithm to enter the second phase near the minimizer. Additionally, the above cases indicate that, in the worst-case scenario, the multilevel method will converge to a sub-optimal solution; however, it will never diverge. This result is important considering that we make no assumptions on \(\textbf{P}\) whatsoever.

As discussed above, in this setting SIGMA requires checking Condition 1 (or \(\Vert \textbf{R} \nabla f (\textbf{x}_{h,k}) \Vert _2 > \mu \Vert \nabla f (\textbf{x}_{h,k}) \Vert _2 \), see Remark 3.1) at each iteration. However, the checking process can be expensive for large-scale optimization. In the next section we present a randomized version of the multilevel algorithm and study problem instances where the checking process can be omitted. In such cases there exists \(0 \le \varepsilon < 1\), and thus SIGMA is guaranteed to converge to the solution even if fine steps are never taken. This result is new in the analysis of multilevel methods and further improves the efficacy of the proposed algorithm for large-scale optimization.
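A minimal sketch (ours, under the reading of Remark 3.1 above) of the per-iteration check: the coarse step is deemed effective only when the restricted gradient retains a \(\mu \)-fraction of the full gradient norm; otherwise a fine (Newton) step would be taken instead.

```python
# Sketch (ours) of the check from Remark 3.1: ||R grad||_2 > mu * ||grad||_2.
import numpy as np

def coarse_step_is_effective(R, grad, mu):
    """R: n x N restriction matrix, grad: gradient in R^N, mu in (0, 1)."""
    return np.linalg.norm(R @ grad) > mu * np.linalg.norm(grad)

# toy usage with a sampling-type restriction (rows of the identity)
rng = np.random.default_rng(0)
R = np.eye(100)[rng.choice(100, size=20, replace=False)]
print(coarse_step_is_effective(R, rng.standard_normal(100), mu=0.5))
```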

3.5 Super-linear Convergence Rate through the Nyström Method

The computational bottlenecks in Algorithm 1 are: (i) the construction of the coarse step in Eq. (18), and (ii) the checking process required to avoid an ineffective search direction. To overcome both, we select the prolongation operator as in Definition 2.1, so that the Galerkin model is generated from a low-rank approximation of the Hessian matrix through the Nyström method. In particular, we show that if SIGMA enters the super-linear phase, then the checking process can be omitted. Moreover, by employing the Nyström method we overcome the computational issues related to the construction of the coarse direction (see Sect. 2.5, Remark 2.1 and Remark C.1 for details on the efficient implementation of Algorithm 1 using the Nyström method).

Given \(\textbf{R}_k\) as in Definition 2.1, let \(\rho \in [0,1]\) be the probability that \(\hat{\lambda }(\textbf{x}_{h,k}) \le \nu \in (0,0.68^2)\), for any \(k \in \mathbb {N}\) such that \(\textbf{x}_{h,k} \ne \textbf{x}_h^*\). Recall from the results of the previous section that \(\hat{\lambda }(\textbf{x}_{h,k}) = 0\) implies no progress for the multilevel scheme and thus convergence to a sub-optimal solution; for this reason Condition 1 was necessary in our previous analysis. Given now that \(\textbf{R}_k\) is constructed randomly according to Definition 2.1, it is possible to abandon Condition 1 and provide probabilistic results that arise from some discrete probability distribution. Note that, by Eq. (22), \(\rho \) is effectively the probability of selecting sufficiently small partial derivatives. Given the sampling strategies introduced in Sect. 2.5, \(\rho \) is expected to be small or zero.
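For concreteness, the sketch below gives one possible reading of the coarse-direction construction when \(\textbf{R}_k\) samples n of the N coordinates, so that the Galerkin coarse Hessian \(\textbf{R}_k \nabla ^2 f_h(\textbf{x}_{h,k}) \textbf{R}_k^T\) is an \(n \times n\) principal submatrix of the Hessian (the Nyström connection). This is a hedged illustration under our own assumptions; the exact operators of Definition 2.1 and Eq. (18) are not reproduced here.

```python
# Hedged sketch (ours): coarse Galerkin direction when R samples coordinates,
# i.e. d_hat = -P (R H P)^{-1} R g with P = R^T, so only an n x n principal
# submatrix of the Hessian is factorized.
import numpy as np

def coarse_direction(H, g, idx):
    """idx: sampled coordinate indices (length n << N)."""
    H_c = H[np.ix_(idx, idx)]          # R H R^T  (n x n principal submatrix)
    g_c = g[idx]                       # R g
    d_c = -np.linalg.solve(H_c, g_c)   # coarse Newton step in R^n
    d_hat = np.zeros_like(g)
    d_hat[idx] = d_c                   # prolongate back: d_hat = R^T d_c
    return d_hat

# toy usage on a small SPD quadratic
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 40))
H = A.T @ A + np.eye(40)
g = rng.standard_normal(40)
idx = rng.choice(40, size=10, replace=False)
print(np.linalg.norm(coarse_direction(H, g, idx)))
```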

Lemma 3.11

Assume that \(\textbf{R}_k\), \(k \in \mathbb {N}\), is constructed as in Definition 2.1. Then, there exists \(\mu _k \in \left( 0, \frac{\hat{\lambda }(\textbf{x}_{h,k})}{\lambda (\textbf{x}_{h,k})} \right] \) such that \( 0 < \mu _k \le 1\) and

$$\begin{aligned} \sqrt{ \lambda (\textbf{x}_{h,k})^2 - \hat{\lambda }(\textbf{x}_{h,k})^2}&\le \sqrt{1 - \mu _k^2} \lambda (\textbf{x}_{h,k}) \end{aligned}$$

with probability \(1-\rho \).

Since there exist iteration-dependent scalars \(\mu _k\), we are able to show a super-linear convergence rate. This is true because \(0<\mu _k\le 1\), and thus it is possible to select \(\textbf{R}_k\) such that \(\lim _{k \rightarrow \infty } \mu _k = 1\). Therefore, to show the desired result, we additionally need to impose the assumption that \(\lim _{k \rightarrow \infty } \mu _k = 1\). The theorem below presents an instance of SIGMA that achieves a local super-linear rate.

Theorem 3.3

Suppose that the coarse direction is constructed with \(\textbf{R}_k\) as in Definition 2.1 such that \(\mu _k = 1 - \frac{1}{2 \ln (2 + k)}\). Moreover, suppose that the sequence \(( \textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1. Then, there exist constants \(\hat{\gamma } >0\) and \(\hat{\eta } \in (0, \frac{3 - \sqrt{5}}{2})\) such that

  1. (i)

    If \(\lambda (\textbf{x}_{h,k}) > \hat{\eta }\), then, with probability \(1-\rho \),

    $$\begin{aligned} f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le -\hat{\gamma }, \end{aligned}$$
  2. (ii)

    If \(\lambda (\textbf{x}_{h,k}) \le \hat{\eta }\), then Algorithm 1 selects the unit step and

    $$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1})&< \left( \frac{\hat{\lambda }(\textbf{x}_{h,k})}{1 - \hat{\lambda }(\textbf{x}_{h,k})} \right) ^2 < \hat{\lambda }(\textbf{x}_{h,k}), \end{aligned}$$

    where after \(K_0\) iterations this process converges quadratically with probability \((1-\rho )^{K_0}\). Setting \(\varepsilon _k:= \sqrt{1 - \mu _k^2}\), we further have

    $$\begin{aligned} \lambda (\textbf{x}_{h,k+1})< \frac{\varepsilon _k + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}) < \lambda (\textbf{x}_{h,k}), \end{aligned}$$

    where after \(K_0\) iterations this process achieves a super-linear convergence rate with probability \((1-\rho )^{K_0}\).

Proof

Recall the following inequality from Lemma 3.1

$$\begin{aligned} f_h(\textbf{x}_{h,k} + t_{h,k} \varvec{\hat{d}}_{h,k}) - f_h(\textbf{x}_{h,k}) \le - \alpha \beta \frac{\hat{\lambda }(\textbf{x}_{h,k})^2}{1 + \hat{\lambda }(\textbf{x}_{h,k})}. \end{aligned}$$

Then from Lemma 3.11 we have that \(\hat{\lambda }(\textbf{x}_{h,k}) \ge \mu _k \lambda (\textbf{x}_{h,k}) \ge \mu _k \hat{\eta } \ge \mu _0 \hat{\eta } \), where the last inequality holds since \((\mu _k)\) forms an increasing sequence of real numbers. Thus, using \(\hat{\gamma }:= \alpha \beta \frac{\mu _0^2 \hat{\eta }^2}{1 + \mu _0 \hat{\eta }}\) and since \(x \rightarrow \frac{x^2}{1+x}\) is an increasing function, the result of the first phase follows with probability \(1 - \rho \). Let \(K \in \mathbb {N}\) denote the first iteration that satisfies \(\lambda (\textbf{x}_{h,K})<1\). Then, for all \(k \ge K\), we have that \(\mu _K \lambda (\textbf{x}_{h,k}) \le \mu _k \lambda (\textbf{x}_{h,k}) \le \hat{\lambda }(\textbf{x}_{h,k}) \le \lambda (\textbf{x}_{h,k}) < 1\), where, in fact, \(\mu _K = 1 - \frac{1}{2 \ln (2 + K)}\) is the smallest such \(\mu _k\). Combining the last inequality, Lemma 3.11 and Lemma 3.4, the method accepts the unit step if \(\lambda (\textbf{x}_{h,k}) \le \frac{1 - 2\alpha }{2 \mu _K}\). This proves that the bound of Lemma 3.5 holds with probability \(1-\rho \). Since the probability at each iteration is \(1-\rho \), the overall probability after \(K_0\) iterations is \((1-\rho )^{K_0}\). Last, as in the proof of Lemma 3.9, for \(\lambda (\textbf{x}_{h,k}) < 1\), we can show

$$\begin{aligned} \lambda (\textbf{x}_{h,k+1}) \le \frac{\varepsilon _k + \hat{\lambda }(\textbf{x}_{h,k})}{\left( 1 - \hat{\lambda }(\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}) \le \frac{\varepsilon _k + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}), \end{aligned}$$

with probability \(1-\rho \). Here, the difference is that the scalar \(\varepsilon _k\) is not fixed but it depends on the iteration-dependent \(\mu _k\), i.e., \(\varepsilon _k = \sqrt{1 - \mu _k^2}\). Therefore,

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\varepsilon _k + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} = 0, \end{aligned}$$

which proves the super-linear convergence. Finally, the scalar \(\hat{\eta }\) that controls the super-linear rate is given by \(\hat{\eta }:= \frac{3 - \sqrt{5 + 4 \varepsilon _K}}{2}\), where \(\varepsilon _K = \sqrt{1 - \mu _K^2}\) and thus we obtain \(\hat{\eta } \in (0, \frac{3 - \sqrt{5}}{2})\). Then, we can ensure that for all \(k \in \mathbb {N}\) such that \(\lambda (\textbf{x}_{h,k}) < \hat{\eta }\) it holds \(\lambda (\textbf{x}_{h,k+1}) < \lambda (\textbf{x}_{h,k})\), and hence in the second-phase Algorithm 1 converges super-linearly. Therefore, overall, after \(K_0\) iterations, the superlinear rate is achieved with probability \((1-\rho )^{K_0}\). \(\square \)
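As a quick sanity check of the schedule used in Theorem 3.3 (our own illustration, not taken from the paper), the snippet below confirms that \(\mu _k = 1 - \frac{1}{2\ln (2+k)}\) increases towards one, so that \(\varepsilon _k = \sqrt{1-\mu _k^2}\) decays towards zero, albeit slowly.

```python
# Check (ours): the schedule mu_k = 1 - 1/(2*ln(2+k)) increases towards 1,
# hence eps_k = sqrt(1 - mu_k^2) decays towards 0.
import math

for k in (0, 10, 100, 10_000, 1_000_000):
    mu_k = 1 - 1 / (2 * math.log(2 + k))
    eps_k = math.sqrt(1 - mu_k ** 2)
    print(f"k={k:>9}: mu_k={mu_k:.4f}, eps_k={eps_k:.4f}")
```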

Remark 3.2

We can also say something about the rate of convergence of the sequence \((f(\textbf{x}_{h,k}))_{k \in \mathbb {N}}\). Note that the sequence \((\lambda (\textbf{x}_{h,k}))_{k \in \mathbb {N}}\) achieves a Q-superlinear rate of convergence. Then, by Lemma 3.3, \((f_h(\textbf{x}_{h,k}))_{k \in \mathbb {N}}\) converges to \(f_h(\textbf{x}^*)\) with R-superlinear convergence rate, see [28] for details on super-linear rates.

The above theorem shows that the multilevel algorithm can achieve a super-linear convergence rate if \(\textbf{R}_k\) is selected such that \(\lim _{k \rightarrow \infty } \mu _k = 1\). In this case, fine directions need not be taken during the process. In the next section, we illustrate that SIGMA can achieve a super-linear or even a quadratic convergence rate in practice, and thus the key assumption of Theorem 3.3 is not merely theoretical. The theorem also allows for a quadratic convergence rate, i.e., there exists \(K_q \in \mathbb {N}\) such that \(\mu _k = 1\) for all \(k \ge K_q\). Quadratic rates are also observed for SIGMA in our experiments, so our theory is consistent with practice. Moreover, \(\lim _{k \rightarrow \infty } \mu _k = 1\) clearly holds when \(\lim _{k \rightarrow \infty } \hat{\lambda }(\textbf{x}_{h,k}) = \lambda (\textbf{x}_{h,k})\). This condition is expected to be satisfied as \( n \rightarrow N\). However, larger values of n yield more expensive iterations. Therefore, one should search for problems where \(\hat{\lambda }(\textbf{x}_{h,k}) \approx \lambda (\textbf{x}_{h,k})\) for \(n \ll N\). Problem structures that satisfy \(\hat{\lambda }(\textbf{x}_{h,k}) \approx \lambda (\textbf{x}_{h,k})\) include cases where the Hessian matrix is nearly low-rank or where there is a large gap between its eigenvalues, i.e., \(\lambda _1 \ge \lambda _2 \ge \cdots \ge \lambda _{n} \gg \lambda _{n+1} \ge \cdots \ge \lambda _N\), in other words, when the important second-order information is concentrated in the first few eigenvalues. Given such problem structures, we expect very fast convergence rates without ever using fine directions, and thus a convergence behavior similar to Theorem 3.3. This claim is also verified via numerical experiments.

On the other hand, in the absence of the assumption \(\lim _{k \rightarrow \infty } \mu _k = 1\), we can still guarantee convergence of SIGMA with a local composite rate and probability \(1 - \rho \). In this case, however, we require the assumption that there exists a global \(\mu \in (0,1]\) such that \(\hat{\lambda }(\textbf{x}_{h,k}) \ge \mu \lambda (\textbf{x}_{h,k})\) with probability \(1-\rho \). We can also compute the total number of steps that Algorithm 1 requires to reach a tolerance \(\epsilon \). Assume that the algorithm enters the composite phase and achieves a rate \(a_r\), i.e., \(\lambda (\textbf{x}_{h,{k+1}}) \le a_r \lambda (\textbf{x}_{h,k})\), where \(0<a_r <1\). We can also determine the region of composite convergence, which depends on \(a_r\); we denote it by \(\eta (a_r)\) and set \(\eta := \eta (a_r)\) for simplicity. Given the above considerations, we can guarantee the following about the total number of steps in both phases.

Corollary 3.1

Let \(\gamma := \alpha \beta \frac{\eta ^2}{1 + \eta }\), \(\eta := \eta (a_r)\) and \(\epsilon \in (0, 0.68^2)\). Further, suppose that the coarse direction is constructed with \(\textbf{R}_k\) as in Definition 2.1 and that the sequence \(( \textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1. Then, the total number of iterations for Algorithm 1 to reach an \(\epsilon \)-approximate value in the objective function is at least

$$\begin{aligned} N_\epsilon := \frac{f_h(\textbf{x}_{h,0}) - f_h(\textbf{x}_{h}^*)}{\gamma } + \frac{\ln (\eta / \epsilon )}{\ln (1/a_r)} + 1, \end{aligned}$$

with probability \((1-\rho )^{N_\epsilon }\).

Proof

Recall that the main assumption we use here is that \(\hat{\lambda }(\textbf{x}_{h,k}) \ge \mu \lambda (\textbf{x}_{h,k})\), which holds with probability \(1-\rho \). Then, we guarantee that \(\hat{\lambda }(\textbf{x}_{h,k}) > 0\) and thus the result of the first phase of Lemma 3.9 holds with probability \(1-\rho \). This shows that the objective function decreases by at least \(\gamma \) at each iteration and thus the number of iterations taken by Algorithm 1 during the first phase is at most

$$\begin{aligned} N_1:= \frac{f(\textbf{x}_{h,0}) - f(\textbf{x}_{h}^*)}{\gamma }, \end{aligned}$$

with overall probability \((1-\rho )^{N_1}\). Moreover, let \(K \in \mathbb {N}\) be the first iteration at which the algorithm enters the composite phase, so that \(\lambda (\textbf{x}_{h,K}) \le \eta \). Next, let \(K_{\epsilon }\) be the number of iterations required for the method to reach a tolerance \(\epsilon \). Then, by Lemma 3.3,

$$\begin{aligned} f(\textbf{x}_{h,K + K_\epsilon }) - f(\textbf{x}_{h}^*)&\le \lambda (\textbf{x}_{h,K + K_\epsilon })^2 \le a_r^{2 K_\epsilon } \lambda (\textbf{x}_{h,K})^2 \le a_r^{2 K_\epsilon } \eta ^2, \end{aligned}$$

where the second inequality holds by Lemma 3.9, assuming that the Newton decrement decreases as \(\lambda (\textbf{x}_{h,{k+1}}) \le a_r \lambda (\textbf{x}_{h,k})\). Note that this decrease holds with probability \(1-\rho \) due to the main assumption. Therefore, \(a_r^{2 K_\epsilon } \eta ^2 \le \epsilon ^2\) yields

$$\begin{aligned} K_\epsilon \ge \frac{\ln (\eta / \epsilon )}{\ln (1 / a_r)}, \end{aligned}$$

which ensures that \(f(\textbf{x}_{h,K + K_\epsilon }) - f(\textbf{x}_{h}^*) \le \epsilon ^2\) after \(K_\epsilon + 1\) iterations in the composite phase with probability \((1-\rho )^{K_\epsilon + 1}\). Putting this all together, and since at every iteration of both phases the probability is \(1-\rho \), the minimum number of iterations to reach an \(\epsilon \)-approximate solution follows with probability \((1-\rho )^{N_\epsilon }\). \(\square \)

Below we present a simple example using an explicit value of \(a_r\) (and therefore of \(\eta \) and \(\mu \)).

Example: Assume \(a_r:= 1/2\), i.e., we wish the decrease in the second phase of Lemma 3.9 to be at least \(\lambda (\textbf{x}_{h,{k+1}}) \le \lambda (\textbf{x}_{h,k}) / 2\). Following the proof of Lemma 3.9, this value of \(a_r\) can be achieved as long as \(\mu > \sqrt{3/4}\). Moreover, \(a_r = 1/2\) yields the following region of composite convergence: \(\eta = 2 - \sqrt{3 + 2 \varepsilon }\), with \(\varepsilon \) as in Lemma 3.9. Note that the condition \(\mu \in (\sqrt{3/4}, 1]\) ensures that \(\eta \) is positive; specifically, we have \(\eta \in (0, 13/50)\). Then, applying these values in the result of Corollary 3.1, and given a tolerance \(\epsilon := 10^{-5}\), we obtain \(K_\epsilon \approx 15\). Therefore, in this set-up, with probability \(1-\rho \), Algorithm 1 requires just 15 iterations during the composite phase to ensure that \(f(\textbf{x}_{h,K + K_\epsilon }) - f(\textbf{x}_{h}^*) \le \epsilon ^2\) on exit, using only coarse steps.
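The arithmetic of the example can be reproduced with the following short snippet (our own check); the value \(\eta = 0.26\) is the upper end of the stated range and is used here only for illustration.

```python
# Reproduce the example (ours): a_r = 1/2, eta ~ 0.26, eps = 1e-5, and the
# bound a_r^(2K) * eta^2 <= eps^2 gives K >= ln(eta/eps) / ln(1/a_r).
import math

a_r, eta, eps = 0.5, 0.26, 1e-5
K = math.ceil(math.log(eta / eps) / math.log(1 / a_r))
print(K)   # 15 composite-phase iterations
```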

Furthermore, in the remark below, we give insights on the magnitude of probability \(\rho \) when the sampling strategy arises from the uniform or adaptive distributions (see Sect. 2.5).

Remark 3.3

Recall that, according to Definition 2.1, the probability that \(\hat{\lambda }(\textbf{x}_{h,k}) \le \nu \) boils down to the probability of selecting the n (almost) zero entries from the N entries of the gradient vector, while \(\Vert \nabla f(\textbf{x}_{h,k}) \Vert \ne 0\). Let r be the number of zero elements of the gradient vector and assume that the n samples are drawn uniformly without replacement. Then, if \(n > r\) we have \(\rho = 0\), and thus the results in Theorem 3.3 hold with probability one. If \(n \le r\), then \({\left( {\begin{array}{c}r\\ n\end{array}}\right) }\) denotes the number of n-combinations of the zero elements, and \({\left( {\begin{array}{c}N\\ n\end{array}}\right) }\) the total number of n-combinations of the N entries of the gradient vector. Then,

$$\begin{aligned} \rho = \frac{{\left( {\begin{array}{c}r\\ n\end{array}}\right) }}{{\left( {\begin{array}{c}N\\ n\end{array}}\right) }} = \frac{(r - n +1) \cdots (r-1)r}{(N - n +1) \cdots (N-1)N}. \end{aligned}$$

This result indicates that when \(r \ll N\), \(\rho \) is very small. In practical applications, it is common to expect \(r \ll N\) or \(n > r\), and hence SIGMA will converge with high probability or with probability one, respectively. On the other hand, if the n samples are collected according to the adaptive distribution then, by construction, \(\rho = 0\) and SIGMA converges with probability one. This observation is stated formally in the corollary below and is verified through the numerical experiments in Sect. 4 and Appendix C. \(\square \)
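The following snippet (our own illustration) evaluates the expression for \(\rho \) above for a few representative values of N, n and r, confirming that \(\rho \) is either zero or negligibly small in the regimes discussed in Remark 3.3.

```python
# Illustration (ours) of Remark 3.3: rho = C(r, n) / C(N, n) vanishes when
# n > r and is negligibly small whenever r << N.
from math import comb

def rho(N, n, r):
    return 0.0 if n > r else comb(r, n) / comb(N, n)

print(rho(N=10_000, n=500, r=50))   # 0.0, since n > r
print(rho(N=10_000, n=20,  r=50))   # a vanishingly small probability
```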

Corollary 3.2

Suppose that the coarse direction is constructed with \(\textbf{R}_k\) as in Definition 2.1, where \(\textbf{p}\) is the adaptive distribution (see Sect. 2.5). Assume also that \(\mu _k = 1 - \frac{1}{2 \ln (2 + k)}\). Moreover, suppose that the sequence \(( \textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated by Algorithm 1. Then, there exist constants \(\hat{\gamma } >0\) and \(\hat{\eta } \in (0, \frac{3 - \sqrt{5}}{2})\) such that

  1. (i)

    If \(\lambda (\textbf{x}_{h,k}) > \hat{\eta }\), then

    $$\begin{aligned} f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le -\hat{\gamma }, \end{aligned}$$

    with probability one,

  2. (ii)

    If \(\lambda (\textbf{x}_{h,k}) \le \hat{\eta }\), then Algorithm 1 selects the unit step and

    $$\begin{aligned} \hat{\lambda }(\textbf{x}_{h,k+1})&< \left( \frac{\hat{\lambda }(\textbf{x}_{h,k})}{1 - \hat{\lambda }(\textbf{x}_{h,k})} \right) ^2 < \hat{\lambda }(\textbf{x}_{h,k}), \end{aligned}$$

    where this process converges quadratically. Setting \(\varepsilon _k:= \sqrt{1 - \mu _k^2}\), we further have

    $$\begin{aligned} \lambda (\textbf{x}_{h,k+1})< \frac{\varepsilon _k + \lambda (\textbf{x}_{h,k})}{\left( 1 - \lambda (\textbf{x}_{h,k}) \right) ^2} \lambda (\textbf{x}_{h,k}) < \lambda (\textbf{x}_{h,k}), \end{aligned}$$

    where this process achieves a super-linear convergence rate. Both results in this phase hold with probability one.

Proof

Combining Theorem 3.3 and the definition of the adaptive sampling strategy in Sect. 2.5, the result of the corollary follows immediately. \(\square \)

Last, we prove a simple convergence result of SIGMA in expectation. It shows that the expected value of the objective function decreases by at least \(\tilde{\gamma } > 0\) at each iteration.

Theorem 3.4

Suppose that we select \(\textbf{R}_k\) as in Definition 2.1 and that the sequence \( ( \textbf{x}_{h,k} )_{k \in \mathbb {N}}\) is generated from Algorithm 1. Suppose also \(\hat{\lambda }(\textbf{x}_{h,k}) > \nu \) while \(\textbf{x}_{h,k} \ne \textbf{x}_{h}^*\). Then, there exists \(\tilde{\gamma } > 0\) such that

$$\begin{aligned} \mathbb {E} \left[ f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \right] \le -\tilde{\gamma }. \end{aligned}$$

Proof

Since \(\hat{\lambda }(\textbf{x}_{h,k}) > \nu \), Lemma 3.11 holds with probability one. Then, taking expectation conditioned on \(\mathcal {F}_k\) we have that \( \tilde{\mu }^2 \le \frac{ \mathbb {E} [ \hat{\lambda }(\textbf{x}_{h,k}) | \mathcal {F}_k ]^2}{\lambda (\textbf{x}_{h,k})^2}\), where \(\tilde{\mu }:= \mathbb {E}[\mu _k | \mathcal {F}_k]>0\). Moreover, from Lemma 3.1 we have that

$$\begin{aligned} f_h(\textbf{x}_{h,k+1}) - f_h(\textbf{x}_{h,k}) \le - \alpha \beta \frac{\hat{\lambda }(\textbf{x}_{h,k})^2}{1 + \hat{\lambda }(\textbf{x}_{h,k})}. \end{aligned}$$

Let \(g: x \rightarrow \frac{x^2}{1+x}\), \({\text {dom}} g = [0,1)\), and take expectation on both sides conditioned on \(\mathcal {F}_k\). Then, \( \mathbb {E} \left[ f_h(\textbf{x}_{h,k+1}) | \mathcal {F}_k \right] - f_h(\textbf{x}_{h,k}) \le - \alpha \beta \mathbb {E} \left[ g (\hat{\lambda }(\textbf{x}_{h,k})) | \mathcal {F}_k \right] \). By convexity of g and Jensen’s inequality we obtain \( \mathbb {E} \left[ f_h(\textbf{x}_{h,k+1}) | \mathcal {F}_k \right] - f_h(\textbf{x}_{h,k}) \le - \alpha \beta g \left( \mathbb {E} \left[ \hat{\lambda }(\textbf{x}_{h,k}) | \mathcal {F}_k \right] \right) \). Using this, the fact that g is monotone increasing, and \(\mathbb {E} \left[ \hat{\lambda }(\textbf{x}_{h,k}) | \mathcal {F}_k \right] > \tilde{\mu } \lambda (\textbf{x}_{h,k})\), we obtain

$$\begin{aligned} \mathbb {E} \left[ f_h(\textbf{x}_{h,k+1}) | \mathcal {F}_k \right] - f_h(\textbf{x}_{h,k}) \le - \alpha \beta \frac{\tilde{\mu }^2 \lambda (\textbf{x}_{h,k})^2}{1 + \tilde{\mu } \lambda (\textbf{x}_{h,k})} \end{aligned}$$

which holds almost surely. Taking expectation on both sides w.r.t. the randomness induced by Definition 2.1, the claim follows with \(\tilde{\gamma } = \alpha \beta \, \mathbb {E} \left[ \frac{\tilde{\mu }^2 \lambda (\textbf{x}_{h,k})^2}{1 + \tilde{\mu } \lambda (\textbf{x}_{h,k})} \right] >0\). Last, one can also show a more conservative bound using \(\mathbb {E} \left[ \hat{\lambda }(\textbf{x}_{h,k}) | \mathcal {F}_k \right] > \nu \), which gives \(\tilde{\gamma } = \alpha \beta \frac{\nu ^2}{1 + \nu }\). \(\square \)

The above result, even though it does not discuss the convergence rate in expectation, is important as it indicates that, when the coarse directions are constructed randomly, we can always find \(\mu >0\) such that, on average, Condition 1 is satisfied. This means that convergence to the global minimum is attained as long as \(\hat{\lambda }(\textbf{x}_{h,k}) > \nu \). Hence, Theorem 3.4 and Theorem 3.3 are complementary and together indicate that the coarse directions constructed from Definition 2.1 will always be effective and yield progress of Algorithm 1. Thus, convergence of Algorithm 1 to the minimum should be expected without ever taking fine directions. This important result is obtained since the method enjoys a global convergence analysis and will not diverge; see Theorem 3.2 and the discussion in Sect. 3.4.

4 Examples and Numerical Results

In this section we validate the efficacy of the proposed algorithm and verify our theoretical results on optimization problems that arise in machine learning applications. Specifically, in the first two subsections we use the conventional SIGMA (uniform sampling) to solve maximum likelihood estimation problems based on the Poisson and logistic models, respectively. Full details about the experimental set-up, objective functions and datasets are given in Appendix C. Furthermore, in Sect. 4.3 we provide comparisons between the conventional SIGMA and SIGMA with the different sampling strategies of Sect. 2.5. Moreover, we provide additional experiments and test SIGMA with sub-sampling (Appendix C.2.2). In Remark C.1 we discuss how to efficiently compute the reduced Hessian matrix for Generalized Linear Models.

4.1 Poisson Regression and Impact of the Spectral Gap

Fig. 1 The first figure shows convergence of SIGMA for different values of p. The second and third figures compare the optimization methods over the \(\ell _2\) and elastic-net regularized Poisson regression, respectively

In this section we attempt to verify the claim that SIGMA enjoys a very fast super-linear convergence rate when there is a large gap between the eigenvalues of the Hessian matrix (see the discussion that follows Theorem 3.3). To illustrate this, we generate an input matrix \(\textbf{A}\) based on the Singular Value Decomposition (SVD) and then run experiments on the Poisson model. More precisely, \(\textbf{A}\) is constructed as follows. We first generate random orthogonal matrices \(\textbf{U} \in \mathbb {R}^{m \times m}\) and \(\textbf{V} \in \mathbb {R}^{N \times N}\), \(m > N\), from the \(\mathcal {O}(N)\) Haar distribution [24]. Denote the matrix of singular values by \( \mathbf {\Sigma }:= \begin{bmatrix} \mathbf {\Sigma }_N & \textbf{0} \\ \end{bmatrix}^T \in \mathbb {R}^{m \times N}\), where \(\mathbf {\Sigma }_N:= {\text {diag}} (\sigma _1, \sigma _2, \ldots , \sigma _N)\) is a square diagonal matrix. To form a gap, we select some p from the set of integers \(S_N = \{1,2, \ldots , N \} \) and compute N singular values such that \(\sigma _1> \sigma _2> \cdots> \sigma _p \gg \sigma _{p+1} \ge \cdots \ge \sigma _N>0\), where the first p are evenly spaced. Then, we set \(\textbf{A}:= \textbf{U} \mathbf {\Sigma } \textbf{V}^T\) and expect \(\textbf{A}\) to have a large gap between the p-th and \((p+1)\)-th singular values. Further, note that the Hessian matrix of the Poisson model contains the product \(\textbf{A}^T \textbf{A}\). Observe that \(\textbf{A}^T \textbf{A} = \textbf{V} \mathbf {\Sigma }_N^2 \textbf{V}^T\), and thus \(\sigma _1^2, \sigma _2^2, \ldots , \sigma _N^2\) are the eigenvalues of \(\textbf{A}^T \textbf{A}\). Therefore, by the construction of \(\mathbf {\Sigma }_N\), we should expect the Hessian matrix to have a large gap between the p-th and \((p+1)\)-th eigenvalues.
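A sketch of this construction is given below (our own illustration; the exact singular-value ranges used in the experiments are described in Appendix C and may differ).

```python
# Sketch (ours): build A = U Sigma V^T with a large gap between the p-th and
# (p+1)-th singular values.  Orthogonal factors are drawn via QR of Gaussian
# matrices, a standard way to sample Haar-distributed orthogonal matrices.
import numpy as np

def gapped_matrix(m, N, p, gap=1e3, seed=0):
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))
    V, _ = np.linalg.qr(rng.standard_normal((N, N)))
    # first p singular values large and evenly spaced, the rest small
    s = np.concatenate([np.linspace(2 * gap, gap, p), np.linspace(1.0, 0.1, N - p)])
    Sigma = np.zeros((m, N))
    Sigma[:N, :N] = np.diag(s)
    return U @ Sigma @ V.T

A = gapped_matrix(m=200, N=100, p=20)
print(np.linalg.svd(A, compute_uv=False)[:25])   # visible drop after the 20th value
```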

In the first experiment (Fig. 1a) we compare the performance of SIGMA for three different locations of the “singular value gap”, i.e., \(p \in \{ 0.2N, 0.5 N, 0.8N \} \). In all three cases we set the coarse model dimension to \(n = N/2\). Figure 1a shows the effect of the eigenvalue gap on the convergence rate of SIGMA. Clearly, when the gap is placed in the first few singular values (or eigenvalues, respectively) SIGMA achieves a very fast super-linear rate. In particular, when \(p = 0.2N\), the convergence of SIGMA to the solution is about five times faster than when \(p=0.8N\), which verifies our intuition that smaller p yields faster convergence rates for SIGMA. Similar behavior should be expected when the Hessian matrix is (nearly) low-rank or when the important second-order information is concentrated in the first few eigenvalues.

Next, we compare SIGMA against the other optimization methods when \(\textbf{A}\) is generated with \(p = 0.5N\). The coarse model dimension for SIGMA is set as above. For NewSamp and SubNewton we use \(|S_m| = m/2\) samples at each iteration. Figure 1b compares the optimization methods on the \(\ell _2\)-regularized Poisson regression. We observe that in the first phase both the Newton method and SIGMA behave similarly and move rapidly towards the solution. Then, the Newton method enters its quadratic phase and converges in a few iterations, while SIGMA achieves a super-linear rate. Besides the Newton method and SIGMA, SubNewton is the only algorithm that is able to reach a satisfactory tolerance within the time limit, but it is much slower.

Enforcing sparsity in the solution is typical in most imaging applications that employ the Poisson model. Thus, we further run experiments with the pseudo-Huber function with \(\xi _1 = 10^{-3}\). The algorithmic set-up is as described above. The results for this experiment are reported in Fig. 1c. Clearly, when sparsity is required, SIGMA outperforms all its competitors. In particular, Fig. 1c shows that SIGMA achieves a quadratic rate and is at least two times faster than the Newton and SubNewton methods. On the other hand, NewSamp, GD and SGD fail to reach the required tolerance within the time limit. Finally, we note that for smaller values of p and/or larger input matrices \(\textbf{A}\) the comparison is even more favorable for SIGMA.

Fig. 2 Performance of various optimization methods on different datasets for the logistic regression

4.2 Logistic Regression

In this set of experiments we report the performance of the optimization algorithms on the logistic model and three real datasets. In the first one (Fig. 2a) we consider the Leukemia dataset, for which \(m \ll N\). The coarse model dimension for SIGMA is set to \(n = 0.1N\). As for NewSamp and SubNewton, \(|S_m| = 0.1 m\) data points were used to form the Hessian. Note that since \(m \ll N\), one should expect an eigenvalue gap located near \(p = m\), and since m is small SIGMA should enjoy a very fast convergence rate. This is illustrated in Fig. 2a, where SIGMA converges in a few seconds while its competitors are unable to get close to the true minimizer before the time limit is exceeded. Note also that SIGMA obtains a convergence behavior similar to that in Fig. 1a, but sharper here since p is much smaller than N. The only method that comes close to our approach is GD, which reaches a satisfactory tolerance very quickly but then slows down and thus struggles to approach the optimal point.

In the second experiment we consider the Gisette dataset with elastic-net regularization. For this example we set \(n = N/2\) and \(|S_m| = m/2\). The performance of the optimization algorithms is illustrated in Fig. 2b, where we observe that SIGMA outperforms all its competitors. The Newton method achieves a quadratic rate and thus reaches very high accuracy, but it is slower than SIGMA, as the latter clearly achieves a super-linear rate and has cheaper iterations. The sub-sampled Newton methods reach a satisfactory tolerance within the time limit; however, note that very sparse solutions are obtained only at high accuracy. The GD methods, on the other hand, fail to reach even a sufficiently good solution before the time limit is exceeded.

We end this set of experiments with the real-sim dataset over the \(\ell _2\)-regularized logistic regression. Since both m and N are quite large here, other than multilevel methods, only gradient-based methods can be employed to minimize the logistic model due to memory limitations. The comparison between SIGMA, GD and SGD is illustrated in Fig. 2c. Indisputably, Fig. 2c shows the efficiency of SIGMA against the first-order methods: in this example, SIGMA is able to return a very accurate solution, while the GD methods perform poorly, achieving a very slow linear convergence rate even from the start of the process.

4.3 Different Sampling Strategies

In this section, we revisit the numerical experiments of the previous sections to compare SIGMA with the adaptive and mixed sampling strategies against the conventional SIGMA, which collects samples uniformly. For the mixed strategy, we set \(\tau = 0.5\) in all experiments. Besides the parameter \(\tau \), we consider the same set-up and tuning (see the sections above and Appendix C.2.1). The results can be found in Figs. 3, 4 and 5. Here, we report iterations instead of CPU time, as we are interested in the effect of the sampling strategies on the convergence rate of SIGMA.
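As an illustration, the sketch below shows one plausible way to realize a mixed sampling step as a convex combination of an adaptive distribution (proportional to gradient magnitudes) and the uniform distribution with weight \(\tau \). This is our own reading; the exact adaptive and mixed distributions of Sect. 2.5 are not reproduced here and may differ in details.

```python
# Hedged sketch (ours, not the paper's exact definition): sample n coarse
# coordinates from a tau-weighted mixture of an adaptive distribution
# (l1-normalized gradient magnitudes) and the uniform distribution.
import numpy as np

def mixed_sample(grad, n, tau=0.5, rng=None):
    rng = rng or np.random.default_rng()
    N = grad.size
    p_adapt = np.abs(grad) / np.abs(grad).sum()  # adaptive part
    p = tau * p_adapt + (1 - tau) / N            # mixture with the uniform part
    return rng.choice(N, size=n, replace=False, p=p)

g = np.random.default_rng(1).standard_normal(1000)
print(mixed_sample(g, n=50, tau=0.5, rng=np.random.default_rng(2)))
```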

Fig. 3 Comparisons between SIGMA with Uniform, Adaptive and mixed sampling strategies

Fig. 4 Comparisons between SIGMA with Uniform, Adaptive and mixed sampling strategies

Observe that, in all the experiments, the adaptive and/or mixed strategies show substantial improvements in the convergence rate over the conventional SIGMA. We notice that the adaptive strategy is particularly efficient in the early stages of the algorithm and, in some examples, tends to slow down near the solution, especially when high accuracy is required. Therefore, when the goal is a very accurate solution, the results suggest that uniform sampling may yield better convergence rates in the second phase of SIGMA. On the other hand, SIGMA with the mixed strategy significantly improves the convergence rate of the conventional SIGMA in all experiments and in both of its phases. Note that the mixed strategy was found to be at least \(50\%\) faster, and can be up to five times faster, than SIGMA with uniform sampling (see Fig. 3c). Recall that the weight parameter \(\tau \) is set to 0.5 to account equally for adaptive and uniform sampling. However, as the experiments indicate, it would be more beneficial to set an adaptive (iteration-dependent) value \(\tau _k\) that starts from one (adaptive samples) and decays gradually towards zero (uniform samples); even faster convergence rates would then be expected for SIGMA. As a result, with the experiments presented in this section, we significantly improve the convergence behavior of multilevel and subspace methods compared to conventional methods that draw samples uniformly. In addition, SIGMA with the adaptive and mixed sampling strategies compares even more favorably to the state-of-the-art methods considered in the previous sections.

5 Conclusions and Perspectives

Fig. 5 Comparisons between SIGMA with Uniform, Adaptive and mixed sampling strategies

We proposed SIGMA, a second-order variant of the Newton algorithm, and performed its convergence analysis using the theory of self-concordant functions. We addressed two significant weaknesses of existing second-order methods for machine learning applications: the lack of a globally scale-invariant analysis and of local super-linear convergence rates without restrictive assumptions. Our theory is general in the sense that we do not assume specific problem structures, and it also allows for a quadratic convergence rate. When the coarse direction is constructed randomly through the Nyström method, the convergence analysis is provided in probability. Further, we introduced new adaptive sampling strategies to replace the conventional uniform sampling scheme, and we showed that the new strategies offer convergence results with higher probabilities than uniform sampling. In addition, we verified the theoretical results through numerical experiments: we illustrated that SIGMA can achieve both super-linear and quadratic convergence rates on realistic models, and we further reported substantial improvements in the convergence rate of SIGMA with the alternative strategies over the conventional SIGMA. As a future direction, we plan to extensively examine the practical behavior of SIGMA using variants of the new adaptive techniques. In particular, we plan to introduce adaptive, iteration-dependent \(\tau _k\). We also consider replacing the \(\Vert \cdot \Vert _1\) in the definition of the adaptive sampling strategy with other norms. We anticipate that these modifications will offer better approximations of the Newton decrement.