1 Introduction

Bi-level optimization has become an important and popular optimization framework that covers a variety of emerging machine learning applications, e.g., meta-learning (Franceschi et al., 2018; Bertinetto et al., 2018; Rajeswaran et al., 2019; Ji et al., 2020), hyperparameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Feurer & Hutter, 2019), reinforcement learning (Konda and Tsitsiklis, 2000; Hong et al., 2020), etc. A standard formulation of bi-level optimization takes the following form:

$$\begin{aligned} \min _{x\in \mathbb {R}^d} f(x, y^*(x)), \quad \text{ where } \quad y^*(x)\in \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} g(x,y), \end{aligned}$$

where the upper- and lower-level objective functions f and g are both jointly continuously differentiable. To elaborate, bi-level optimization aims to minimize the upper-level compositional objective function \(f(x,y^*(x))\), in which \(y^*(x)\) is the minimizer of the lower-level objective function g(x, y).
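As a simple illustrative instance (a standard textbook example in our own notation, not taken from the works cited above), tuning a ridge-regression weight \(x>0\) on a validation set is a bi-level problem:

$$\begin{aligned} \min _{x>0} \sum _{(a_i,b_i)\in \mathcal {D}_{\text {val}}} \big (a_i^\top y^*(x)-b_i\big )^2, \quad \text{ where } \quad y^*(x)=\mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} \sum _{(a_i,b_i)\in \mathcal {D}_{\text {tr}}} \big (a_i^\top y-b_i\big )^2 + x\Vert y\Vert ^2, \end{aligned}$$

where the upper level measures validation error and the strongly convex lower level fits the model on the training data \(\mathcal {D}_{\text {tr}}\).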

Solving the above bi-level optimization problem is highly non-trivial as it involves two nested minimization problems. In the existing literature, many algorithms have been developed for bi-level optimization. Early works (Hansen et al., 1992; Shi et al., 2005; Moore, 2010) reformulated the bi-level problem into a single-level problem with constraints on the optimality conditions of the lower-level problem, yet this reformulation involves a large number of constraints that are hard to handle in practice. More recently, gradient-based bi-level optimization algorithms have been developed, which leverage either the approximate implicit differentiation (AID) scheme (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Liao et al., 2018; Ghadimi and Wang, 2018; Grazzi et al., 2020; Lorraine et al., 2020) or the iterative differentiation (ITD) scheme (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Shaban et al., 2019; Grazzi et al., 2020) to estimate the gradient of the upper-level function. In particular, the AID scheme is more popular due to its simplicity and computation efficiency. Specifically, the bi-level optimization algorithm with AID (referred to as BiO-AID) has been analyzed for (strongly) convex upper- and lower-level functions (Liu et al., 2020), which does not cover bi-level problems in modern machine learning that usually involve nonconvex upper-level objective functions. On the other hand, recent studies have analyzed the convergence of BiO-AID with a nonconvex upper-level function and a strongly convex lower-level function, and established the convergence of a certain type of gradient norm to zero (Ji et al., 2021; Ghadimi and Wang, 2018; Hong et al., 2020).

However, the existing gradient-based nonconvex bi-level optimization algorithms have limitations in several respects. First, most existing algorithms are not applicable to bi-level problems whose upper-level function involves possibly nonsmooth and nonconvex regularizers, while Huang and Huang (2021) covers only convex regularizers. For example, in the application of data hyper-cleaning, one can improve the learning performance by adding a nonsmooth and nonconvex regularizer that pushes the weights of the clean samples towards 1 while pushing those of the contaminated samples towards 0 (see Sect. 6 for more details). Second, the convergence guarantees of these algorithms typically ensure only a weak gradient-norm convergence (except (Dagréou et al., 2022), which requires a strong global PŁ geometry assumption on both \(f(x,\cdot )\) and \(f(\cdot ,y)\)), which does not necessarily imply the desired convergence of the model parameters. Furthermore, these algorithms suffer from a high computation complexity in nonconvex bi-level optimization. The overarching goal of this work is to develop an efficient and convergent proximal-type algorithm for solving regularized nonconvex and nonsmooth bi-level optimization problems and thereby address the above important issues. We summarize our contributions as follows.

1.1 Our contributions

We propose a proximal BiO-AIDm algorithm (see Algorithm 1) and study its convergence properties. This algorithm is a proximal and momentum-accelerated variant of the BiO-AID algorithm for solving the following class of regularized nonsmooth and nonconvex bi-level optimization problems:

$$\begin{aligned}&\min _{x\in \mathbb {R}^d} f(x, y^*(x)) + h(x), \quad \text{ where } \quad y^*(x)= \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} g(x,y), \end{aligned}$$

where the upper-level objective function f is nonconvex, the lower-level objective function g is strongly convex for any fixed x, and the regularizer h is possibly nonsmooth and nonconvex. In particular, our algorithm applies Nesterov's momentum to accelerate the computation of the implicit gradient involved in the AID scheme.

We first analyze the global (non-asymptotic) convergence properties of proximal BiO-AIDm under standard Lipschitz and smoothness assumptions on the objective functions. The key to our analysis is to show that proximal BiO-AID admits an intrinsic potential function \(H(x_k,y_k)\) that takes the form

$$\begin{aligned} H(x,y'):= \Phi (x) +h(x) +\frac{7}{8}\Vert y^{(T)}(x,y')-y^*(x)\Vert ^2, \end{aligned}$$

where \(y^{(T)}(x,y')\) is obtained by applying Nesterov's accelerated gradient descent to minimize \(g(x,\cdot )\) with initial point \(y'\) for T iterations. In particular, we prove that such a potential function is monotonically decreasing along the optimization trajectory, i.e., \(H(x_{k+1},y_{k+1}) < H(x_k,y_k)\), which implies that proximal BiO-AIDm can be viewed as a descent-type algorithm and is numerically stable. Based on this property, we formally prove that every limit point of the model parameter trajectory \(\{x_k\}_k\) generated by proximal BiO-AIDm is a critical point of the regularized bi-level problem. Furthermore, when the regularizer is convex, we show that proximal BiO-AIDm requires a computation complexity of \(\widetilde{\mathcal {O}}(\kappa ^{3.5}\epsilon ^{-2})\) (number of gradient, Hessian-vector product and proximal evaluations) for achieving a critical point x that satisfies \(\Vert G(x)\Vert \le \epsilon\), where \(\kappa\) denotes the problem condition number and G(x) denotes the proximal gradient mapping. As shown in Table 1, this is the first global convergence and complexity result of proximal BiO-AIDm in regularized nonsmooth and nonconvex bi-level optimization, and it improves the state-of-the-art complexity of BiO-AID (for smooth nonconvex bi-level optimization) by a factor of \({\widetilde{\mathcal {O}}}(\sqrt{\kappa })\).

Besides investigating the global convergence properties, we further establish the asymptotic function value convergence rates of proximal BiO-AIDm under a local Łojasiewicz-type nonconvex geometry, which covers a broad spectrum of local nonconvex geometries. Specifically, we characterize the asymptotic convergence rates of proximal BiO-AIDm in the full spectrum of the Łojasiewicz geometry parameter \(\theta\). We prove that as the local geometry becomes sharper (i.e., with a larger \(\theta\)), the asymptotic convergence rate of proximal BiO-AIDm improves from sublinear to superlinear convergence. The proof of these local asymptotic convergence rates requires establishing two properties that, to our knowledge, have not been proved in the existing literature. The major property is that the aforementioned potential function H is monotonically decreasing. The other property is the Lipschitz continuity of \(y^{(T)}\), which is challenging to prove due to the momentum acceleration.

Table 1 List of existing complexity results for bi-level algorithms. (\(\checkmark\) in the columns “non-smooth” and “momentum accelerated” respectively means the objective function is non-smooth and the algorithm has momentum acceleration, and \(\times\) means the opposite)

1.2 Related work

Bi-level Optimization Algorithms Bi-level optimization has been studied for decades (Bracken & McGill, 1973), and various types of bi-level algorithms have been proposed, including but not limited to single-level penalized methods (Shi et al., 2005; Moore, 2010) and gradient-based methods via AID- or ITD-based hypergradient estimation (Domke, 2012; Pedregosa, 2016; Franceschi et al., 2018; Ghadimi and Wang, 2018; Hong et al., 2020; Liu et al., 2020; Li et al., 2020; Grazzi et al., 2020; Ji et al., 2021; Lorraine et al., 2020; Ji and Liang, 2021). Huang and Huang (2021) proposed a Bregman distance-based method. In particular, (Ghadimi and Wang, 2018; Hong et al., 2020; Ji et al., 2021; Yang et al., 2021; Chen et al., 2021a; Guo and Yang, 2021; Huang and Huang, 2021) provided complexity analyses of their proposed methods for bi-level optimization problems under different types of loss geometries. Ji and Liang (2021) studied the lower complexity bounds for bi-level optimization under (strongly) convex geometry and proposed a nearly-optimal accelerated algorithm. All the existing analyses of nonconvex bi-level optimization algorithms either focus on gradient norm convergence or require a strong global PŁ geometry assumption on both \(f(x,\cdot )\) and \(f(\cdot ,y)\) (Dagréou et al., 2022). In this paper, we formally establish the parameter and function value convergence of proximal BiO-AIDm in regularized nonconvex and nonsmooth bi-level optimization.

Applications of Bi-level Optimization Bi-level optimization has been widely applied to meta-learning (Snell et al., 2017; Franceschi et al., 2018; Rajeswaran et al., 2019; Zügner and Günnemann, 2019; Ji et al., 2020; Ji, 2021), hyperparameter optimization (Franceschi et al., 2017; Shaban et al., 2019), reinforcement learning (Konda and Tsitsiklis, 2000; Hong et al., 2020), and data poisoning (Mehra et al., 2020). For example, Snell et al. (2017) reformulated the meta-learning objective function under a shared embedding model into a bi-level optimization problem. Rajeswaran et al. (2019) proposed a bi-level optimizer named iMAML as an efficient variant of model-agnostic meta-learning (MAML) (Finn et al., 2017), and analyzed the convergence of iMAML under a strongly convex inner-loop loss. Fallah et al. (2020) characterized the convergence of MAML and first-order MAML under nonconvex loss functions. Ji et al. (2020) studied the convergence behaviors of almost no inner loop (ANIL) (Raghu et al., 2019) under different inner-loop loss geometries of the MAML objective function. Recently, Mehra et al. (2020) devised bi-level-optimization-based data poisoning attacks on certifiably robust classifiers.

Nonconvex Kurdyka-Łojasiewicz Geometry A broad class of regular functions has been shown to satisfy the local nonconvex KŁ geometry (Bolte et al., 2007), which affects the asymptotic convergence rates of gradient-based optimization algorithms. The KŁ geometry has been exploited to study the convergence of various first-order algorithms for solving minimization problems, including gradient descent (Attouch & Bolte, 2009), alternating gradient descent (Bolte et al., 2014), distributed gradient descent (Zhou et al., 2016; Zhou et al., 2018), and accelerated gradient descent (Li et al., 2017). It has also been exploited to study the convergence of second-order algorithms such as Newton's method (Noll and Rondepierre, 2013; Frankel et al., 2015) and the cubic regularization method (Zhou et al., 2018).

2 Problem formulation and preliminaries

In this paper, we consider the following regularized nonconvex bi-level optimization problem:

$$\begin{aligned}&\min _{x\in \mathbb {R}^d} f(x, y^*(x)) + h(x), \quad \text{ where } \quad y^*(x)= \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} g(x,y), \qquad \mathrm {(P)} \end{aligned}$$

where both the upper-level objective function f and the lower-level objective function g are jointly continuously differentiable, and the regularizer h is possibly nonsmooth and nonconvex. We note that adding a regularizer to the bi-level optimization problem allows us to impose desired structures on the solution, which is important for many machine learning applications. For example, in the application of data hyper-cleaning (see the experiment in Sect. 6 for more details), one aims to improve the learning performance by adding a regularizer that pushes the weights of the clean samples towards 1 while pushing the weights of the contaminated samples towards 0. Such a regularizer often takes a nonsmooth and nonconvex form.

To simplify the notation, throughout the paper we define the function \(\Phi (x):= f(x, y^*(x))\). We also adopt the following standard assumptions regarding the regularized bi-level optimization problem (P).

Assumption 1

The functions in the regularized bi-level optimization problem (P) satisfy:

  1. Function \(g(x,\cdot )\) is \(\mu\)-strongly convex for all x and function \(\Phi (x)=f(x,y^*(x))\) is nonconvex;

  2. Function h is proper and lower-semicontinuous (possibly nonsmooth and nonconvex);

  3. Function \((\Phi +h)(x)\) is bounded below and has bounded sub-level sets.

In Assumption 1, the regularizer h can be any nonsmooth and nonconvex function so long as it is a closed function. This covers most of the regularizers used in practice, including any proper convex function (possibly nonsmooth, e.g., the \(\ell _1\) norm), the \(\ell _p\) norm with \(p>0\) (possibly nonconvex and nonsmooth), the \(\ell _0\) regularizer \(h(x)=\lambda \Vert x\Vert _0\) (\(\lambda >0\), where \(\Vert x\Vert _0\) denotes the number of nonzero entries of the vector x), low-rank regularizers (Yao et al., 2015), and the regularizer \(-\gamma \min (|\lambda _i |,a)\) used in our experiment (see Sect. 6 for details). In addition to Assumption 1, we also impose the following Lipschitz continuity and smoothness conditions on the objective functions, which are widely adopted in the existing literature (Ghadimi and Wang, 2018; Ji et al., 2020). In the following assumption, we denote \(z:=(x,y)\).
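To illustrate that such nonconvex regularizers still admit tractable proximal mappings (formally defined in Sect. 3), the following minimal sketch implements the well-known hard-thresholding proximal mapping of \(h(x)=\lambda \Vert x\Vert _0\); the function name and the numerical check are our own.

```python
import numpy as np

def prox_l0(v, lam):
    """Proximal mapping of h(x) = lam * ||x||_0 (hard thresholding).

    Per coordinate, keeping u = v costs lam while setting u = 0 costs
    v**2 / 2, so v is kept only when |v| > sqrt(2 * lam).
    """
    u = v.copy()
    u[np.abs(v) <= np.sqrt(2.0 * lam)] = 0.0
    return u

v = np.array([0.1, -2.0, 0.5, 3.0])
print(prox_l0(v, lam=0.5))  # [0. -2. 0. 3.]: entries below the threshold vanish
```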

Assumption 2

The functions f(z) and g(z) in the bi-level problem (P) satisfy:

  1. Function f(z) is M-Lipschitz. The gradients \(\nabla f(z)\) and \(\nabla g(z)\) are \(L_f\)-Lipschitz and \(L_g\)-Lipschitz, respectively.

  2. The Jacobian \(\nabla _x\nabla _y g(z)\) and the Hessian \(\nabla _y^2 g(z)\) are \(\tau\)-Lipschitz and \(\rho\)-Lipschitz, respectively.

Assumptions 1 and 2 imply that the mapping \(y^*(x)\) is \(\kappa _g\)-Lipschitz, where \(\kappa _g=L_g/\mu >1\) denotes the condition number of the lower-level function g. Similarly, we denote \(\kappa _f=L_f/\mu\) for the upper-level function f (Lin et al., 2020; Chen et al., 2021b). Note that \(\kappa _f\ge 1\) does not necessarily hold, and thus \(\kappa _f\) is not a condition number.

Lastly, note that the problem (P) can be rewritten as the regularized minimization problem \(\min _{x\in \mathbb {R}^d} \Phi (x)+h(x)\), which can be nonsmooth and nonconvex. Therefore, our optimization goal is to find a critical point \(x^*\) of the function \(\Phi (x)+h(x)\) that satisfies the optimality condition \(\textbf{0}\in \partial (\Phi + h)(x^*)\). Here, \(\partial\) denotes the following generalized notion of subdifferential.

Definition 1

(Subdifferential and critical point, (Rockafellar & Wets, 2009)) The Fréchet subdifferential \(\widehat{\partial }F\) of a function F at \(x\in \mathop {\textrm{dom}}F\) is the set of \(u\in \mathbb {R}^d\) defined as

$$\begin{aligned} \widehat{\partial }F(x) = \Big \{u: \liminf _{z\ne x, z\rightarrow x} \frac{F(z) - F(x) - u^\top (z-x)}{\Vert z-x\Vert } \ge 0 \Big \}, \end{aligned}$$

and the limiting subdifferential \(\partial F\) at \(x\in \text {dom}~F\) is the graphical closure of \(\widehat{\partial }F\) defined as:

$$\begin{aligned} \partial F(x) = \big \{u: \exists \, x_k \rightarrow x,~ F(x_k) \rightarrow F(x),~ u_k\in \widehat{\partial } F(x_k),~ u_k \rightarrow u \big \}. \end{aligned}$$

The set of critical points of F is defined as \(\{ x: \textbf{0}\in \partial F(x) \}\).
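As a standard worked example (ours, for illustration), consider \(F(x)=-|x|\) on \(\mathbb {R}\). At \(x=0\), no \(u\) satisfies the Fréchet inequality from both sides, so \(\widehat{\partial }F(0)=\emptyset\); taking sequences \(x_k\rightarrow 0\) from the right and from the left yields gradients \(-1\) and \(+1\), so the limiting subdifferential is \(\partial F(0)=\{-1,+1\}\). Since \(\textbf{0}\notin \partial F(0)\), the origin is not a critical point of \(-|x|\), even though it is a local maximum. In contrast, for the convex \(F(x)=|x|\) both notions coincide and \(\partial F(0)=[-1,1]\ni \textbf{0}\).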

3 Proximal bi-level optimization with AID

In this section, we introduce the proximal bi-level optimization algorithm with momentum accelerated approximate implicit differentiation (referred to as proximal BiO-AIDm). Recall that \(\Phi (x):= f(x, y^*(x))\). The main challenge for solving the regularized bi-level optimization problem (P) is the computation of the gradient \(\nabla \Phi (x)\), which involves higher-order derivatives of the lower-level function. Fortunately, this gradient can be effectively estimated using the popular AID scheme as we elaborate below.

First, it is shown in (Ji et al., 2021) that \(\nabla \Phi (x)\) takes the following analytical form.

$$\begin{aligned} \nabla \Phi (x_k) = \nabla _x f(x_k,y^*(x_k)) -\nabla _x \nabla _y g(x_k,y^*(x_k))\, v_k^*, \end{aligned}$$

where \(v_k^*\) corresponds to the solution of the linear system \(\nabla _y^2 g(x_k,y^*(x_k))v= \nabla _y f(x_k,y^*(x_k))\). This expression follows from the chain rule together with implicit differentiation of the lower-level optimality condition \(\nabla _y g(x,y^*(x))=\textbf{0}\), which yields \(\nabla _x y^*(x) = -\nabla _x \nabla _y g(x,y^*(x))\big (\nabla _y^2 g(x,y^*(x))\big )^{-1}\). In particular, \(y^*(x_k)\) is the minimizer of the strongly convex function \(g(x_k,\cdot )\), and it can be effectively approximated by running T Nesterov's accelerated gradient descent updates on \(g(x_k,\cdot )\) and taking the output \(y_{k+1}\) as the approximation. With this approximated minimizer, the AID scheme estimates the gradient \(\nabla \Phi (x_k)\) as follows:

$$\begin{aligned} \text {(AID):}\quad \widehat{\nabla }\Phi (x_k) = \nabla _x f(x_k,y_{k+1}) -\nabla _x \nabla _y g(x_k,y_{k+1})\,\widehat{v}_k^*, \end{aligned}$$
(1)

where \(\widehat{v}_k^*\) is the solution of the approximated linear system \(\nabla _y^2 g(x_k,y_{k+1}) v = \nabla _y f(x_k,y_{k+1})\), which can be efficiently solved by standard conjugate-gradient solvers. For simplicity of the discussion, we assume that \(\widehat{v}_k^*\) is exactly computed in the main body of the paper. In Appendix F, we discuss how to obtain an inexact solution to this linear system via conjugate-gradient solver, and provide a comprehensive analysis of the impact of such inexactness on the overall computation complexity of the proposed algorithm. Moreover, the Jacobian-vector product involved in (1) can be efficiently computed using the existing automatic differentiation packages (Domke, 2012; Grazzi et al., 2020).
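To make the scheme concrete, here is a minimal numpy sketch of the AID estimate (1); the function names and the matrix-free interface are our own illustrative choices, and in practice the Hessian- and Jacobian-vector products would be supplied by automatic differentiation.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=100, tol=1e-10):
    """Solve the symmetric positive definite system matvec(v) = b by CG."""
    v = np.zeros_like(b)
    r = b - matvec(v)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        v += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

def aid_hypergradient(grad_x_f, grad_y_f, hess_yy_g_mv, jac_xy_g_mv, x, y):
    """AID estimate (1): grad_x f(x, y) - jac_xy_g(x, y) @ v_hat, where
    v_hat solves hess_yy_g(x, y) @ v = grad_y f(x, y) via CG."""
    v_hat = conjugate_gradient(lambda u: hess_yy_g_mv(x, y, u), grad_y_f(x, y))
    return grad_x_f(x, y) - jac_xy_g_mv(x, y, v_hat)
```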

Based on the estimated gradient \(\widehat{\nabla } \Phi (x_k)\), we can then apply the standard proximal gradient algorithm (a.k.a. forward-backward splitting) (Lions & Mercier, 1979) to solve the regularized optimization problem (P). This algorithm is referred to as proximal BiO-AIDm and is summarized in Algorithm 1. Specifically, in each outer iteration k, we first run T accelerated gradient descent steps with Nesterov's momentum from the initial point \(y_k\) to minimize \(g(x_k,\cdot )\) and find an approximated minimizer \(y_{k+1}=y^{(T)}(x_k,y_k)\approx y^*(x_k)\), where the notation \(y^{(T)}(x_k,y_k)\) emphasizes the dependence on \(x_k\) and \(y_k\). Then, this approximated minimizer is utilized by the AID scheme to estimate \(\nabla \Phi (x_k)\). Finally, we apply a proximal gradient update to minimize the regularized objective function \(\Phi (x)+h(x)\). Here, the proximal mapping of any function h at v is defined as

$$\begin{aligned} \textrm{prox}_{h} (v):= \mathop {\mathrm {arg\,min}}\limits _{u\in \mathbb {R}^d} \Big \{h(u) + \frac{1}{2}\Vert u-v\Vert ^2\Big \}. \end{aligned}$$
Algorithm 1 Proximal BiO-AIDm (proximal bi-level optimization with momentum-accelerated AID); a simplified sketch follows.
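The following is a compact Python sketch of Algorithm 1 under our own simplifying assumptions: a callable-based interface, the initialization \(y^{(0)}=y^{(-1)}=y_k\) for the momentum recursion, and a hypergrad callable standing for the AID estimate (1) (e.g., the aid_hypergradient sketch above). The inner loop implements the update \(y^{(t)}=(1+\eta )G_x(y^{(t-1)})-\eta G_x(y^{(t-2)})\) with \(G_x(y)=y-\alpha \nabla _y g(x,y)\) used in our analysis (see Sect. 5).

```python
import numpy as np

def inner_agd(grad_y_g, x, y_init, alpha, eta, T):
    """T accelerated gradient steps on the strongly convex g(x, .), using
    y_t = (1 + eta) * G(y_{t-1}) - eta * G(y_{t-2}), G(y) = y - alpha * grad."""
    G = lambda y: y - alpha * grad_y_g(x, y)
    y_prev, y = y_init, y_init          # assume y^{(0)} = y^{(-1)} = y_init
    for _ in range(T):
        y_prev, y = y, (1.0 + eta) * G(y) - eta * G(y_prev)
    return y

def proximal_bio_aidm(x0, y0, grad_y_g, hypergrad, prox_h, alpha, beta, eta, T, K):
    """Outer loop: accelerated inner updates, AID hypergradient, proximal step.
    prox_h(v, beta) should return prox_{beta * h}(v)."""
    x, y = x0, y0
    for _ in range(K):
        y = inner_agd(grad_y_g, x, y, alpha, eta, T)   # y_{k+1} ~ y^*(x_k)
        x = prox_h(x - beta * hypergrad(x, y), beta)   # proximal gradient step
    return x, y
```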

Under Assumptions 1 and 2, the following lemma characterizes the smoothness of \(\Phi\) and the estimation error \(\Vert \widehat{\nabla }\Phi (x_k)- \nabla \Phi (x_k)\Vert\) of the AID scheme.

Lemma 1

Let Assumptions 1.1 and 2 hold. Then, the function \(\Phi\) is differentiable and its gradient \(\nabla \Phi\) is \(L_\Phi\)-Lipschitz with \(L_{\Phi } = L_f + \frac{2L_fL_g+\tau M^2}{\mu } + \frac{\rho L_g M+L_fL_g^2+\tau M L_g}{\mu ^2} + \frac{\rho L_g^2 M}{\mu ^3}\) (Lemma 2 of Ji (2021)). Moreover, the gradient estimate obtained by the AID scheme satisfies (Lemma 2.2a of Ghadimi and Wang (2018))

$$\begin{aligned} \Vert \widehat{\nabla }\Phi (x_k)- \nabla \Phi (x_k)\Vert ^2 \le \Gamma \Vert y_{k+1} - y^*(x_k)\Vert ^2, \end{aligned}$$

where \(\Gamma =4L_f^2+\frac{4\tau ^2 M^2}{\mu ^2} + \frac{4M^2\rho ^2\kappa _g^2}{\mu ^2}+4L_f^2\kappa _g^2\).

4 Global convergence and complexity of proximal BiO-AIDm

In this section, we study the global convergence properties of proximal BiO-AIDm for general regularized nonconvex and nonsmooth bi-level optimization.

First, note that the main update of proximal BiO-AIDm in Algorithm 1 follows the proximal gradient algorithm, which has been proven to generate an optimization trajectory that converges to a critical point in general nonconvex optimization (Attouch & Bolte, 2009). Hence, one may expect that proximal BiO-AIDm should share the same convergence guarantee. However, this is not obvious, as the proof of convergence of the proximal gradient algorithm heavily relies on the fact that it is a descent-type algorithm, i.e., the objective function strictly decreases over the iterations. In contrast, the main update of proximal BiO-AIDm applies an approximated gradient \(\widehat{\nabla }\Phi (x_k)\), which is coupled with both the upper- and lower-level objective functions through the AID scheme; this destroys the descent property of the proximal gradient update and hence invalidates the standard proof of convergence.

The following key result proves that proximal BiO-AIDm does admit an intrinsic potential function that is monotonically decreasing over the iterations. Therefore, it is indeed a descent-type algorithm, which is the first step toward establishing the global convergence.

Proposition 1

Let Assumptions 1 and 2 hold and define the potential function

$$\begin{aligned} H(x,y'):= \Phi (x) +h(x) +\frac{7}{8}\Vert y^{(T)}(x,y')-y^*(x)\Vert ^2. \end{aligned}$$
(2)

Choose hyperparameters \(\alpha =\frac{1}{L_g}\), \(\beta \le \frac{1}{2}(L_{\Phi }+\Gamma +\kappa _g^2)^{-1}\), \(\eta =\frac{\sqrt{\kappa _g}-1}{\sqrt{\kappa _g}+1}\) and \(T\ge \frac{\ln (8(1+\kappa _g))}{\ln ((1-\kappa _g^{-0.5})^{-1})}\). Then, the parameter sequence \(\{x_k\}_k\) generated by Algorithm 1 satisfies, for all \(k=1,2,...,\)

$$\begin{aligned} H(x_{k+1}, y_{k+1}) \le&H(x_k,y_k)- \frac{1}{4\beta } \Vert x_{k+1}-x_k\Vert ^2\\&- \frac{1}{8} \Big (\Vert y_{k+1}-y^*(x_k)\Vert ^2 + \Vert y_{k+2}-y^*(x_{k+1})\Vert ^2 \Big ). \end{aligned}$$

To elaborate, the potential function H consists of two components: the upper-level objective function \(\Phi (x)+h(x)\) and a regularization term \(\Vert y^{(T)}(x,y')-y^*(x)\Vert ^2\) that tracks the optimality gap of the lower-level optimization. Hence, the potential function H fully characterizes the optimization goal of the entire bi-level optimization. Intuitively, if \(\{x_k\}_k\) converges to a certain critical point \(x^*\) and \(y^{(T)}(x_k,y_k)\) converges to \(y^*(x^*)\), then \(H(x_k,y_k)\) converges to the local optimum \((\Phi +h)(x^*)\). Finding such a function is not straightforward, since the coefficient \(\frac{7}{8}\) in the potential function (2) has to be carefully selected to guarantee the monotonic decreasing property (see the proof of Proposition 1 in Appendix A for details of the coefficient selection).

Based on the above characterization of the potential function, we obtain the following global convergence result of proximal BiO-AIDm in general regularized nonconvex optimization.

Theorem 2

Under the same conditions as those of Proposition 1, the parameter sequence \(\{x_k, y_k\}_k\) generated by Algorithm 1 satisfies the following properties.

  1. \(\Vert x_{k+1} - x_k \Vert \overset{k}{\rightarrow }\ 0\), \(\Vert y_{k+1} - y^*(x_k)\Vert \overset{k}{\rightarrow }\ 0\);

  2. The function value sequence \(\{(\Phi +h)(x_k)\}_k\) converges to a finite limit \(H^*>-\infty\);

  3. The sequence \(\{(x_k,y_k)\}_k\) is bounded and has a compact set of limit points. Moreover, \((\Phi +h)(x^*)\equiv H^*\) for any limit point \(x^*\) of \(\{x_k\}_k\);

  4. Every limit point \(x^*\) of \(\{x_k\}_k\) is a critical point of the upper-level function \((\Phi +h)(x)\).

Theorem 2 provides a comprehensive characterization of the global convergence properties of proximal BiO-AIDm in regularized nonconvex and nonsmooth bi-level optimization. Specifically, item 1 shows that the parameter sequence \(\{x_k\}_k\) is asymptotically stable, and \(y_{k+1}\) asymptotically converges to the corresponding minimizer \(y^*(x_k)\) of the lower-level objective function \(g(x_k, \cdot )\). In particular, in the unregularized case (i.e., \(h=0\)), this result reduces to the existing understanding that the gradient norm \(\Vert \nabla \Phi (x)\Vert\) converges to zero (Ji et al., 2021; Ghadimi and Wang, 2018; Hong et al., 2020), which does not imply the convergence of the parameter sequence. Item 2 shows that the function value sequence converges to a finite limit, which is also the limit of the potential function value sequence \(\{H(x_k,y_k)\}_k\). Moreover, items 3 and 4 show that the parameter sequence \(\{x_k\}_k\) has only critical points of the objective function as its limit points, and these limit points lie in a flat region where the corresponding function value equals the constant \(H^*\). Note that due to the nonconvexity of \(\Phi\), \(H^*\) is not necessarily the optimal value, i.e., it is possible that \(H^*>\min _{x\in \mathbb {R}^d} (\Phi +h)(x)\). To summarize, Theorem 2 formally proves that proximal BiO-AIDm eventually converges to critical points in nonsmooth and nonconvex bi-level optimization.

In addition to the above global convergence result, Proposition 1 can be further leveraged to characterize the computation complexity of proximal BiO-AIDm for finding a critical point in regularized nonconvex bi-level optimization. Specifically, when the regularizer h is convex, we can define the following proximal gradient mapping associated with the objective function \(\Phi (x)+h(x)\).

$$\begin{aligned} G(x)=\frac{1}{\beta }\Big (x-\textrm{prox}_{\beta h} \big (x- \beta \nabla \Phi (x)\big )\Big ). \end{aligned}$$
(3)
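As a concrete illustration of (3) (with our own toy choice \(h=\lambda \Vert \cdot \Vert _1\), whose proximal mapping is soft thresholding):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_mapping(x, grad_phi, beta, lam):
    """Proximal gradient mapping (3) with h = lam * ||.||_1;
    x is (near-)critical iff the returned vector is (near-)zero."""
    return (x - soft_threshold(x - beta * grad_phi(x), beta * lam)) / beta
```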

The proximal gradient mapping is a standard metric for evaluating the optimality of regularized nonconvex optimization problems (Nesterov, 2013). It can be shown that x is a critical point of \(\Phi (x)+h(x)\) if and only if \(G(x)=\textbf{0}\), and G(x) reduces to the gradient \(\nabla \Phi (x)\) in the unregularized case. Hence, we define the convergence criterion as finding a near-critical point x that satisfies \(\Vert G(x)\Vert \le \epsilon\) for some pre-determined accuracy \(\epsilon >0\). We obtain the following global convergence rate and complexity of proximal BiO-AIDm.

Corollary 1

Suppose h is convex and the conditions of Proposition 1 hold. Then, the sequence \(\{x_k\}_k\) generated by Algorithm 1 satisfies the following convergence rate.

$$\begin{aligned} \min _{0\le k\le K}\Vert G(x_k)\Vert \le \sqrt{\frac{32}{K\beta }\big (H(x_0,y_0) - \inf _x (\Phi +h)(x)\big )}. \end{aligned}$$
(4)

Moreover, to achieve \(\min _{0\le k\le K}\Vert G(x_k)\Vert \le \epsilon\), we run the algorithm with \(K=32\epsilon ^{-2}(L_{\Phi }+\Gamma +\kappa _g^2)\big (H(x_0,y_0) - \inf _x (\Phi +h)(x)\big )\) outer iterations and \(T=\frac{\ln (8(1+\kappa _g))}{\ln ((1-\kappa _g^{-0.5})^{-1})}\) inner iterations, and the overall computation complexity is \(KT=\frac{32\ln (8(1+\kappa _g))}{\epsilon ^2\ln ((1-\kappa _g^{-0.5})^{-1})}(L_{\Phi }+\Gamma +\kappa _g^2)\big (H(x_0,y_0) - \inf _x (\Phi +h)(x)\big )\).

The dependence of the above computation complexity on \(\epsilon\) and \(\kappa :=\max (\kappa _f,\kappa _g)>1\) is no larger than \(\mathcal {O}(\kappa ^{3.5}(\ln \kappa )\epsilon ^{-2})\). This strictly improves the computation complexity \(\mathcal {O}(\kappa ^4\epsilon ^{-2})\) of BiO-AID, which only applies to smooth nonconvex bi-level optimization (Ji et al., 2021). To elaborate on the reason, in our algorithm, the T Nesterov's accelerated gradient descent steps applied to \(\min _y g(x_k,y)\) achieve the convergence rate \(\Vert y_{k+1}-y^*(x_k)\Vert \le (1+\kappa _g)(1-\kappa _g^{-0.5})^T \Vert y_k-y^*(x_k)\Vert\), which is faster than the rate \(\Vert y_{k+1}-y^*(x_k)\Vert \le (1-\kappa _g^{-1})^T \Vert y_k-y^*(x_k)\Vert\) of standard gradient descent since \(1-\kappa _g^{-0.5}<1-\kappa _g^{-1}\). Therefore, to ensure that \(\Vert y_{k+1}-y^*(x_k)\Vert \le \frac{1}{4} \Vert y_k-y^*(x_k)\Vert\), Nesterov's accelerated gradient descent requires \(T=\frac{\ln (8(1+\kappa _g))}{\ln ((1-\kappa _g^{-0.5})^{-1})}=\mathcal {O}(\sqrt{\kappa _g}\ln \kappa _g)\) steps, which is much smaller than the \(T=\mathcal {O}(\kappa _g)\) required by standard gradient descent. On the other hand, the number of outer iterations K is the same for both algorithms, so Nesterov's accelerated gradient descent yields a smaller overall computation complexity KT than standard gradient descent. In addition, (Ji et al., 2021) handles only a smooth upper-level function f, while we allow a nonsmooth regularizer h, which requires analyzing the nonconvex proximal gradient mapping (3). To the best of our knowledge, this is the first convergence rate and complexity result for a momentum accelerated algorithm solving regularized nonsmooth and nonconvex bi-level optimization problems. We note that another momentum accelerated bi-level optimization algorithm has been studied in (Ji and Liang, 2021), which only applies to unregularized (strongly) convex bi-level optimization problems.

5 Convergence rates under local nonconvex geometry

In the previous section, we have proved that the optimization trajectory generated by proximal BiO-AIDm approaches a compact set of critical points. Hence, we are further motivated to exploit the local function geometry around the critical points to study its local (asymptotic) convergence guarantees, which is the focus of this section. In particular, we consider a broad class of Łojasiewicz-type geometry of nonconvex functions.

5.1 Local Kurdyka–Łojasiewicz geometry

General nonconvex functions typically do not have a favorable global geometry. However, they may possess certain local geometry around the critical points that determines the local convergence rate of optimization algorithms. In particular, the Kurdyka-Łojasiewicz (KŁ) geometry characterizes a broad spectrum of local geometries of nonconvex functions (Bolte et al., 2007, 2014), and it generalizes various conventional global geometries such as strong convexity and the Polyak-Łojasiewicz geometry. Next, we formally introduce the KŁ geometry.

Definition 2

(KŁ geometry, Bolte et al. (2014)) A proper and lower semi-continuous function F is said to have the KŁ geometry if for every compact set \(\Omega \subset \textrm{dom}F\) on which F takes a constant value \(F_\Omega \in \mathbb {R}\), there exist \(\varepsilon , \lambda >0\) such that for all \(\bar{x} \in \Omega\) and all \(x\in \{z\in \mathbb {R}^m: \textrm{dist}_\Omega (z)<\varepsilon , F_\Omega< F(z) <F_\Omega + \lambda \}\), the following condition holds:

$$\begin{aligned} \varphi ' \left( F(x) - F_\Omega \right) \cdot \textrm{dist}_{\partial F(x)}(\textbf{0}) \ge 1, \end{aligned}$$
(5)

where \(\varphi '\) is the derivative of \(\varphi : [0,\lambda ) \rightarrow \mathbb {R}_+\) that takes the form \(\varphi (t) = \frac{c}{\theta } t^\theta\) for certain constant \(c>0\) and KŁ parameter \(\theta \in (0,1]\), and \(\textrm{dist}_{\partial F(x)}(\textbf{0}) = \min _{u\in \partial F(x)} \Vert u-\textbf{0}\Vert\) denotes the point-to-set distance.

As an intuitive explanation, when the function F is differentiable, the KŁ inequality in (5) can be rewritten as \(F(x)-F_{\Omega } \le \mathcal {O}(\Vert \nabla F(x)\Vert ^{\frac{1}{1-\theta }})\), which can be viewed as a type of local gradient dominance condition and generalizes the Polyak-Łojasiewicz (PŁ) condition (with parameter \(\theta =\frac{1}{2}\)) (Łojasiewicz, 1963; Karimi et al., 2016). In the existing literature, a large class of functions has been shown to have the local KŁ geometry, e.g., sub-analytic functions, logarithm and exponential functions, and semi-algebraic functions (Bolte et al., 2014). Moreover, the KŁ geometry has been exploited to establish the convergence of many gradient-based algorithms in nonconvex optimization, e.g., gradient descent (Attouch & Bolte, 2009; Li et al., 2017), the accelerated gradient method (Zhou et al., 2020), alternating minimization (Bolte et al., 2014) and distributed gradient methods (Zhou et al., 2016).
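As a standard worked example (our own, for illustration), every \(\mu\)-strongly convex differentiable function F satisfies the KŁ property with \(\theta =\frac{1}{2}\): minimizing both sides of the strong convexity inequality over y gives

$$\begin{aligned} F^* \ge F(x) - \frac{1}{2\mu }\Vert \nabla F(x)\Vert ^2 \quad \Longrightarrow \quad F(x)-F^* \le \frac{1}{2\mu }\Vert \nabla F(x)\Vert ^2, \end{aligned}$$

which is exactly (5) with \(\theta =\frac{1}{2}\) and \(c=1/\sqrt{2\mu }\).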

5.2 Convergence rates of proximal BiO-AIDm under KŁ geometry

In this subsection, we obtain the following asymptotic function value convergence rates of the proximal BiO-AIDm algorithm under different parameter ranges of the KŁ geometry. Throughout, we define \(k_0\in \mathbb {N}^+\) to be a sufficiently large integer.

Theorem 3

Let Assumptions 1 and 2 hold and assume that the potential function H defined in (2) has the KŁ geometry. Then, under the same choices of hyper-parameters as those of Proposition 1, the potential function value sequence \(\{H(x_k,y_k)\}_k\) converges to its limit \(H^*\) (see its definition in Theorem 2) at the following rates.

  1. If the KŁ geometry holds with \(\theta \in \big (\frac{1}{2},1\big )\), then \(H(x_k,y_k)\downarrow H^*\) super-linearly as

     $$\begin{aligned} H(x_k,y_k)-H^*\le {\mathcal {O}}\Big (\exp \Big (-\Big (\frac{1}{2(1-\theta )}\Big )^{k-k_0}\Big )\Big ), \quad \forall k\ge k_0; \end{aligned}$$
     (6)

  2. If the KŁ geometry holds with \(\theta =\frac{1}{2}\), then \(H(x_k,y_k)\downarrow H^*\) linearly as (for some constant \(C>0\))

     $$\begin{aligned} H(x_k,y_k)-H^*\le {\mathcal {O}}\big ((1+C)^{-(k-k_0)}\big ),\quad \forall k\ge k_0; \end{aligned}$$
     (7)

  3. If the KŁ geometry holds with \(\theta \in \big (0,\frac{1}{2}\big )\), then \(H(x_k,y_k)\downarrow H^*\) sub-linearly as

     $$\begin{aligned} H(x_k,y_k)-H^*\le {\mathcal {O}}\big ((k-k_0)^{-\frac{1}{1-2\theta }}\big ),\quad \forall k\ge k_0. \end{aligned}$$
     (8)

Intuitively, a larger KŁ parameter \(\theta\) implies that the local geometry of the potential function H is sharper, which leads to an orderwise faster convergence rate as shown in Theorem 3. In particular, when the KŁ geometry holds with \(\theta =\frac{1}{2}\), the proximal BiO-AIDm algorithm converges at a linear rate, which matches the convergence rate of bi-level optimization under the stronger geometry where both the upper- and lower-level objective functions are strongly convex (Ghadimi and Wang, 2018). To the best of our knowledge, the above result provides the first function value convergence rate analysis of proximal BiO-AIDm in the full spectrum of the nonconvex local KŁ geometry. The proof of Theorem 3 involves two novel techniques. The first is to establish the monotonically decreasing potential function \(H(x,y')\) in Proposition 1. This monotonic decreasing property guarantees that the sequence \(\{(x_k,y_k)\}_k\) generated by Algorithm 1 enters the neighborhood of a critical point \((x^*,y^*(x^*))\) where the KŁ property of H holds, which is essential for proving the function value convergence rates in Theorem 3. The other technical novelty is to prove the Lipschitz property of the mapping \(y^{(T)}(x,y)\) in Lemma 2, which is the key to establishing the asymptotic convergence rates in Theorem 3. To the best of our knowledge, this Lipschitz property has not been established in the existing literature for the mapping defined by Nesterov's accelerated gradient descent steps, and it is challenging to prove due to the momentum acceleration. To address this challenge, we recursively write \(y^{(t)}\) as \(y^{(t)}(x,y)=(1+\eta )G_x(y^{(t-1)}(x,y))-\eta G_x(y^{(t-2)}(x,y))\), where \(G_x(y):=y-\alpha \nabla _y g(x,y)\) is the gradient descent mapping. We then leverage the Lipschitz property of \(G_x\) (Hardt et al., 2016) to establish the Lipschitz property of \(y^{(t)}\) via induction on t.
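To see how the induction works, here is a simplified sketch (ours; the precise constants are tracked in Lemma 2). Suppose \(G_x(\cdot )\) is \(L_G\)-Lipschitz and let \(c_t\) denote a Lipschitz constant of \(y^{(t)}(x,\cdot )\). The recursion then gives

$$\begin{aligned} c_t \le (1+\eta )L_G\, c_{t-1} + \eta L_G\, c_{t-2} \le (1+2\eta )L_G \max (c_{t-1}, c_{t-2}), \end{aligned}$$

so with \(C:=\max \big (1, (1+2\eta )L_G\big )\) and \(c_0=1\), induction yields \(c_t\le C^t\) for all \(t\le T\), i.e., \(y^{(T)}(x,\cdot )\) is Lipschitz with a constant depending only on T, \(\eta\) and \(L_G\).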

6 Experiment

We apply our bi-level optimization algorithm to solve a regularized data cleaning problem (Shaban et al., 2019) with the MNIST dataset (LeCun et al., 1998) and a linear classification model. We generate a training dataset \(\mathcal {D}_{\text {tr}}\) with 20k samples, a validation dataset \(\mathcal {D}_{\text {val}}\) with 20k samples, and a test dataset with 10k samples. In particular, we corrupt the training data by randomizing a proportion \(p\in (0,1)\) of their labels, and the goal of this application is to identify and avoid using these corrupted training samples. The corresponding bi-level problem is written as follows.

$$\begin{aligned}&\min _{\lambda } \frac{1}{|\mathcal {D}_{\text {val}}|} \sum _{(x_i,y_i)\in \mathcal {D}_{\text {val}}} L\big (w^*(\lambda )^{\top }x_i,y_i\big ) - \frac{\gamma }{|\mathcal {D}_{\text {tr}}|}\sum _{(x_i,y_i)\in \mathcal {D}_{\text {tr}}}\min (|\lambda _i|,a), \nonumber \\&\textrm{where}~~w^*(\lambda )=\mathop {\arg \min }_w\frac{1}{|\mathcal {D}_{\text {tr}}|}\sum _{(x_i,y_i)\in \mathcal {D}_{\text {tr}}} \sigma (\lambda _i)L\big (w^{\top }x_i,y_i\big )+\rho \Vert w\Vert ^2, \end{aligned}$$
(9)

where \(x_i, y_i\) denote the data and label of the i-th sample, respectively, \(\sigma (\cdot )\) denotes the sigmoid function, L is the cross-entropy loss, and \(\rho ,\gamma >0\) are regularization hyperparameters. The regularizer \(\rho \Vert w\Vert ^2\) makes the lower-level objective function strongly convex. In particular, we add the nonconvex and nonsmooth regularizer \(-\gamma \min (|\lambda _i |,a)\) to the upper-level objective function (see Appendix G for the analytical solution to its proximal mapping). Intuitively, it encourages \(|\lambda _i|\) to approach the large positive constant a, so that the training sample coefficient \(\sigma (\lambda _i)\) is close to either 0 or 1 for corrupted and clean training samples, respectively. In this experiment we set \(a=20\). Therefore, such a regularized bi-level data cleaning problem belongs to the problem class considered in this paper.
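The analytical form of this proximal mapping is derived in Appendix G; as an independent illustration (our own sketch, not the appendix derivation), the prox can also be computed entrywise by exploiting the piecewise-quadratic structure of the objective and comparing finitely many closed-form candidates:

```python
import numpy as np

def prox_neg_min_abs(v, step, gamma, a):
    """Entrywise prox of h(u) = -gamma * min(|u|, a): for each entry vi,
    argmin_u h(u) + (u - vi)**2 / (2 * step).  The objective is piecewise
    quadratic, so the minimizer lies among a few closed-form candidates."""
    def prox_scalar(vi):
        obj = lambda u: -gamma * min(abs(u), a) + (u - vi) ** 2 / (2.0 * step)
        cands = [0.0, a, -a, vi,                        # kinks and outer pieces
                 min(max(vi + step * gamma, 0.0), a),   # piece u in [0, a]
                 min(max(vi - step * gamma, -a), 0.0)]  # piece u in [-a, 0]
        return min(cands, key=obj)
    return np.vectorize(prox_scalar)(np.asarray(v, dtype=float))
```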

We compare the performance of our proximal BiO-AIDm with several bi-level optimization algorithms, including proximal BiO-AID (without accelerated AID), BiO-AID (without accelerated AID) and BiO-AIDm (with accelerated AID). In particular, for BiO-AID and BiO-AIDm, we apply them to solve the unregularized data cleaning problem (i.e., \(\gamma =0\)). This serves as a baseline that helps understand the impact of regularization on the test performance. In addition, we also implement all these algorithms with the AID scheme replaced by the ITD scheme to demonstrate their generality.

Fig. 1 Comparison of bi-level optimization algorithms under data corruption rate \(p=0.1\) (top row), \(p=0.2\) (middle row) and \(p=0.4\) (bottom row). The proximal algorithms in the right two columns correspond to \(\gamma =5\). The y-axis corresponds to the upper-level objective function value, and the x-axis corresponds to the overall computation complexity (number of inner gradient descent steps)

Hyperparameter setup We consider corruption rates \(p=0.1, 0.2, 0.4\), regularization parameters \({\frac{\gamma }{20000}=0,0.001,0.1,100}\) and \(\rho =10^{-3}\). We run each algorithm for \(K=50\) outer iterations with stepsize \(\beta =0.5\) and \(T=5\) inner gradient steps with stepsize \(\alpha =0.1\). For the algorithms with momentum accelerated AID/ITD, we set the momentum parameter \(\eta =1.0\).

Table 2 Comparison of test accuracy (test loss). (The regularizer coefficient \({\frac{\gamma }{20000}}=0\) corresponds to the four non-proximal algorithms BiO-AID(m) and BiO-ITD(m), and \({\frac{\gamma }{20000}=0.001,0.1,100}\) correspond to the proximal variants of the four algorithms. The best test accuracy and test loss for each corruption rate p are bolded)

6.1 Optimization performance

We first investigate the effect of momentum acceleration on the optimization performance. In Fig. 1, we plot the upper-level objective function value versus the computation complexity for different bi-level algorithms under \({\frac{\gamma }{20000}=0.001}\) and different data corruption rates. In these figures, we separately compare the non-proximal algorithms and the proximal algorithms, as their upper-level objective functions are different (the non-proximal algorithms solve the unregularized bi-level problem). It can be seen that all the bi-level optimization algorithms with momentum accelerated AID/ITD schemes consistently converge faster than their unaccelerated counterparts. The reason is that the momentum scheme accelerates the convergence of the inner gradient descent steps, which yields a more accurate implicit gradient and thus accelerates the convergence of the outer iterations. In addition, since both axes are plotted on a log scale, all the curves decrease almost in straight lines, which matches the polynomial dependence of our computation complexity \(\mathcal {O}(\kappa ^{3.5}(\ln \kappa )\epsilon ^{-2})\) on \(\epsilon\) (see Corollary 1).

6.2 Test performance

To understand the impact of momentum and the nonconvex regularization on the test performance of the model, we report the test accuracy and test loss of the models trained by all the algorithms in Table 2. It can be seen that the bi-level optimization algorithms with momentum accelerated AID/ITD (columns 2 & 4 of Table 2) achieve significantly better test performance than their unaccelerated counterparts (columns 1 & 3 of Table 2). This demonstrates the advantage of introducing momentum to accelerate the AID/ITD schemes. Furthermore, we observe that the test loss first decreases and then increases as the regularizer coefficient \(\gamma\) increases. Therefore, adding such a regularizer with a properly chosen coefficient \(\gamma\) improves the test performance by separating the sample coefficients \(\sigma (\lambda _i)\) of corrupted and clean training samples. In particular, proximal BiO-ITDm with \({\frac{\gamma }{20000}=0.1}\) achieves the best test performance under each corruption rate p (bolded in Table 2), which again demonstrates the advantage of the regularizer and momentum acceleration. Lastly, a larger corruption rate p leads to lower test performance, which is reasonable.

7 Conclusion

In this paper, we provided a comprehensive analysis of the proximal BiO-AIDm algorithm with momentum acceleration for solving regularized nonconvex and nonsmooth bi-level optimization problems. Our key finding is that this algorithm admits an intrinsic monotonically decreasing potential function, which fully tracks the bi-level optimization progress. Based on this result, we established the first global convergence rate of proximal BiO-AIDm to a critical point in regularized nonconvex optimization, which is faster than that of BiO-AID. We also characterized the asymptotic convergence behavior and rates of the algorithm under the local KŁ geometry. We anticipate that this new analysis framework can be extended to study the convergence of other bi-level optimization algorithms, including stochastic bi-level optimization. In particular, it would be interesting to explore how bi-level optimization algorithm design affects the form of the potential function and leads to different convergence guarantees and rates in nonconvex bi-level optimization.