1 Introduction

Bi-level optimization has become an important and popular optimization framework that covers a variety of emerging machine learning applications, e.g., meta-learning (Franceschi et al., 2018; Bertinetto et al., 2018; Rajeswaran et al., 2019; Ji et al., 2020), hyperparameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Feurer & Hutter, 2019), reinforcement learning (Konda and Tsitsiklis, 2000; Hong et al., 2020), etc. A standard formulation of bi-level optimization takes the following form:

$$\begin{aligned} \min _{x\in \mathbb {R}^d} f(x, y^*(x)), \quad \text{ where } \quad y^*(x)\in \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} g(x,y), \end{aligned}$$

where the upper- and lower-level objective functions f and g are both jointly continuously differentiable. To elaborate, bi-level optimization aims to minimize the upper-level compositional objective function \(f(x,y^*(x))\), in which \(y^*(x)\) is the minimizer of the lower-level objective function g(x, y).
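As a simple illustrative instance (a standard textbook example in our own notation, not taken from the works cited above), tuning a ridge-regression weight \(x>0\) on a validation set is a bi-level problem:

$$\begin{aligned} \min _{x>0} \sum _{(a_i,b_i)\in \mathcal {D}_{\text {val}}} \big (a_i^\top y^*(x)-b_i\big )^2, \quad \text{ where } \quad y^*(x)=\mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} \sum _{(a_i,b_i)\in \mathcal {D}_{\text {tr}}} \big (a_i^\top y-b_i\big )^2 + x\Vert y\Vert ^2, \end{aligned}$$

where the upper level measures validation error and the strongly convex lower level fits the model on the training data \(\mathcal {D}_{\text {tr}}\).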

Solving the above bi-level optimization problem is highly non-trivial as it involves two nested minimization problems. In the existing literature, many algorithms have been developed for bi-level optimization. Early works (Hansen et al., 1992; Shi et al., 2005; Moore, 2010) reformulated the bi-level problem into a single-level problem with constraints on the optimality conditions of the lower-level problem, yet this reformulation involves a large number of constraints that are hard to handle in practice. More recently, gradient-based bi-level optimization algorithms have been developed, which leverage either the approximate implicit differentiation (AID) scheme (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Liao et al., 2018; Ghadimi and Wang, 2018; Grazzi et al., 2020; Lorraine et al., 2020) or the iterative differentiation (ITD) scheme (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017, 2018; Shaban et al., 2019; Grazzi et al., 2020) to estimate the gradient of the upper-level function. In particular, the AID scheme is more popular due to its simplicity and computation efficiency. Specifically, the bi-level optimization algorithm with AID (referred to as BiO-AID) has been analyzed for (strongly) convex upper- and lower-level functions (Liu et al., 2020), which does not cover bi-level problems in modern machine learning that usually involve nonconvex upper-level objective functions. On the other hand, recent studies have analyzed the convergence of BiO-AID with a nonconvex upper-level function and a strongly convex lower-level function, and established the convergence of a certain type of gradient norm to zero (Ji et al., 2021; Ghadimi and Wang, 2018; Hong et al., 2020).

However, the existing gradient-based nonconvex bi-level optimization algorithms have limitations in several respects. First, most existing algorithms are not applicable to bi-level problems whose upper-level function involves possibly nonsmooth and nonconvex regularizers, while Huang and Huang (2021) covers only convex regularizers. For example, in the application of data hyper-cleaning, one can improve the learning performance by adding a nonsmooth and nonconvex regularizer that pushes the weights of the clean samples towards 1 while pushing those of the contaminated samples towards 0 (see Sect. 6 for more details). Second, the convergence guarantees of these algorithms typically ensure only a weak gradient-norm convergence (except (Dagréou et al., 2022), which requires a strong global PŁ geometry assumption on both \(f(x,\cdot )\) and \(f(\cdot ,y)\)), which does not necessarily imply the desired convergence of the model parameters. Furthermore, these algorithms suffer from a high computation complexity in nonconvex bi-level optimization. The overarching goal of this work is to develop an efficient and convergent proximal-type algorithm for solving regularized nonconvex and nonsmooth bi-level optimization problems and thereby address the above important issues. We summarize our contributions as follows.

1.1 Our contributions

We propose a proximal BiO-AIDm algorithm (see Algorithm 1) and study its convergence properties. This algorithm is a proximal and momentum-accelerated variant of the BiO-AID algorithm for solving the following class of regularized nonsmooth and nonconvex bi-level optimization problems:

$$\begin{aligned}&\min _{x\in \mathbb {R}^d} f(x, y^*(x)) + h(x), \quad \text{ where } \quad y^*(x)= \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} g(x,y), \end{aligned}$$

where the upper-level objective function f is nonconvex, the lower-level objective function g is strongly convex for any fixed x, and the regularizer h is possibly nonsmooth and nonconvex. In particular, our algorithm applies Nesterov's momentum to accelerate the computation of the implicit gradient involved in the AID scheme.

We first analyze the global (non-asymptotic) convergence properties of proximal BiO-AIDm under standard Lipschitz and smoothness assumptions on the objective functions. The key to our analysis is to show that proximal BiO-AID admits an intrinsic potential function \(H(x_k,y_k)\) that takes the form

$$\begin{aligned} H(x,y'):= \Phi (x) +h(x) +\frac{7}{8}\Vert y^{(T)}(x,y')-y^*(x)\Vert ^2, \end{aligned}$$

where \(y^{(T)}(x,y')\) is obtained by applying Nesterov's accelerated gradient descent to minimize \(g(x,\cdot )\) with initial point \(y'\) for T iterations. In particular, we prove that such a potential function is monotonically decreasing along the optimization trajectory, i.e., \(H(x_{k+1},y_{k+1}) < H(x_k,y_k)\), which implies that proximal BiO-AIDm can be viewed as a descent-type algorithm and is numerically stable. Based on this property, we formally prove that every limit point of the model parameter trajectory \(\{x_k\}_k\) generated by proximal BiO-AIDm is a critical point of the regularized bi-level problem. Furthermore, when the regularizer is convex, we show that proximal BiO-AIDm requires a computation complexity of \(\widetilde{\mathcal {O}}(\kappa ^{3.5}\epsilon ^{-2})\) (number of gradient, Hessian-vector product and proximal evaluations) for achieving a critical point x that satisfies \(\Vert G(x)\Vert \le \epsilon\), where \(\kappa\) denotes the problem condition number and G(x) denotes the proximal gradient mapping. As shown in Table 1, this is the first global convergence and complexity result of proximal BiO-AIDm in regularized nonsmooth and nonconvex bi-level optimization, and it improves the state-of-the-art complexity of BiO-AID (for smooth nonconvex bi-level optimization) by a factor of \({\widetilde{\mathcal {O}}}(\sqrt{\kappa })\).

Besides investigating the global convergence properties, we further establish the asymptotic function value convergence rates of proximal BiO-AIDm under a local Łojasiewicz-type nonconvex geometry, which covers a broad spectrum of local nonconvex geometries. Specifically, we characterize the asymptotic convergence rates of proximal BiO-AIDm in the full spectrum of the Łojasiewicz geometry parameter \(\theta\). We prove that as the local geometry becomes sharper (i.e., with a larger \(\theta\)), the asymptotic convergence rate of proximal BiO-AIDm improves from sublinear to superlinear convergence. The proof of these local asymptotic convergence rates requires establishing two properties that, to our knowledge, have not been proved in the existing literature. The major property is that the aforementioned potential function H is monotonically decreasing. The other property is the Lipschitz continuity of \(y^{(T)}\), which is challenging to prove due to the momentum acceleration.

Table 1 List of existing complexity results for bi-level algorithms. (\(\checkmark\) in the columns “non-smooth” and “momentum accelerated” respectively means the objective function is non-smooth and the algorithm has momentum acceleration, and \(\times\) means the opposite)

1.2 Related work

Bi-level Optimization Algorithms Bi-level optimization has been studied for decades (Bracken & McGill, 1973), and various types of bi-level algorithms have been proposed, including but not limited to single-level penalized methods (Shi et al., 2005; Moore, 2010) and gradient-based methods via AID- or ITD-based hypergradient estimation (Domke, 2012; Pedregosa, 2016; Franceschi et al., 2018; Ghadimi and Wang, 2018; Hong et al., 2020; Liu et al., 2020; Li et al., 2020; Grazzi et al., 2020; Ji et al., 2021; Lorraine et al., 2020; Ji and Liang, 2021). Huang and Huang (2021) proposed a Bregman distance-based method. In particular, (Ghadimi and Wang, 2018; Hong et al., 2020; Ji et al., 2021; Yang et al., 2021; Chen et al., 2021a; Guo and Yang, 2021; Huang and Huang, 2021) provided complexity analyses of their proposed methods for bi-level optimization problems under different types of loss geometries. Ji and Liang (2021) studied the lower complexity bounds for bi-level optimization under (strongly) convex geometry and proposed a nearly-optimal accelerated algorithm. All the existing analyses of nonconvex bi-level optimization algorithms either focus on gradient norm convergence or require a strong global PŁ geometry assumption on both \(f(x,\cdot )\) and \(f(\cdot ,y)\) (Dagréou et al., 2022). In this paper, we formally establish the parameter and function value convergence of proximal BiO-AIDm in regularized nonconvex and nonsmooth bi-level optimization.

Applications of Bi-level Optimization Bi-level optimization has been widely applied to meta-learning (Snell et al., 2017; Franceschi et al., 2018; Rajeswaran et al., 2019; Zügner and Günnemann, 2019; Ji et al., 2020; Ji, 2021), hyperparameter optimization (Franceschi et al., 2017; Shaban et al., 2019), reinforcement learning (Konda and Tsitsiklis, 2000; Hong et al., 2020), and data poisoning (Mehra et al., 2020). For example, Snell et al. (2017) reformulated the meta-learning objective function under a shared embedding model into a bi-level optimization problem. Rajeswaran et al. (2019) proposed a bi-level optimizer named iMAML as an efficient variant of model-agnostic meta-learning (MAML) (Finn et al., 2017), and analyzed the convergence of iMAML under a strongly convex inner-loop loss. Fallah et al. (2020) characterized the convergence of MAML and first-order MAML under nonconvex loss functions. Ji et al. (2020) studied the convergence behaviors of almost no inner loop (ANIL) (Raghu et al., 2019) under different inner-loop loss geometries of the MAML objective function. Recently, Mehra et al. (2020) devised bi-level-optimization-based data poisoning attacks on certifiably robust classifiers.

Nonconvex Kurdyka-Łojasiewicz Geometry A broad class of regular functions has been shown to satisfy the local nonconvex KŁ geometry (Bolte et al., 2007), which affects the asymptotic convergence rates of gradient-based optimization algorithms. The KŁ geometry has been exploited to study the convergence of various first-order algorithms for solving minimization problems, including gradient descent (Attouch & Bolte, 2009), alternating gradient descent (Bolte et al., 2014), distributed gradient descent (Zhou et al., 2016; Zhou et al., 2018), and accelerated gradient descent (Li et al., 2017). It has also been exploited to study the convergence of second-order algorithms such as Newton's method (Noll and Rondepierre, 2013; Frankel et al., 2015) and the cubic regularization method (Zhou et al., 2018).

2 Problem formulation and preliminaries

In this paper, we consider the following regularized nonconvex bi-level optimization problem:

$$\begin{aligned}&\min _{x\in \mathbb {R}^d} f(x, y^*(x)) + h(x), \quad \text{ where } \quad y^*(x)= \mathop {\mathrm {arg\,min}}\limits _{y\in \mathbb {R}^{p}} g(x,y), \qquad \mathrm {(P)} \end{aligned}$$

where both the upper-level objective function f and the lower-level objective function g are jointly continuously differentiable, and the regularizer h is possibly nonsmooth and nonconvex. We note that adding a regularizer to the bi-level optimization problem allows us to impose desired structures on the solution, which is important for many machine learning applications. For example, in the application of data hyper-cleaning (see the experiment in Sect. 6 for more details), one aims to improve the learning performance by adding a regularizer that pushes the weights of the clean samples towards 1 while pushing the weights of the contaminated samples towards 0. Such a regularizer often takes a nonsmooth and nonconvex form.

To simplify the notation, throughout the paper we define the function \(\Phi (x):= f(x, y^*(x))\). We also adopt the following standard assumptions regarding the regularized bi-level optimization problem (P).

Assumption 1

The functions in the regularized bi-level optimization problem (P) satisfy:

  1. Function \(g(x,\cdot )\) is \(\mu\)-strongly convex for all x and function \(\Phi (x)=f(x,y^*(x))\) is nonconvex;

  2. Function h is proper and lower-semicontinuous (possibly nonsmooth and nonconvex);

  3. Function \((\Phi +h)(x)\) is bounded below and has bounded sub-level sets.

In Assumption 1, the regularizer h can be any nonsmooth and nonconvex function so long as it is a closed function. This covers most of the regularizers used in practice, including any proper convex function (possibly nonsmooth, e.g., the \(\ell _1\) norm), the \(\ell _p\) norm with \(p>0\) (possibly nonconvex and nonsmooth), the \(\ell _0\) regularizer \(h(x)=\lambda \Vert x\Vert _0\) (\(\lambda >0\), where \(\Vert x\Vert _0\) denotes the number of nonzero entries of the vector x), low-rank regularizers (Yao et al., 2015), and the regularizer \(-\gamma \min (|\lambda _i |,a)\) used in our experiment (see Sect. 6 for details). In addition to Assumption 1, we also impose the following Lipschitz continuity and smoothness conditions on the objective functions, which are widely adopted in the existing literature (Ghadimi and Wang, 2018; Ji et al., 2020). In the following assumption, we denote \(z:=(x,y)\).
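To illustrate that such nonconvex regularizers still admit tractable proximal mappings (formally defined in Sect. 3), the following minimal sketch implements the well-known hard-thresholding proximal mapping of \(h(x)=\lambda \Vert x\Vert _0\); the function name and the numerical check are our own.

```python
import numpy as np

def prox_l0(v, lam):
    """Proximal mapping of h(x) = lam * ||x||_0 (hard thresholding).

    Per coordinate, keeping u = v costs lam while setting u = 0 costs
    v**2 / 2, so v is kept only when |v| > sqrt(2 * lam).
    """
    u = v.copy()
    u[np.abs(v) <= np.sqrt(2.0 * lam)] = 0.0
    return u

v = np.array([0.1, -2.0, 0.5, 3.0])
print(prox_l0(v, lam=0.5))  # [0. -2. 0. 3.]: entries below the threshold vanish
```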

Assumption 2

The functions f(z) and g(z) in the bi-level problem (P) satisfy:

  1. Function f(z) is M-Lipschitz. The gradients \(\nabla f(z)\) and \(\nabla g(z)\) are \(L_f\)-Lipschitz and \(L_g\)-Lipschitz, respectively.

  2. The Jacobian \(\nabla _x\nabla _y g(z)\) and the Hessian \(\nabla _y^2 g(z)\) are \(\tau\)-Lipschitz and \(\rho\)-Lipschitz, respectively.

Assumptions 1 and 2 imply that the mapping \(y^*(x)\) is \(\kappa _g\)-Lipschitz, where \(\kappa _g=L_g/\mu >1\) denotes the condition number of the lower-level function g. Similarly, we denote \(\kappa _f=L_f/\mu\) for the upper-level function f (Lin et al., 2020; Chen et al., 2021b). Note that \(\kappa _f\ge 1\) does not necessarily hold, and thus \(\kappa _f\) is not a condition number.

Lastly, note that the problem (P) can be rewritten as the regularized minimization problem \(\min _{x\in \mathbb {R}^d} \Phi (x)+h(x)\), which can be nonsmooth and nonconvex. Therefore, our optimization goal is to find a critical point \(x^*\) of the function \(\Phi (x)+h(x)\) that satisfies the optimality condition \(\textbf{0}\in \partial (\Phi + h)(x^*)\). Here, \(\partial\) denotes the following generalized notion of subdifferential.

Definition 1

(Subdifferential and critical point, (Rockafellar & Wets, 2009)) The Fréchet subdifferential \(\widehat{\partial }F\) of a function F at \(x\in \mathop {\textrm{dom}}F\) is the set of \(u\in \mathbb {R}^d\) defined as

$$\begin{aligned} \widehat{\partial }F(x) = \Big \{u: \liminf _{z\ne x, z\rightarrow x} \frac{F(z) - F(x) - u^\top (z-x)}{\Vert z-x\Vert } \ge 0 \Big \}, \end{aligned}$$

and the limiting subdifferential \(\partial F\) at \(x\in \text {dom}~F\) is the graphical closure of \(\widehat{\partial }F\) defined as:

$$\begin{aligned} \partial F(x) = \big \{u: \exists \, x_k \rightarrow x,~ F(x_k) \rightarrow F(x),~ u_k\in \widehat{\partial } F(x_k),~ u_k \rightarrow u \big \}. \end{aligned}$$

The set of critical points of F is defined as \(\{ x: \textbf{0}\in \partial F(x) \}\).
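As a standard worked example (ours, for illustration), consider \(F(x)=-|x|\) on \(\mathbb {R}\). At \(x=0\), no \(u\) satisfies the Fréchet inequality from both sides, so \(\widehat{\partial }F(0)=\emptyset\); taking sequences \(x_k\rightarrow 0\) from the right and from the left yields gradients \(-1\) and \(+1\), so the limiting subdifferential is \(\partial F(0)=\{-1,+1\}\). Since \(\textbf{0}\notin \partial F(0)\), the origin is not a critical point of \(-|x|\), even though it is a local maximum. In contrast, for the convex \(F(x)=|x|\) both notions coincide and \(\partial F(0)=[-1,1]\ni \textbf{0}\).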

3 Proximal bi-level optimization with AID

In this section, we introduce the proximal bi-level optimization algorithm with momentum accelerated approximate implicit differentiation (referred to as proximal BiO-AIDm). Recall that \(\Phi (x):= f(x, y^*(x))\). The main challenge for solving the regularized bi-level optimization problem (P) is the computation of the gradient \(\nabla \Phi (x)\), which involves higher-order derivatives of the lower-level function. Fortunately, this gradient can be effectively estimated using the popular AID scheme as we elaborate below.

First, it is shown in (Ji et al., 2021) that \(\nabla \Phi (x)\) takes the following analytical form.

$$\begin{aligned} \nabla \Phi (x_k) = \nabla _x f(x_k,y^*(x_k)) -\nabla _x \nabla _y g(x_k,y^*(x_k))\, v_k^*, \end{aligned}$$

where \(v_k^*\) corresponds to the solution of the linear system \(\nabla _y^2 g(x_k,y^*(x_k))v= \nabla _y f(x_k,y^*(x_k))\). This expression follows from the chain rule together with implicit differentiation of the lower-level optimality condition \(\nabla _y g(x,y^*(x))=\textbf{0}\), which yields \(\nabla _x y^*(x) = -\nabla _x \nabla _y g(x,y^*(x))\big (\nabla _y^2 g(x,y^*(x))\big )^{-1}\). In particular, \(y^*(x_k)\) is the minimizer of the strongly convex function \(g(x_k,\cdot )\), and it can be effectively approximated by running T Nesterov's accelerated gradient descent updates on \(g(x_k,\cdot )\) and taking the output \(y_{k+1}\) as the approximation. With this approximated minimizer, the AID scheme estimates the gradient \(\nabla \Phi (x_k)\) as follows:

$$\begin{aligned} \text {(AID):}\quad \widehat{\nabla }\Phi (x_k) = \nabla _x f(x_k,y_{k+1}) -\nabla _x \nabla _y g(x_k,y_{k+1})\,\widehat{v}_k^*, \end{aligned}$$
(1)

where \(\widehat{v}_k^*\) is the solution of the approximated linear system \(\nabla _y^2 g(x_k,y_{k+1}) v = \nabla _y f(x_k,y_{k+1})\), which can be efficiently solved by standard conjugate-gradient solvers. For simplicity of the discussion, we assume that \(\widehat{v}_k^*\) is exactly computed in the main body of the paper. In Appendix F, we discuss how to obtain an inexact solution to this linear system via conjugate-gradient solver, and provide a comprehensive analysis of the impact of such inexactness on the overall computation complexity of the proposed algorithm. Moreover, the Jacobian-vector product involved in (1) can be efficiently computed using the existing automatic differentiation packages (Domke, 2012; Grazzi et al., 2020).
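To make the scheme concrete, here is a minimal numpy sketch of the AID estimate (1); the function names and the matrix-free interface are our own illustrative choices, and in practice the Hessian- and Jacobian-vector products would be supplied by automatic differentiation.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=100, tol=1e-10):
    """Solve the symmetric positive definite system matvec(v) = b by CG."""
    v = np.zeros_like(b)
    r = b - matvec(v)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        v += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

def aid_hypergradient(grad_x_f, grad_y_f, hess_yy_g_mv, jac_xy_g_mv, x, y):
    """AID estimate (1): grad_x f(x, y) - jac_xy_g(x, y) @ v_hat, where
    v_hat solves hess_yy_g(x, y) @ v = grad_y f(x, y) via CG."""
    v_hat = conjugate_gradient(lambda u: hess_yy_g_mv(x, y, u), grad_y_f(x, y))
    return grad_x_f(x, y) - jac_xy_g_mv(x, y, v_hat)
```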

Based on the estimated gradient \(\widehat{\nabla } \Phi (x_k)\), we can then apply the standard proximal gradient algorithm (a.k.a. forward-backward splitting) (Lions & Mercier, 1979) to solve the regularized optimization problem (P). This algorithm is referred to as proximal BiO-AIDm and is summarized in Algorithm 1. Specifically, in each outer iteration k, we first run T accelerated gradient descent steps with Nesterov's momentum from the initial point \(y_k\) to minimize \(g(x_k,\cdot )\) and find an approximated minimizer \(y_{k+1}=y^{(T)}(x_k,y_k)\approx y^*(x_k)\), where the notation \(y^{(T)}(x_k,y_k)\) emphasizes the dependence on \(x_k\) and \(y_k\). Then, this approximated minimizer is utilized by the AID scheme to estimate \(\nabla \Phi (x_k)\). Finally, we apply a proximal gradient update to minimize the regularized objective function \(\Phi (x)+h(x)\). Here, the proximal mapping of any function h at v is defined as

$$\begin{aligned} \textrm{prox}_{h} (v):= \mathop {\mathrm {arg\,min}}\limits _{u\in \mathbb {R}^d} \Big \{h(u) + \frac{1}{2}\Vert u-v\Vert ^2\Big \}. \end{aligned}$$
Algorithm 1 Proximal BiO-AIDm (proximal bi-level optimization with momentum-accelerated AID); a simplified sketch follows.
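The following is a compact Python sketch of Algorithm 1 under our own simplifying assumptions: a callable-based interface, the initialization \(y^{(0)}=y^{(-1)}=y_k\) for the momentum recursion, and a hypergrad callable standing for the AID estimate (1) (e.g., the aid_hypergradient sketch above). The inner loop implements the update \(y^{(t)}=(1+\eta )G_x(y^{(t-1)})-\eta G_x(y^{(t-2)})\) with \(G_x(y)=y-\alpha \nabla _y g(x,y)\) used in our analysis (see Sect. 5).

```python
import numpy as np

def inner_agd(grad_y_g, x, y_init, alpha, eta, T):
    """T accelerated gradient steps on the strongly convex g(x, .), using
    y_t = (1 + eta) * G(y_{t-1}) - eta * G(y_{t-2}), G(y) = y - alpha * grad."""
    G = lambda y: y - alpha * grad_y_g(x, y)
    y_prev, y = y_init, y_init          # assume y^{(0)} = y^{(-1)} = y_init
    for _ in range(T):
        y_prev, y = y, (1.0 + eta) * G(y) - eta * G(y_prev)
    return y

def proximal_bio_aidm(x0, y0, grad_y_g, hypergrad, prox_h, alpha, beta, eta, T, K):
    """Outer loop: accelerated inner updates, AID hypergradient, proximal step.
    prox_h(v, beta) should return prox_{beta * h}(v)."""
    x, y = x0, y0
    for _ in range(K):
        y = inner_agd(grad_y_g, x, y, alpha, eta, T)   # y_{k+1} ~ y^*(x_k)
        x = prox_h(x - beta * hypergrad(x, y), beta)   # proximal gradient step
    return x, y
```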

Under Assumptions 1 and 2, the following lemma characterizes the smoothness of \(\Phi\) and the estimation error \(\Vert \widehat{\nabla }\Phi (x_k)- \nabla \Phi (x_k)\Vert\) of the AID scheme.

Lemma 1

Let Assumptions 1.1 and 2 hold. Then, the function \(\Phi\) is differentiable and its gradient \(\nabla \Phi\) is \(L_\Phi\)-Lipschitz with \(L_{\Phi } = L_f + \frac{2L_fL_g+\tau M^2}{\mu } + \frac{\rho L_g M+L_fL_g^2+\tau M L_g}{\mu ^2} + \frac{\rho L_g^2 M}{\mu ^3}\) (Lemma 2 of Ji (2021)). Moreover, the gradient estimate obtained by the AID scheme satisfies (Lemma 2.2a of Ghadimi and Wang (2018))

$$\begin{aligned} \Vert \widehat{\nabla }\Phi (x_k)- \nabla \Phi (x_k)\Vert ^2 \le \Gamma \Vert y_{k+1} - y^*(x_k)\Vert ^2, \end{aligned}$$

where \(\Gamma =4L_f^2+\frac{4\tau ^2 M^2}{\mu ^2} + \frac{4M^2\rho ^2\kappa _g^2}{\mu ^2}+4L_f^2\kappa _g^2\).

4 Global convergence and complexity of proximal BiO-AIDm

In this section, we study the global convergence properties of proximal BiO-AIDm for general regularized nonconvex and nonsmooth bi-level optimization.

First, note that the main update of proximal BiO-AIDm in Algorithm 1 follows the proximal gradient algorithm, which has been proven to generate an optimization trajectory that converges to a critical point in general nonconvex optimization (Attouch & Bolte, 2009). Hence, one may expect that proximal BiO-AIDm should share the same convergence guarantee. However, this is not obvious, as the proof of convergence of the proximal gradient algorithm heavily relies on the fact that it is a descent-type algorithm, i.e., the objective function strictly decreases over the iterations. In contrast, the main update of proximal BiO-AIDm applies an approximated gradient \(\widehat{\nabla }\Phi (x_k)\), which is coupled with both the upper- and lower-level objective functions through the AID scheme; this destroys the descent property of the proximal gradient update and hence invalidates the standard proof of convergence.

The following key result proves that proximal BiO-AIDm does admit an intrinsic potential function that is monotonically decreasing over the iterations. Therefore, it is indeed a descent-type algorithm, which is the first step toward establishing the global convergence.

Proposition 1

Let Assumptions 1 and 2 hold and define the potential function

$$\begin{aligned} H(x,y'):= \Phi (x) +h(x) +\frac{7}{8}\Vert y^{(T)}(x,y')-y^*(x)\Vert ^2. \end{aligned}$$
(2)

Choose hyperparameters \(\alpha =\frac{1}{L_g}\), \(\beta \le \frac{1}{2}(L_{\Phi }+\Gamma +\kappa _g^2)^{-1}\), \(\eta =\frac{\sqrt{\kappa _g}-1}{\sqrt{\kappa _g}+1}\) and \(T\ge \frac{\ln (8(1+\kappa _g))}{\ln ((1-\kappa _g^{-0.5})^{-1})}\). Then, the parameter sequence \(\{x_k\}_k\) generated by Algorithm 1 satisfies, for all \(k=1,2,...,\)

$$\begin{aligned} H(x_{k+1}, y_{k+1}) \le&H(x_k,y_k)- \frac{1}{4\beta } \Vert x_{k+1}-x_k\Vert ^2\\&- \frac{1}{8} \Big (\Vert y_{k+1}-y^*(x_k)\Vert ^2 + \Vert y_{k+2}-y^*(x_{k+1})\Vert ^2 \Big ). \end{aligned}$$

To elaborate, the potential function H consists of two components: the upper-level objective function \(\Phi (x)+h(x)\) and a regularization term \(\Vert y^{(T)}(x,y')-y^*(x)\Vert ^2\) that tracks the optimality gap of the lower-level optimization. Hence, the potential function H fully characterizes the optimization goal of the entire bi-level optimization. Intuitively, if \(\{x_k\}_k\) converges to a certain critical point \(x^*\) and \(y^{(T)}(x_k,y_k)\) converges to \(y^*(x^*)\), then \(H(x_k,y_k)\) converges to the local optimum \((\Phi +h)(x^*)\). Finding such a function is not straightforward, since the coefficient \(\frac{7}{8}\) in the potential function (2) has to be carefully selected to guarantee the monotonic decreasing property (see the proof of Proposition 1 in Appendix A for details of the coefficient selection).

Based on the above characterization of the potential function, we obtain the following global convergence result of proximal BiO-AIDm in general regularized nonconvex optimization.

Theorem 2

Under the same conditions as those of Proposition 1, the parameter sequence \(\{x_k, y_k\}_k\) generated by Algorithm 1 satisfies the following properties.

  1. \(\Vert x_{k+1} - x_k \Vert \overset{k}{\rightarrow }\ 0\), \(\Vert y_{k+1} - y^*(x_k)\Vert \overset{k}{\rightarrow }\ 0\);

  2. The function value sequence \(\{(\Phi +h)(x_k)\}_k\) converges to a finite limit \(H^*>-\infty\);

  3. The sequence \(\{(x_k,y_k)\}_k\) is bounded and has a compact set of limit points. Moreover, \((\Phi +h)(x^*)\equiv H^*\) for any limit point \(x^*\) of \(\{x_k\}_k\);

  4. Every limit point \(x^*\) of \(\{x_k\}_k\) is a critical point of the upper-level function \((\Phi +h)(x)\).

Theorem 2 provides a comprehensive characterization of the global convergence properties of proximal BiO-AIDm in regularized nonconvex and nonsmooth bi-level optimization. Specifically, item 1 shows that the parameter sequence \(\{x_k\}_k\) is asymptotically stable, and \(y_{k+1}\) asymptotically converges to the corresponding minimizer \(y^*(x_k)\) of the lower-level objective function \(g(x_k, \cdot )\). In particular, in the unregularized case (i.e., \(h=0\)), this result reduces to the existing understanding that the gradient norm \(\Vert \nabla \Phi (x)\Vert\) converges to zero (Ji et al., 2021; Ghadimi and Wang, 2018; Hong et al., 2020), which does not imply the convergence of the parameter sequence. Item 2 shows that the function value sequence converges to a finite limit, which is also the limit of the potential function value sequence \(\{H(x_k,y_k)\}_k\). Moreover, items 3 and 4 show that the parameter sequence \(\{x_k\}_k\) has only critical points of the objective function as its limit points, and these limit points lie in a flat region where the corresponding function value equals the constant \(H^*\). Note that due to the nonconvexity of \(\Phi\), \(H^*\) is not necessarily the optimal value, i.e., it is possible that \(H^*>\min _{x\in \mathbb {R}^d} (\Phi +h)(x)\). To summarize, Theorem 2 formally proves that proximal BiO-AIDm eventually converges to critical points in nonsmooth and nonconvex bi-level optimization.

In addition to the above global convergence result, Proposition 1 can be further leveraged to characterize the computation complexity of proximal BiO-AIDm for finding a critical point in regularized nonconvex bi-level optimization. Specifically, when the regularizer h is convex, we can define the following proximal gradient mapping associated with the objective function \(\Phi (x)+h(x)\).

$$\begin{aligned} G(x)=\frac{1}{\beta }\Big (x-\textrm{prox}_{\beta h} \big (x- \beta \nabla \Phi (x)\big )\Big ). \end{aligned}$$
(3)
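As a concrete illustration of (3) (with our own toy choice \(h=\lambda \Vert \cdot \Vert _1\), whose proximal mapping is soft thresholding):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_mapping(x, grad_phi, beta, lam):
    """Proximal gradient mapping (3) with h = lam * ||.||_1;
    x is (near-)critical iff the returned vector is (near-)zero."""
    return (x - soft_threshold(x - beta * grad_phi(x), beta * lam)) / beta
```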

The proximal gradient mapping is a standard metric for evaluating the optimality of regularized nonconvex optimization problems (Nesterov, 2013). It can be shown that x is a critical point of \(\Phi (x)+h(x)\) if and only if \(G(x)=\textbf{0}\), and G(x) reduces to the gradient \(\nabla \Phi (x)\) in the unregularized case. Hence, we define the convergence criterion as finding a near-critical point x that satisfies \(\Vert G(x)\Vert \le \epsilon\) for some pre-determined accuracy \(\epsilon >0\). We obtain the following global convergence rate and complexity of proximal BiO-AIDm.

Corollary 1

Suppose h is convex and the conditions of Proposition 1 hold. Then, the sequence \(\{x_k\}_k\) generated by Algorithm 1 satisfies the following convergence rate.

$$\begin{aligned} \min _{0\le k\le K}\Vert G(x_k)\Vert \le \sqrt{\frac{32}{K\beta }\big (H(x_0,y_0) - \inf _x (\Phi +h)(x)\big )}. \end{aligned}$$
(4)

Moreover, to achieve \(\min _{0\le k\le K}\Vert G(x_k)\Vert \le \epsilon\), we run the algorithm with \(K=32\epsilon ^{-2}(L_{\Phi }+\Gamma +\kappa _g^2)\big (H(x_0,y_0) - \inf _x (\Phi +h)(x)\big )\) outer iterations and \(T=\frac{\ln (8(1+\kappa _g))}{\ln ((1-\kappa _g^{-0.5})^{-1})}\) inner iterations, and the overall computation complexity is \(KT=\frac{32\ln (8(1+\kappa _g))}{\epsilon ^2\ln ((1-\kappa _g^{-0.5})^{-1})}(L_{\Phi }+\Gamma +\kappa _g^2)\big (H(x_0,y_0) - \inf _x (\Phi +h)(x)\big )\).

The dependence of the above computation complexity on \(\epsilon\) and \(\kappa :=\max (\kappa _f,\kappa _g)>1\) is no larger than \(\mathcal {O}(\kappa ^{3.5}(\ln \kappa )\epsilon ^{-2})\). This strictly improves the computation complexity \(\mathcal {O}(\kappa ^4\epsilon ^{-2})\) of BiO-AID, which only applies to smooth nonconvex bi-level optimization (Ji et al., 2021). To elaborate on the reason, in our algorithm, the T Nesterov's accelerated gradient descent steps applied to \(\min _y g(x_k,y)\) achieve the convergence rate \(\Vert y_{k+1}-y^*(x_k)\Vert \le (1+\kappa _g)(1-\kappa _g^{-0.5})^T \Vert y_k-y^*(x_k)\Vert\), which is faster than the rate \(\Vert y_{k+1}-y^*(x_k)\Vert \le (1-\kappa _g^{-1})^T \Vert y_k-y^*(x_k)\Vert\) of standard gradient descent since \(1-\kappa _g^{-0.5}<1-\kappa _g^{-1}\). Therefore, to ensure that \(\Vert y_{k+1}-y^*(x_k)\Vert \le \frac{1}{4} \Vert y_k-y^*(x_k)\Vert\), Nesterov's accelerated gradient descent requires \(T=\frac{\ln (8(1+\kappa _g))}{\ln ((1-\kappa _g^{-0.5})^{-1})}=\mathcal {O}(\sqrt{\kappa _g}\ln \kappa _g)\) steps, which is much smaller than the \(T=\mathcal {O}(\kappa _g)\) required by standard gradient descent. On the other hand, the number of outer iterations K is the same for both algorithms, so Nesterov's accelerated gradient descent yields a smaller overall computation complexity KT than standard gradient descent. In addition, (Ji et al., 2021) handles only a smooth upper-level function f, while we allow a nonsmooth regularizer h, which requires analyzing the nonconvex proximal gradient mapping (3). To the best of our knowledge, this is the first convergence rate and complexity result for a momentum accelerated algorithm solving regularized nonsmooth and nonconvex bi-level optimization problems. We note that another momentum accelerated bi-level optimization algorithm has been studied in (Ji and Liang, 2021), which only applies to unregularized (strongly) convex bi-level optimization problems.

5 Convergence rates under local nonconvex geometry

In the previous section, we have proved that the optimization trajectory generated by proximal BiO-AIDm approaches a compact set of critical points. Hence, we are further motivated to exploit the local function geometry around the critical points to study its local (asymptotic) convergence guarantees, which is the focus of this section. In particular, we consider a broad class of Łojasiewicz-type geometry of nonconvex functions.

5.1 Local Kurdyka–Łojasiewicz geometry

General nonconvex functions typically do not have a favorable global geometry. However, they may possess certain local geometry around the critical points that determines the local convergence rate of optimization algorithms. In particular, the Kurdyka-Łojasiewicz (KŁ) geometry characterizes a broad spectrum of local geometries of nonconvex functions (Bolte et al., 2007, 2014), and it generalizes various conventional global geometries such as strong convexity and the Polyak-Łojasiewicz geometry. Next, we formally introduce the KŁ geometry.

Definition 2

(KŁ geometry, Bolte et al. (2014)) A proper and lower semi-continuous function F is said to have the KŁ geometry if for every compact set \(\Omega \subset \textrm{dom}F\) on which F takes a constant value \(F_\Omega \in \mathbb {R}\), there exist \(\varepsilon , \lambda >0\) such that for all \(\bar{x} \in \Omega\) and all \(x\in \{z\in \mathbb {R}^m: \textrm{dist}_\Omega (z)<\varepsilon , F_\Omega< F(z) <F_\Omega + \lambda \}\), the following condition holds:

$$\begin{aligned} \varphi ' \left( F(x) - F_\Omega \right) \cdot \textrm{dist}_{\partial F(x)}(\textbf{0}) \ge 1, \end{aligned}$$
(5)

where \(\varphi '\) is the derivative of \(\varphi : [0,\lambda ) \rightarrow \mathbb {R}_+\) that takes the form \(\varphi (t) = \frac{c}{\theta } t^\theta\) for certain constant \(c>0\) and KŁ parameter \(\theta \in (0,1]\), and \(\textrm{dist}_{\partial F(x)}(\textbf{0}) = \min _{u\in \partial F(x)} \Vert u-\textbf{0}\Vert\) denotes the point-to-set distance.

As an intuitive explanation, when the function F is differentiable, the KŁ inequality in (5) can be rewritten as \(F(x)-F_{\Omega } \le \mathcal {O}(\Vert \nabla F(x)\Vert ^{\frac{1}{1-\theta }})\), which can be viewed as a type of local gradient dominance condition and generalizes the Polyak-Łojasiewicz (PŁ) condition (with parameter \(\theta =\frac{1}{2}\)) (Łojasiewicz, 1963; Karimi et al., 2016). In the existing literature, a large class of functions has been shown to have the local KŁ geometry, e.g., sub-analytic functions, logarithm and exponential functions, and semi-algebraic functions (Bolte et al., 2014). Moreover, the KŁ geometry has been exploited to establish the convergence of many gradient-based algorithms in nonconvex optimization, e.g., gradient descent (Attouch & Bolte, 2009; Li et al., 2017), the accelerated gradient method (Zhou et al., 2020), alternating minimization (Bolte et al., 2014) and distributed gradient methods (Zhou et al., 2016).
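As a standard worked example (our own, for illustration), every \(\mu\)-strongly convex differentiable function F satisfies the KŁ property with \(\theta =\frac{1}{2}\): minimizing both sides of the strong convexity inequality over y gives

$$\begin{aligned} F^* \ge F(x) - \frac{1}{2\mu }\Vert \nabla F(x)\Vert ^2 \quad \Longrightarrow \quad F(x)-F^* \le \frac{1}{2\mu }\Vert \nabla F(x)\Vert ^2, \end{aligned}$$

which is exactly (5) with \(\theta =\frac{1}{2}\) and \(c=1/\sqrt{2\mu }\).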

5.2 Convergence rates of proximal BiO-AIDm under KŁ geometry

In this subsection, we obtain the following asymptotic function value convergence rates of the proximal BiO-AIDm algorithm under different parameter ranges of the KŁ geometry. Throughout, we define \(k_0\in \mathbb {N}^+\) to be a sufficiently large integer.

Theorem 3

Let Assumptions 1 and 2 hold and assume that the potential function H defined in (2) has the KŁ geometry. Then, under the same choices of hyper-parameters as those of Proposition 1, the potential function value sequence \(\{H(x_k,y_k)\}_k\) converges to its limit \(H^*\) (see its definition in Theorem 2) at the following rates.

  1. If the KŁ geometry holds with \(\theta \in \big (\frac{1}{2},1\big )\), then \(H(x_k,y_k)\downarrow H^*\) super-linearly as

     $$\begin{aligned} H(x_k,y_k)-H^*\le {\mathcal {O}}\Big (\exp \Big (-\Big (\frac{1}{2(1-\theta )}\Big )^{k-k_0}\Big )\Big ), \quad \forall k\ge k_0; \end{aligned}$$
     (6)

  2. If the KŁ geometry holds with \(\theta =\frac{1}{2}\), then \(H(x_k,y_k)\downarrow H^*\) linearly as (for some constant \(C>0\))

     $$\begin{aligned} H(x_k,y_k)-H^*\le {\mathcal {O}}\big ((1+C)^{-(k-k_0)}\big ),\quad \forall k\ge k_0; \end{aligned}$$
     (7)

  3. If the KŁ geometry holds with \(\theta \in \big (0,\frac{1}{2}\big )\), then \(H(x_k,y_k)\downarrow H^*\) sub-linearly as

     $$\begin{aligned} H(x_k,y_k)-H^*\le {\mathcal {O}}\big ((k-k_0)^{-\frac{1}{1-2\theta }}\big ),\quad \forall k\ge k_0. \end{aligned}$$
     (8)

Intuitively, a larger KŁ parameter \(\theta\) implies that the local geometry of the potential function H is sharper, which leads to an orderwise faster convergence rate as shown in Theorem 3. In particular, when the KŁ geometry holds with \(\theta =\frac{1}{2}\), the proximal BiO-AIDm algorithm converges at a linear rate, which matches the convergence rate of bi-level optimization under the stronger geometry where both the upper- and lower-level objective functions are strongly convex (Ghadimi and Wang, 2018). To the best of our knowledge, the above result provides the first function value convergence rate analysis of proximal BiO-AIDm in the full spectrum of the nonconvex local KŁ geometry. The proof of Theorem 3 involves two novel techniques. The first is to establish the monotonically decreasing potential function \(H(x,y')\) in Proposition 1. This monotonic decreasing property guarantees that the sequence \(\{(x_k,y_k)\}_k\) generated by Algorithm 1 enters the neighborhood of a critical point \((x^*,y^*(x^*))\) where the KŁ property of H holds, which is essential for proving the function value convergence rates in Theorem 3. The other technical novelty is to prove the Lipschitz property of the mapping \(y^{(T)}(x,y)\) in Lemma 2, which is the key to establishing the asymptotic convergence rates in Theorem 3. To the best of our knowledge, this Lipschitz property has not been established in the existing literature for the mapping defined by Nesterov's accelerated gradient descent steps, and it is challenging to prove due to the momentum acceleration. To address this challenge, we recursively write \(y^{(t)}\) as \(y^{(t)}(x,y)=(1+\eta )G_x(y^{(t-1)}(x,y))-\eta G_x(y^{(t-2)}(x,y))\), where \(G_x(y):=y-\alpha \nabla _y g(x,y)\) is the gradient descent mapping. We then leverage the Lipschitz property of \(G_x\) (Hardt et al., 2016) to establish the Lipschitz property of \(y^{(t)}\) via induction on t.
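To see how the induction works, here is a simplified sketch (ours; the precise constants are tracked in Lemma 2). Suppose \(G_x(\cdot )\) is \(L_G\)-Lipschitz and let \(c_t\) denote a Lipschitz constant of \(y^{(t)}(x,\cdot )\). The recursion then gives

$$\begin{aligned} c_t \le (1+\eta )L_G\, c_{t-1} + \eta L_G\, c_{t-2} \le (1+2\eta )L_G \max (c_{t-1}, c_{t-2}), \end{aligned}$$

so with \(C:=\max \big (1, (1+2\eta )L_G\big )\) and \(c_0=1\), induction yields \(c_t\le C^t\) for all \(t\le T\), i.e., \(y^{(T)}(x,\cdot )\) is Lipschitz with a constant depending only on T, \(\eta\) and \(L_G\).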

6 Experiment

We apply our bi-level optimization algorithm to solve a regularized data cleaning problem (Shaban et al., 2019) with the MNIST dataset (LeCun et al., 1998) and a linear classification model. We generate a training dataset \(\mathcal {D}_{\text {tr}}\) with 20k samples, a validation dataset \(\mathcal {D}_{\text {val}}\) with 20k samples, and a test dataset with 10k samples. In particular, we corrupt the training data by randomizing a proportion \(p\in (0,1)\) of their labels, and the goal of this application is to identify and avoid using these corrupted training samples. The corresponding bi-level problem is written as follows.

$$\begin{aligned}&\min _{\lambda } \frac{1}{|\mathcal {D}_{\text {val}}|} \sum _{(x_i,y_i)\in \mathcal {D}_{\text {val}}} L\big (w^*(\lambda )^{\top }x_i,y_i\big ) - \frac{\gamma }{|\mathcal {D}_{\text {tr}}|}\sum _{(x_i,y_i)\in \mathcal {D}_{\text {tr}}}\min (|\lambda _i|,a), \nonumber \\&\textrm{where}~~w^*(\lambda )=\mathop {\arg \min }_w\frac{1}{|\mathcal {D}_{\text {tr}}|}\sum _{(x_i,y_i)\in \mathcal {D}_{\text {tr}}} \sigma (\lambda _i)L\big (w^{\top }x_i,y_i\big )+\rho \Vert w\Vert ^2, \end{aligned}$$
(9)

where \(x_i, y_i\) denote the data and label of the i-th sample, respectively, \(\sigma (\cdot )\) denotes the sigmoid function, L is the cross-entropy loss, and \(\rho ,\gamma >0\) are regularization hyperparameters. The regularizer \(\rho \Vert w\Vert ^2\) makes the lower-level objective function strongly convex. In particular, we add the nonconvex and nonsmooth regularizer \(-\gamma \min (|\lambda _i |,a)\) to the upper-level objective function (see Appendix G for the analytical solution to its proximal mapping). Intuitively, it encourages \(|\lambda _i|\) to approach the large positive constant a, so that the training sample coefficient \(\sigma (\lambda _i)\) is close to either 0 or 1 for corrupted and clean training samples, respectively. In this experiment we set \(a=20\). Therefore, such a regularized bi-level data cleaning problem belongs to the problem class considered in this paper.
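The analytical form of this proximal mapping is derived in Appendix G; as an independent illustration (our own sketch, not the appendix derivation), the prox can also be computed entrywise by exploiting the piecewise-quadratic structure of the objective and comparing finitely many closed-form candidates:

```python
import numpy as np

def prox_neg_min_abs(v, step, gamma, a):
    """Entrywise prox of h(u) = -gamma * min(|u|, a): for each entry vi,
    argmin_u h(u) + (u - vi)**2 / (2 * step).  The objective is piecewise
    quadratic, so the minimizer lies among a few closed-form candidates."""
    def prox_scalar(vi):
        obj = lambda u: -gamma * min(abs(u), a) + (u - vi) ** 2 / (2.0 * step)
        cands = [0.0, a, -a, vi,                        # kinks and outer pieces
                 min(max(vi + step * gamma, 0.0), a),   # piece u in [0, a]
                 min(max(vi - step * gamma, -a), 0.0)]  # piece u in [-a, 0]
        return min(cands, key=obj)
    return np.vectorize(prox_scalar)(np.asarray(v, dtype=float))
```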

We compare the performance of our proximal BiO-AIDm with several bi-level optimization algorithms, including proximal BiO-AID (without accelerated AID), BiO-AID (without accelerated AID) and BiO-AIDm (with accelerated AID). In particular, for BiO-AID and BiO-AIDm, we apply them to solve the unregularized data cleaning problem (i.e., \(\gamma =0\)). This serves as a baseline that helps understand the impact of regularization on the test performance. In addition, we also implement all these algorithms with the AID scheme replaced by the ITD scheme to demonstrate their generality.

Fig. 1 Comparison of bi-level optimization algorithms under data corruption rate \(p=0.1\) (top row), \(p=0.2\) (middle row) and \(p=0.4\) (bottom row). The proximal algorithms in the right two columns correspond to \(\gamma =5\). The y-axis corresponds to the upper-level objective function value, and the x-axis corresponds to the overall computation complexity (number of inner gradient descent steps)

Hyperparameter setup We consider corruption rates \(p=0.1, 0.2, 0.4\), regularization parameters \({\frac{\gamma }{20000}=0,0.001,0.1,100}\) and \(\rho =10^{-3}\). We run each algorithm for \(K=50\) outer iterations with stepsize \(\beta =0.5\) and \(T=5\) inner gradient steps with stepsize \(\alpha =0.1\). For the algorithms with momentum accelerated AID/ITD, we set the momentum parameter \(\eta =1.0\).

Table 2 Comparison of test accuracy (test loss). (The regularizer coefficient \({\frac{\gamma }{20000}}=0\) corresponds to the four non-proximal algorithms BiO-AID(m) and BiO-ITD(m), and \({\frac{\gamma }{20000}=0.001,0.1,100}\) correspond to the proximal variants of the four algorithms. The best test accuracy and test loss for each corruption rate p are bolded)

6.1 Optimization performance

We first investigate the effect of momentum acceleration on the optimization performance. In Fig. 1, we plot the upper-level objective function value versus the computation complexity for different bi-level algorithms under \({\frac{\gamma }{20000}=0.001}\) and different data corruption rates. In these figures, we separately compare the non-proximal algorithms and the proximal algorithms, as their upper-level objective functions are different (the non-proximal algorithms solve the unregularized bi-level problem). It can be seen that all the bi-level optimization algorithms with momentum accelerated AID/ITD schemes consistently converge faster than their unaccelerated counterparts. The reason is that the momentum scheme accelerates the convergence of the inner gradient descent steps, which yields a more accurate implicit gradient and thus accelerates the convergence of the outer iterations. In addition, since both axes are plotted on a log scale, all the curves decrease almost in straight lines, which matches the polynomial dependence of our computation complexity \(\mathcal {O}(\kappa ^{3.5}(\ln \kappa )\epsilon ^{-2})\) on \(\epsilon\) (see Corollary 1).

6.2 Test performance

To understand the impact of momentum and the nonconvex regularization on the test performance of the model, we report the test accuracy and test loss of the models trained by all the algorithms in Table 2. It can be seen that the bi-level optimization algorithms with momentum accelerated AID/ITD (columns 2 & 4 of Table 2) achieve significantly better test performance than their unaccelerated counterparts (columns 1 & 3 of Table 2). This demonstrates the advantage of introducing momentum to accelerate the AID/ITD schemes. Furthermore, we observe that the test loss first decreases and then increases as the regularizer coefficient \(\gamma\) increases. Therefore, adding such a regularizer with a properly chosen coefficient \(\gamma\) improves the test performance by separating the sample coefficients \(\sigma (\lambda _i)\) of corrupted and clean training samples. In particular, proximal BiO-ITDm with \({\frac{\gamma }{20000}=0.1}\) achieves the best test performance under each corruption rate p (bolded in Table 2), which again demonstrates the advantage of the regularizer and momentum acceleration. Lastly, a larger corruption rate p leads to lower test performance, which is reasonable.

7 Conclusion

In this paper, we provided a comprehensive analysis of the proximal BiO-AIDm algorithm with momentum acceleration for solving regularized nonconvex and nonsmooth bi-level optimization problems. Our key finding is that this algorithm admits an intrinsic monotonically decreasing potential function, which fully tracks the bi-level optimization progress. Based on this result, we established the first global convergence rate of proximal BiO-AIDm to a critical point in regularized nonconvex optimization, which is faster than that of BiO-AID. We also characterized the asymptotic convergence behavior and rates of the algorithm under the local KŁ geometry. We anticipate that this new analysis framework can be extended to study the convergence of other bi-level optimization algorithms, including stochastic bi-level optimization. In particular, it would be interesting to explore how bi-level optimization algorithm design affects the form of the potential function and leads to different convergence guarantees and rates in nonconvex bi-level optimization.