Abstract
Stochastic sequential quadratic optimization (SQP) methods for solving continuous optimization problems with nonlinear equality constraints have attracted attention recently, such as for solving large-scale data-fitting problems subject to nonconvex constraints. However, for a recently proposed subclass of such methods that is built on the popular stochastic-gradient methodology from the unconstrained setting, convergence guarantees have been limited to the asymptotic convergence of the expected value of a stationarity measure to zero. This is in contrast to the unconstrained setting in which almost-sure convergence guarantees (of the gradient of the objective to zero) can be proved for stochastic-gradient-based methods. In this paper, new almost-sure convergence guarantees for the primal iterates, Lagrange multipliers, and stationarity measures generated by a stochastic SQP algorithm in this subclass of methods are proved. It is shown that the error in the Lagrange multipliers can be bounded by the distance of the primal iterate to a primal stationary point plus the error in the latest stochastic gradient estimate. It is further shown that, subject to certain assumptions, this latter error can be made to vanish by employing a running average of the Lagrange multipliers that are computed during the run of the algorithm. The results of numerical experiments are provided to demonstrate the proved theoretical guarantees.
1 Introduction
In this paper, we study convergence guarantees that can be offered for a stochastic algorithm for solving continuous optimization problems with nonlinear equality constraints. Such problems arise in a variety of important areas throughout science and engineering, including optimal control, PDE-constrained optimization, network optimization, and resource allocation [6, 8, 23, 34], and have recently arisen in new and interesting modern application areas such as constrained deep neural network training (e.g., physics-informed learning [3], where one can impose hard constraints rather than merely define the loss function to minimize residual errors [3]). In certain instances of such problems, the objective function can be defined as an expectation over a random variable argument. In this context, the useful features of the algorithm that we study are that it is applicable when only (unbiased) stochastic estimates of the gradient of the objective function are tractable to obtain during the optimization, while at the same time it can exploit exact constraint function and derivative values, which are assumed tractable to obtain.
Sequential quadratic optimization (commonly known as SQP) methods are a popular and powerful class of derivative-based algorithms for solving continuous constrained optimization problems. SQP methods that are stochastic in nature (due to their use of stochastic gradient estimates in place of true gradients of the objective) have been investigated recently for solving problems of the aforementioned type, namely, ones for which the objective is defined by an expectation. For one subclass of stochastic SQP methods developed in [2, 3, 16, 17] that is based on the classical stochastic-gradient methodology (known more broadly as stochastic approximation) [35], convergence guarantees have been limited to asymptotic convergence of the expected value of a stationarity measure to zero; see, e.g., [3, Corollary 3.14]. In this paper, we present new analyses of situations in which, for an instance of a method in this subclass, almost-sure convergence of the primal iterates, Lagrange multipliers, and a stationarity measure can be guaranteed. (For a straightforward analysis, we focus on a simplified variant of the algorithm from [3].) This brings the analysis of this subclass of methods in line with analyses that show almost-sure convergence of stochastic-gradient-based methods for the unconstrained setting; see §1.2. Other stochastic SQP-based methods for solving continuous optimization problems with nonlinear equality constraints have been proposed recently (see, e.g., [1, 4, 5, 19, 26, 27, 30, 33]), and for these methods different types of convergence guarantees have been proved due to the fact that they impose different (often stronger) requirements on the stochastic gradient estimates. Therefore, our work in this paper is distinct from analyses for these other methods, although we contend that the tools employed in this paper might be applicable when analyzing other related algorithms as well. One article that presents results comparable to ours is [28], although the algorithm and assumptions in that article differ from ours.
Our first contribution is an analysis of situations in which almost-sure convergence of the primal iterates of a stochastic SQP method can be guaranteed. Our main assumption for this analysis can be viewed as a generalization of the Polyak-Łojasiewicz (PL) condition used in unconstrained optimization.
Our second contribution is an analysis of convergence of the sequence of Lagrange multipliers generated by a stochastic SQP method. The convergence behavior of the Lagrange multipliers is of interest for several reasons. For one thing, Lagrange multipliers are used for common certificates of stationarity. They can also play an essential role in sensitivity analysis and active-set identification when solving inequality-constrained problems. Since various algorithms for solving inequality-constrained problems employ algorithms for solving equality-constrained subproblems, it is important to provide convergence guarantees for the Lagrange multipliers when solving equality-constrained problems. In our analysis, we first show that the expected error in the Lagrange multiplier computed in any iteration of our stochastic SQP method of interest can be bounded by the distance of the primal iterate to a primal stationary point plus a term related to the error in the latest stochastic gradient estimate. Such a result is natural, and as in other settings of statistical estimation it suggests that better Lagrange multiplier estimates can be obtained through averaging the multipliers obtained in each iteration. We consider such an approach as well; in particular, we show conditions under which averaging can cause the expected error to vanish asymptotically. (Our approach to averaging is related to the ergodic convergence analysis of optimization algorithms; see, e.g., [12, 21, 32]. It is worthwhile to emphasize that it is different from dual averaging ideas that have been developed for acceleration [29].)
Under our combined assumptions, the stochastic SQP method that we analyze possesses almost-sure convergence guarantees of the primal iterates, Lagrange multipliers, and a stationarity measure while only employing stochastic gradient estimates. To illustrate the practical relevance of our theoretical contributions, we provide the results of numerical experiments. For example, we demonstrate situations when averaging of the Lagrange multipliers results in more accurate stationarity measures, which, as mentioned, is useful in practice for recognizing (approximate) stationarity and other reasons.
1.1 Notation
We use \(\mathbb {R} \) to denote the set of real numbers and \(\mathbb {R} _{\ge r}\) (resp., \(\mathbb {R} _{>r}\)) to denote the set of real numbers greater than or equal to (resp., greater than) \(r \in \mathbb {R} \). We use \(\mathbb {R}^{n}\) to denote the set of n-dimensional real vectors and \(\mathbb {R}^{m \times n}\) to denote the set of m-by-n-dimensional real matrices. We use \({\mathbb {S}}^n\) to denote the set of symmetric matrices in \(\mathbb {R}^{n \times n}\), \({\mathbb {S}}_{\succeq 0}^n\) to denote the set of positive semidefinite matrices in \({\mathbb {S}}^n\), and \({\mathbb {S}}_{\succ 0}^n\) to denote the set of positive definite matrices in \({\mathbb {S}}_{\succeq 0}^n\). We use \(\mathbb {N} \) to denote the positive integers, and for any \(k \in \mathbb {N} \) define \([k] {:}{=} \{1,\dots ,k\}\).
Given a real matrix \(A \in \mathbb {R}^{n \times m}\), we use \(\sigma _{\min }(A)\) to denote the smallest singular value of A and \(\sigma _{\max }(A) = \Vert A\Vert _2\) to denote the largest singular value of A, i.e., the spectral norm of A. Given such a matrix A with full column rank, we denote by \(A^\dagger = (A^TA)^{-1}A^T\) the Moore–Penrose pseudoinverse of A. For future reference, we state the following lemma pertaining to pseudoinverses.
Lemma 1.1
(see [37, Theorem 4.1]) If \(A \in \mathbb {R}^{n \times m}\) and \(B \in \mathbb {R}^{n \times m}\) have full column rank, then \(\Vert A^\dagger -B^\dagger \Vert _2 \le \Vert A^\dagger \Vert _2 \Vert B^\dagger \Vert _2 \Vert A-B\Vert _2\).
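To make the bound concrete, the following minimal numerical check (an illustration of ours, not taken from [37]; the dimensions and data are arbitrary) verifies the inequality of Lemma 1.1 on random full-column-rank matrices:

```python
# Sanity check of Lemma 1.1 on random data; dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.standard_normal((n, m))          # full column rank with probability 1
B = A + 0.1 * rng.standard_normal((n, m))

A_dag = np.linalg.pinv(A)                # Moore-Penrose pseudoinverse (A^T A)^{-1} A^T
B_dag = np.linalg.pinv(B)

lhs = np.linalg.norm(A_dag - B_dag, 2)
rhs = np.linalg.norm(A_dag, 2) * np.linalg.norm(B_dag, 2) * np.linalg.norm(A - B, 2)
assert lhs <= rhs + 1e-12                # the bound of Lemma 1.1
```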
For a real-valued sequence \(\{x_k\}\) (of numbers, vectors, or matrices), we use \(\{x_k\} \subset \mathbb {R}^{n}\) to indicate that \(x_k \in \mathbb {R}^{n}\) for all \(k \in \mathbb {N} \). Similarly, for a sequence of random variables \(\{X_k\}\), we use \(\{X_k\} \subset \mathbb {R}^{n}\) to indicate that \(X_k \in \mathbb {R}^{n}\) for all \(k \in \mathbb {N} \), which in turn means that, for all \(k \in \mathbb {N} \), a realization/outcome of \(X_k\) is an element of \(\mathbb {R}^{n}\). Generally, we use a capital letter to denote a random variable and the corresponding lower-case letter to denote a realization of the random variable. For example, a stochastic objective gradient estimator at iteration \(k \in \mathbb {N} \) is denoted as \(G_k\), of which a realization is written as \(g_k\).
Let \(\{V_k\} \subset \mathbb {R}^{n \times m}\) be a sequence of random variables and V be a random variable all defined with respect to a probability space \((\varOmega , {{\mathcal {F}}}, \mathbb {P})\); i.e., a realization \(\omega \in \varOmega \) defines a realization \(V_k(\omega )\) for any \(k \in \mathbb {N} \) and a realization \(V(\omega )\). The sequence \(\{V_k\}\) converges in distribution to V if and only if the cumulative distribution functions (CDFs) of the elements of \(\{V_k\}\) converge pointwise to the CDF of V as \(k \rightarrow \infty \); e.g., for such \(\{V_k\} \subset \mathbb {R}^{n}\) and a matrix \(\varSigma \in {\mathbb {S}}^n\), we write \(\{V_k\} \xrightarrow {{d}}{{\mathcal {N}}}(0, \varSigma )\) to indicate that \(\{V_k\}\) converges in distribution to a multivariate normal random vector with mean zero and covariance matrix \(\varSigma \). The sequence \(\{V_k\}\) converges in probability to V as \(k \rightarrow \infty \), which we denote by \(\{V_k\} \xrightarrow {{p}}V\), if and only if for any \(\epsilon \in \mathbb {R} _{>0}\) one finds
$$\begin{aligned} \lim _{k \rightarrow \infty } \mathbb {P}[\Vert V_k - V\Vert _2 > \epsilon ] = 0. \end{aligned}$$
Finally, \(\{V_k\}\) converges almost-surely to V as \(k \rightarrow \infty \), which we denote by \(\{V_k\} \xrightarrow {{a.s.}}V\), if and only if there exists \(\varOmega _0 \subset \varOmega \) with \(\mathbb {P}(\varOmega _0) = 0\) such that
$$\begin{aligned} \lim _{k \rightarrow \infty } V_k(\omega ) = V(\omega ) \ \text { for all }\ \omega \in \varOmega \setminus \varOmega _0. \end{aligned}$$
We use \({\textbf{1}}_A\) to denote an indicator random variable for the event A, which takes the value 1 if event A occurs and takes the value 0 otherwise. Given a probability space \((\varOmega ,{{\mathcal {F}}},\mathbb {P})\) corresponding to outcomes, events, and associated probabilities pertaining to the possible runs of a stochastic algorithm, we use \(\mathbb {E}[\cdot ]\) to denote expectation taken with respect to \(\mathbb {P}[\cdot ]\) and, given a sub-\(\sigma \)-algebra \({\overline{{{\mathcal {F}}}}}\) of \({{\mathcal {F}}}\), we use \(\mathbb {E}[\cdot | {\overline{{{\mathcal {F}}}}}]\) to denote expectation conditioned on \({\overline{{{\mathcal {F}}}}}\) taken with respect to \(\mathbb {P}[\cdot ]\). In settings when a random variable \(\iota \) has been introduced, we use \(\mathbb {E}_{\iota }[\cdot ]\) to denote expectation taken with respect to the distribution of \(\iota \).
1.2 Mathematical Background
In the classical article by Robbins and Monro [35], it is shown that a straightforward approach of stochastic approximation for solving an equation with a unique root \(x_\star \) can lead to convergence in probability of the iterate sequence. Specifically, under certain basic assumptions, as long as the prescribed sequence of step sizes \(\{\alpha _k\}\) employed by the algorithm is unsummable, but square summable (e.g., \(\alpha _k = 1/k\) for all \(k \in \mathbb {N} \)), it can be shown that the generated sequence of solution estimates \(\{X_k\}\) satisfies
$$\begin{aligned} \lim _{k \rightarrow \infty } \mathbb {E}[\Vert X_k - x_\star \Vert _2^2] = 0, \end{aligned}$$
(1)
which in turn implies that \(\{X_k\} \xrightarrow {{p}}x_\star \). Cast into the context of minimizing a smooth, potentially nonconvex objective \(f : \mathbb {R}^{n} \rightarrow \mathbb {R} \) with a stochastic-gradient method, these same principles can be used to prove (see, e.g., [9])
$$\begin{aligned} \lim _{k \rightarrow \infty } \mathbb {E}[\Vert \nabla f(X_k)\Vert _2^2] = 0. \end{aligned}$$
(2)
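For readers who prefer code, the following toy sketch (entirely illustrative; the quadratic objective and noise model are our own assumptions, not from [35] or [9]) runs such a stochastic-gradient iteration with the unsummable, square-summable choice \(\alpha _k = 1/k\):

```python
# Toy Robbins-Monro / stochastic-gradient iteration with alpha_k = 1/k.
import numpy as np

rng = np.random.default_rng(1)
x = np.ones(5)                              # minimize f(x) = 0.5*||x||_2^2
for k in range(1, 100001):
    g = x + 0.5 * rng.standard_normal(5)    # unbiased estimate of grad f(x) = x
    x = x - (1.0 / k) * g                   # step size alpha_k = 1/k
print(np.linalg.norm(x))                    # approaches the unique root x_* = 0
```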
In a later article by Robbins and Siegmund [36], certain conditions are shown to offer an almost-sure convergence guarantee for a sequence generated by stochastic approximation. In particular, the article first proves a general theorem, which we state in a slightly simplified form for our purposes.
Lemma 1.2
(see [36, Theorem 1]) Let \((\varOmega ,{{\mathcal {F}}},\mathbb {P})\) be a probability space and let \(\{{{\mathcal {F}}}_k\}\) with \({{\mathcal {F}}}_k \subseteq {{\mathcal {F}}}_{k+1}\) for all \(k \in \mathbb {N} \) be a sequence of sub-\(\sigma \)-algebras of \({{\mathcal {F}}}\). Let \(\{R_k\}\), \(\{P_k\}\), and \(\{Q_k\}\) be sequences of nonnegative random variables such that, for all \(k \in \mathbb {N} \), the random variables \(R_k\), \(P_k\), and \(Q_k\) are \({{\mathcal {F}}}_k\)-measurable. If \(\sum _{k=1}^\infty Q_k < \infty \) and, for all \(k \in \mathbb {N} \), one has
$$\begin{aligned} \mathbb {E}[R_{k+1} \,|\, {{\mathcal {F}}}_k] \le R_k - P_k + Q_k, \end{aligned}$$
then, almost-surely, \(\sum _{k=1}^\infty P_k < \infty \) and \(\displaystyle \lim _{k \rightarrow \infty } R_k\) exists and is finite.
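For intuition, a standard way (our illustration, not taken from [36]) to invoke Lemma 1.2 in the unconstrained stochastic-gradient setting is the following: with Lipschitz constant \(L\) for \(\nabla f\), noise bound \(\sigma ^2\), and step sizes \(\alpha _k \le 1/L\), one may choose
$$\begin{aligned} R_k = f(X_k) - f_{\inf }, \quad P_k = \tfrac{1}{2} \alpha _k \Vert \nabla f(X_k)\Vert _2^2, \quad \text {and} \quad Q_k = \tfrac{1}{2} L \sigma ^2 \alpha _k^2, \end{aligned}$$
for which the standard descent inequality yields \(\mathbb {E}[R_{k+1} \,|\, {{\mathcal {F}}}_k] \le R_k - P_k + Q_k\); square-summability of \(\{\alpha _k\}\) gives \(\sum _{k=1}^\infty Q_k < \infty \), so the lemma yields \(\sum _{k=1}^\infty \alpha _k \Vert \nabla f(X_k)\Vert _2^2 < \infty \) almost surely.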
Applied to the context of stochastic approximation for solving an equation, it is shown in [36] that under certain assumptions (which we omit for brevity) the conclusion in (1) can be strengthened for the same algorithm to
$$\begin{aligned} \mathbb {P}\Big [\lim _{k \rightarrow \infty } X_k = x_\star \Big ] = 1, \end{aligned}$$
which is to say that \(\{X_k\} \xrightarrow {{a.s.}}x_\star \). One can derive a similar such result in the context of minimizing a smooth, potentially nonconvex objective function \(f : \mathbb {R}^{n} \rightarrow \mathbb {R} \) with a stochastic-gradient method. For example, under a related set of conditions, it has been shown by Bertsekas and Tsitsiklis [7] that a stochastic-gradient method can be guaranteed to yield
$$\begin{aligned} \mathbb {P}\Big [\lim _{k \rightarrow \infty } \nabla f(X_k) = 0\Big ] = 1. \end{aligned}$$
Our main contributions in this paper are also almost-sure convergence guarantees, but for a stochastic SQP method in the context of nonlinear-equality-constrained optimization. In particular, we show almost-sure convergence guarantees for the primal iterates, Lagrange multipliers, and stationarity measures for a simplified variant of the algorithm from [3]. Some of our almost-sure convergence guarantees for the Lagrange multipliers computed by our method of interest pertain to an averaged sequence. For our analysis of this sequence, we require two key results that are known from the literature. The first result that we need is the central limit theorem (CLT) for a multidimensional martingale difference triangular array stated as Lemma 1.3.
Lemma 1.3
(Multidimensional martingale central limit theorem) Let \(\{(\eta _{k,i}, {{\mathcal {F}}}_{k,i})\}_{k \in \mathbb {N} ,i \in [k]}\) be an n-dimensional martingale difference triangular array, i.e., with an initial generating \(\sigma \)-algebra \({{\mathcal {F}}}_{k,1}\) for all \(k \in \mathbb {N} \), one has
-
(i)
\({{\mathcal {F}}}_{k,i} = \sigma (\eta _{k,1},\dots ,\eta _{k,i-1})\) for all \(k \in \mathbb {N} \) and \(i \in \{2,\dots ,k\}\) and
-
(ii)
\(\mathbb {E}[\eta _{k,i} | {{\mathcal {F}}}_{k,i}] = 0\) for all \(k \in \mathbb {N} \) and \(i \in [k]\).
If the array has the properties that
$$\begin{aligned} \sum _{i=1}^k \mathbb {E}[\eta _{k,i} \eta _{k,i}^T \,|\, {{\mathcal {F}}}_{k,i}] \xrightarrow {{p}}\varSigma \ \text { for some }\ \varSigma \in {\mathbb {S}}^n \end{aligned}$$
(3a)
and, for every \(\epsilon \in \mathbb {R} _{>0}\),
$$\begin{aligned} \sum _{i=1}^k \mathbb {E}[\Vert \eta _{k,i}\Vert _2^2 {\textbf{1}}_{\{\Vert \eta _{k,i}\Vert _2 \ge \epsilon \}} \,|\, {{\mathcal {F}}}_{k,i}] \xrightarrow {{p}}0, \end{aligned}$$
(3b)
then \(\displaystyle \left\{ \sum _{i=1}^k \eta _{k,i}\right\} \xrightarrow {{d}}{{\mathcal {N}}}(0, \varSigma )\).
The multidimensional martingale CLT in Lemma 1.3 can be derived by applying the one-dimensional martingale CLT (see, e.g., [25, Theorem 2.3] and [20, Corollary 3.1]) to \(\eta _{k,i}^a {:}{=} a^T \eta _{k,i}\) for arbitrary \(a \in \mathbb {R}^{n}\) since \(\{\sum _{i=1}^k \eta _{k,i}^a\} \xrightarrow {{d}}{{\mathcal {N}}}(0, a^T\varSigma a)\) for any given \(a \in \mathbb {R}^{n}\) implies that \(\{\sum _{i=1}^k \eta _{k,i}\} \xrightarrow {{d}}{{\mathcal {N}}}(0,\varSigma )\); see, e.g., [18, Exercise 3.10.8]. A similar result is used in [24]. We refer to (3b) as Lindeberg’s condition, as is common in the literature.
The second key result that we need is the following, which we refer to as a moment convergence result. It follows, e.g., by [14, Theorem 4.5.2].
Lemma 1.4
(Moment convergence) Let \(\{X_k\} \subset \mathbb {R}^{n}\) be a sequence of random vectors such that \(\{X_k\} \xrightarrow {{d}}X\) and \(\displaystyle \sup _{k\in \mathbb {N} } \mathbb {E}[\Vert X_k\Vert _2^\varTheta ] < \infty \) for some \(\varTheta \in \mathbb {R} _{>0}\). Then, for all \(\theta \in (0,\varTheta )\), one finds that \(\displaystyle \lim _{k\rightarrow \infty } \mathbb {E}[\Vert X_k\Vert _2^\theta ] = \mathbb {E}[\Vert X\Vert _2^\theta ]\).
1.3 Outline
In Sect. 2, we present formally the continuous optimization problem of interest and the stochastic SQP algorithm for solving it that we analyze in the remainder of the paper. We also provide preliminary assumptions that we make about the problem and the algorithm, and state basic properties of the algorithm that transfer from the prior work in [3]. In Sect. 3, we prove convergence results for the primal iterates generated by the algorithm. In Sect. 4, we prove convergence results for Lagrange multiplier sequences that are generated by the algorithm. The results of numerical experiments are provided in Sect. 5 and concluding remarks are offered in Sect. 6.
2 Problem and Algorithm Descriptions
The algorithm that we study is designed to solve
$$\begin{aligned} \min _{x \in \mathbb {R}^{n}}\ f(x) \ \ \text {subject to}\ \ c(x) = 0, \quad \text {where}\ \ f(x) = \mathbb {E}_\iota [F(x, \iota )], \end{aligned}$$
(4)
where \(f : \mathbb {R}^{n} \rightarrow \mathbb {R} \) and \(c : \mathbb {R}^{n} \rightarrow \mathbb {R}^{m}\) are continuously differentiable, \(\iota \) is a random variable with associated probability space \((\varOmega _\iota , {{\mathcal {F}}}_\iota , \mathbb {P}_\iota )\), \(F : \mathbb {R}^{n} \times \varOmega _\iota \rightarrow \mathbb {R} \), and \(\mathbb {E}_\iota [\cdot ]\) denotes expectation taken with respect to \(\mathbb {P}_\iota \). Given an initial point \(x_1 \in \mathbb {R}^{n}\), any run of the algorithm generates a sequence of iterates \(\{x_k\} \subset \mathbb {R}^{n}\), i.e., a realization of a stochastic process \(\{X_k\} \subset \mathbb {R}^{n}\). We make the following assumption about problem (4) and the generated sequence of iterates.
Assumption 2.1
There exists open convex \({{\mathcal {X}}}\subseteq \mathbb {R}^{n}\) containing \(\{X_k\} \subset \mathbb {R}^{n}\) such that the following hold. The objective function \(f : \mathbb {R}^{n} \rightarrow \mathbb {R} \) is continuously differentiable and bounded below over \({{\mathcal {X}}}\) and its gradient function \(\nabla f : \mathbb {R}^{n} \rightarrow \mathbb {R}^{n}\) is bounded and Lipschitz continuous over \({{\mathcal {X}}}\). The constraint function \(c : \mathbb {R}^{n} \rightarrow \mathbb {R}^{m}\) (where \(m \le n\)) is Lipschitz continuous, continuously differentiable, and bounded over \({{\mathcal {X}}}\), its Jacobian function \(\nabla c^T : \mathbb {R}^{n} \rightarrow \mathbb {R}^{m \times n}\) is Lipschitz continuous over \({{\mathcal {X}}}\), and \(\nabla c(x)^T\) has full row rank for all \(x \in {{\mathcal {X}}}\) with singular values that are bounded below uniformly by a positive constant over \({{\mathcal {X}}}\).
It would be possible to loosen Assumption 2.1 to require only that \({{\mathcal {X}}}\) contains \(\{X_k\}\) almost surely. However, since this would require repeated references to probability-one events throughout our analysis without strengthening our conclusions substantially, we do not bother with this level of generality.
Under Assumption 2.1, there exists a tuple of constants, which we denote by \((f_{\inf }, \kappa _{\nabla f}, \kappa _c, \kappa _{\nabla c}, r) \in \mathbb {R} \times \mathbb {R} _{>0} \times \mathbb {R} _{>0} \times \mathbb {R} _{>0} \times \mathbb {R} _{>0}\), such that
$$\begin{aligned} f(x) \ge f_{\inf }, \quad \Vert \nabla f(x)\Vert _2 \le \kappa _{\nabla f}, \quad \Vert c(x)\Vert _2 \le \kappa _c, \quad \Vert \nabla c(x)\Vert _2 \le \kappa _{\nabla c}, \quad \text {and} \quad \sigma _{\min }(\nabla c(x)) \ge r \end{aligned}$$
(5)
for all \(x \in {{\mathcal {X}}}\), and there exists \((L_{\nabla f}, L_c, L_{\nabla c}) \in \mathbb {R} _{>0} \times \mathbb {R} _{>0} \times \mathbb {R} _{>0}\) such that
$$\begin{aligned} \Vert \nabla f(x) - \nabla f(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x})\Vert _2 \le L_{\nabla f} \Vert x - \hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\Vert _2, \quad \Vert c(x) - c(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x})\Vert _2 \le L_c \Vert x - \hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\Vert _2, \quad \text {and} \quad \Vert \nabla c(x) - \nabla c(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x})\Vert _2 \le L_{\nabla c} \Vert x - \hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\Vert _2 \end{aligned}$$
(6)
for all \((x,\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}) \in {{\mathcal {X}}}\times {{\mathcal {X}}}\). The Lagrangian function \({{\mathcal {L}}}: \mathbb {R}^{n} \times {\mathbb {R}}^m \rightarrow \mathbb {R} \) corresponding to problem (4) is defined by \({{\mathcal {L}}}(x,y) = f(x) + c(x)^Ty\), and the first-order stationarity conditions for (4) are given by
$$\begin{aligned} 0 = \begin{bmatrix} \nabla _x {{\mathcal {L}}}(x,y) \\ \nabla _y {{\mathcal {L}}}(x,y) \end{bmatrix} = \begin{bmatrix} \nabla f(x) + \nabla c(x) y \\ c(x) \end{bmatrix}. \end{aligned}$$
(7)
We refer to any \((x,y) \in \mathbb {R}^{n} \times \mathbb {R}^{m}\) satisfying (7) as a stationary point for problem (4). In addition, we refer to any \(x \in \mathbb {R}^{n}\) such that there exists \(y \in \mathbb {R}^{m}\) with (x, y) satisfying (7) as a primal stationary point for (4).
Let us now describe the algorithm whose convergence properties are the subject of our study. We state the algorithm in terms of a realization of the quantities it generates. At an iterate \(x_k \in \mathbb {R}^{n}\), the algorithm generates a stochastic gradient estimate \(g_k \approx \nabla f(x_k) \in \mathbb {R}^{n}\) and makes use of a scaling matrix \(h_k \in {\mathbb {S}}^n\). (See upcoming Assumptions 2.3 and 2.4 about these quantities, which essentially state that the stochastic gradient estimators are unbiased and have bounded variance, and that in each iteration the scaling matrix is positive definite on the null space of the constraint Jacobian. We state these assumptions later, once we formally introduce the stochastic process defined by the algorithm.) Given \(g_k\) and \(h_k\), a direction \(d_k \in \mathbb {R}^{n}\) is computed by solving
$$\begin{aligned} \min _{d \in \mathbb {R}^{n}}\ f(x_k) + g_k^Td + \tfrac{1}{2} d^Th_kd \ \ \text {subject to}\ \ c(x_k) + \nabla c(x_k)^Td = 0. \end{aligned}$$
(8)
Under upcoming Assumption 2.4 (which includes that \(d^Th_kd > 0\) for all \(d \in {{\,\textrm{Null}\,}}(\nabla c(x_k)^T)\)), the solution \(d_k\) of (8) as well as a Lagrange multiplier for the constraints, call it \(y_k \in \mathbb {R}^{m}\), can be obtained by solving
$$\begin{aligned} \begin{bmatrix} h_k & \nabla c(x_k) \\ \nabla c(x_k)^T & 0 \end{bmatrix} \begin{bmatrix} d_k \\ y_k \end{bmatrix} = - \begin{bmatrix} g_k \\ c(x_k) \end{bmatrix}. \end{aligned}$$
(9)
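In code, the computation of \((d_k, y_k)\) from (9) amounts to a single symmetric linear solve. The following is a minimal sketch of ours (the function name and argument conventions are illustrative assumptions), written for a Jacobian \(J = \nabla c(x_k)^T\) with full row rank:

```python
# Solve the SQP system (9) for the search direction d and multiplier y.
import numpy as np

def sqp_step(g, h, c, J):
    """g: stochastic gradient (n,); h: scaling matrix (n, n);
    c: constraint value (m,); J: constraint Jacobian grad c(x)^T (m, n)."""
    n, m = h.shape[0], c.shape[0]
    K = np.block([[h, J.T], [J, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([g, c]))
    return sol[:n], sol[n:]                 # d_k, y_k
```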
Upon computation of the search direction in iteration \(k \in \mathbb {N} \), the algorithm selects a step size \(\alpha _k \in (0,1]\). Specifically, with a merit parameter \(\tau \in \mathbb {R} _{>0}\), ratio parameter \(\xi \in \mathbb {R} _{>0}\) (see upcoming Assumption 2.2), and \((L_{\nabla f},L_{\nabla c})\) from (6), the algorithm that we analyze selects the step size for all \(k \in \mathbb {N} \) as
$$\begin{aligned} \alpha _k {:}{=} \frac{\beta _k \xi \tau }{\tau L_{\nabla f} + L_{\nabla c}}, \end{aligned}$$
(10)
where \(\{\beta _k\}\) is unsummable, but square-summable. This choice of step sizes means that the algorithm that we analyze is a simplified variant of the algorithm from [3]; see [3] and below for further discussion about the merit parameter, ratio parameter, and the particular formula for the step size stated in (10). (We conjecture that our ultimate conclusions also hold for the original variant of the algorithm from [3] in the event considered in Section 3.2.1 of that paper, but for our aims in this paper we merely consider the simplified variant presented here in order to allow our analysis not to be obscured by auxiliary details.) One aspect that makes our variant a simplified one is that we assume that the merit and ratio parameters are initialized to values that are sufficiently small such that the algorithm does not need to update them adaptively. This means that, for our analysis, we can consider the merit function (with fixed merit parameter \(\tau \)) \(\phi _\tau : \mathbb {R}^{n} \rightarrow \mathbb {R} \) defined by \(\phi _\tau (x) = \tau f(x) + \Vert c(x)\Vert _1\). (Any convex norm can be used for the constraint violation. We use the \(\ell _1\)-norm for consistency with [3].) A local approximation of \(\phi _\tau \) at \(x \in \mathbb {R}^{n}\) can be defined through a function \(q_\tau :\mathbb {R}^{n} \times \mathbb {R}^{n} \times {\mathbb {S}}^n \times \mathbb {R}^{n} \rightarrow \mathbb {R} \); specifically, with \(g \in \mathbb {R}^{n}\) and \(h \in {\mathbb {S}}^n\), a local approximation of \(\phi _\tau \) at x as a function of d is given by
$$\begin{aligned} q_\tau (x,g,h,d) = \tau (f(x) + g^Td + \tfrac{1}{2} \max \{d^Thd, 0\}) + \Vert c(x) + \nabla c(x)^Td\Vert _1. \end{aligned}$$
(11)
The reduction in \(q_\tau (x,g,h,\cdot )\) offered by d with \(c(x) + \nabla c(x)^Td = 0\) can be defined through \(\varDelta q_\tau :\mathbb {R}^{n} \times \mathbb {R}^{n} \times {\mathbb {S}}^n \times \mathbb {R}^{n} \rightarrow \mathbb {R} \) defined by
$$\begin{aligned} \varDelta q_\tau (x,g,h,d) {:}{=} q_\tau (x,g,h,0) - q_\tau (x,g,h,d) = -\tau (g^Td + \tfrac{1}{2} \max \{d^Thd, 0\}) + \Vert c(x)\Vert _1. \end{aligned}$$
(12)
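As a small illustration (a sketch of ours, assuming the form of (12) above), the model reduction for a direction satisfying the linearized constraints can be evaluated as follows:

```python
# Model reduction Delta q_tau(x, g, h, d) from (12), valid for d with
# c(x) + grad c(x)^T d = 0; arguments are numpy arrays and a scalar tau.
import numpy as np

def model_reduction(tau, g, h, d, c):
    return -tau * (g @ d + 0.5 * max(d @ h @ d, 0.0)) + np.linalg.norm(c, 1)
```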
Consistent with [3, Section 3.2.1], we make the following assumption.
Assumption 2.2
The parameters \(\xi \in \mathbb {R} _{>0}\) and \(\tau \in \mathbb {R} _{>0}\) are chosen such that, for some \(\nu \in (0,1)\) and all \(k \in \mathbb {N} \) in any run, the following hold.
-
(i)
\(\xi \le \xi _k^{\textrm{trial}}\), where one defines
$$\begin{aligned} \xi _k^{\textrm{trial}} {:}{=} {\left\{ \begin{array}{ll} \infty & \text {if}\, d_k = 0 \\ \tfrac{\varDelta q_\tau (x_k, g_k, h_k, d_k)}{\tau \Vert d_k\Vert _2^2} & \text {otherwise;} \end{array}\right. } \end{aligned}$$ -
(ii)
\(\tau \le \tau _k^{\textrm{trial, true}}\), where, with \(d_k^\textrm{true}\) being the solution of (8) if \(g_k\) were replaced by \(\nabla f(x_k)\) and \(\rho _k {:}{=} \nabla f(x_k)^Td_k^\textrm{true}+ \max \{(d_k^\textrm{true})^Th_kd_k^\textrm{true}, 0\}\), one defines
$$\begin{aligned} \tau _k^{\mathrm{trial,\, true}} {:}{=} {\left\{ \begin{array}{ll} \infty & \text {if}\, \rho _k \le 0 \\ \displaystyle \tfrac{(1 - \nu )\Vert c(x_k)\Vert _1}{\rho _k} & \text {otherwise.} \end{array}\right. } \end{aligned}$$
The definition of \(\tau _k^{\mathrm{trial,\, true}}\) and Assumption 2.2 ensure for all \(k \in \mathbb {N} \) that
$$\begin{aligned} \varDelta q_\tau (x_k, \nabla f(x_k), h_k, d_k^\textrm{true}) \ge \tfrac{1}{2} \tau \max \{(d_k^\textrm{true})^Th_kd_k^\textrm{true}, 0\} + \nu \Vert c(x_k)\Vert _1. \end{aligned}$$
(13)
The stochastic SQP algorithm that we study is stated in Algorithm 1.
Algorithm 1 has been simplified from [3, Algorithm 2.1] in two main ways. First, in [3, Algorithm 2.1], the parameters \(\tau \) and \(\xi \) are not fixed constants; rather, the roles played by these quantities are played by elements of sequences \(\{\tau _k\}\) and \(\{\xi _k\}\) that are generated adaptively by the algorithm to be monotonically nonincreasing. Under an event of the type considered in [3, Section 3.2.1] (see also [2, Section 4.1]), these sequences eventually remain constant. Therefore, in Algorithm 1, to focus on analyzing the convergence properties of \(\{X_k\}\) and \(\{Y_k\}\) as \(k \rightarrow \infty \), we assume that the inputs \(\tau \) and \(\xi \) satisfy Assumption 2.2, in which case the parameter updating strategies in [3] would not trigger a change in these values. One finds from [3] that this setting ensures that the step from \(x_k\) to \(x_{k+1}\) yields a sufficient-decrease-type property that is relevant for our analysis; see upcoming Lemma 2.1. Second, the step size \(\alpha _k\) in [3, Algorithm 2.1] is set adaptively based on the quantities \(\varDelta q_\tau (x_k, g_k, h_k, d_k)\), \(\Vert c(x_k)\Vert _1\), and \(\Vert d_k\Vert _2^2\), then projected onto the interval \(\big [\tfrac{\beta _k \xi _k \tau _k }{\tau _k L_{\nabla f} + L_{\nabla c}}, \tfrac{\beta _k \xi _k \tau _k }{\tau _k L_{\nabla f} + L_{\nabla c}} + \theta \beta _k^2\big ]\) for some \(\theta \in \mathbb {R}_{\ge 0}\). This adaptive choice yields practical benefits, since it allows relatively large step sizes when \(\theta \) is chosen to be large. However, to simplify the analysis for our setting, we set the step size based on (10), which amounts to the use of this rule only with \(\theta = 0\).
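Putting the pieces together, a run of the simplified variant can be sketched as follows (an illustrative skeleton of ours, not a definitive implementation of [3, Algorithm 2.1]; the oracle functions for the stochastic gradient, constraints, Jacobian, and scaling matrix are assumed user-supplied, and `sqp_step` is the sketch given after (9)):

```python
# Skeleton of the simplified Algorithm 1 with the fixed step-size rule (10).
def stochastic_sqp(x, grad_est, c_fun, jac_fun, h_fun,
                   tau, xi, L_grad_f, L_grad_c, iters=1000):
    for k in range(1, iters + 1):
        g = grad_est(x)                         # stochastic gradient g_k
        h, J = h_fun(x), jac_fun(x)             # scaling matrix and Jacobian
        d, y = sqp_step(g, h, c_fun(x), J)      # solve (9); see earlier sketch
        beta_k = 1.0 / k                        # unsummable, square-summable
        alpha_k = beta_k * xi * tau / (tau * L_grad_f + L_grad_c)   # rule (10)
        x = x + alpha_k * d
    return x, y
```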
All that remains before presenting our analysis is to articulate some final assumptions and state a few key relations from [3]. Considering the outcomes of entire runs of Algorithm 1, henceforth we consider the probability space \((\varOmega , {{\mathcal {F}}}, \mathbb {P})\), where \(\varOmega {:}{=} \prod _{k=1}^\infty \varOmega _\iota \). In this manner, each realization of a run of the algorithm can be associated with \(\omega \in \varOmega \), an infinite-dimensional tuple whose kth element is \(\omega _k \in \varOmega _\iota \), which determines the stochastic gradient estimate. The stochastic process defined by Algorithm 1 can thus be expressed as
$$\begin{aligned} \{(X_k, G_k, H_k, D_k, D_k^\textrm{true}, Y_k, Y_k^\textrm{true})\}, \end{aligned}$$
where, for all \(k \in \mathbb {N} \), the random variables are the iterate \(X_k(\omega )\), stochastic gradient estimator \(G_k(\omega )\), symmetric matrix \(H_k(\omega )\), search direction \(D_k(\omega )\), true search direction \(D_k^\textrm{true}(\omega )\), Lagrange multiplier \(Y_k(\omega )\), and true Lagrange multiplier \(Y_k^\textrm{true}(\omega )\). (As in Assumption 2.2, the true search direction and Lagrange multiplier are defined by the solution of (9) if the stochastic gradient were replaced by the true gradient.) For instance, given \(\omega \) that specifies a particular outcome of a run of the algorithm, the quantity \(X_k(\omega ) \in \mathbb {R}^{n}\) is the kth iterate. Given initial conditions (including that \(X_1(\omega ) = x_1 \in \mathbb {R}^{n}\) for all \(\omega \in \varOmega \)), let \({{{\mathcal {F}}}}_1\) denote the \(\sigma \)-algebra corresponding to the initial conditions and, for all \(k \in \mathbb {N} \setminus \{1\}\), let \({{{\mathcal {F}}}}_k\) denote the \(\sigma \)-algebra defined by the initial conditions and the random variables \(\{G_1,\dots ,G_{k-1}\}\). We assume the following.
Assumption 2.3
For all \(k \in \mathbb {N} \), the stochastic gradient estimator satisfies \(\mathbb {E}[G_k | {{{\mathcal {F}}}}_k] = \nabla f(X_k)\). In addition, there exists \(\sigma \in \mathbb {R} _{>0}\) such that, for all \(k \in \mathbb {N} \), it holds that \(\mathbb {E}[\Vert G_k - \nabla f(X_k)\Vert _2^2 | {{\mathcal {F}}}_k] \le \sigma ^2\).
Assumption 2.4
For all \(k \in \mathbb {N} \), the matrix \(H_k \in {\mathbb {S}}^n\) is \({{\mathcal {F}}}_k\)-measurable and bounded in \(\ell _2\)-norm by \(\kappa _H \in \mathbb {R} _{>0}\). In addition, there exists \(\zeta \in (0, \kappa _H]\) such that, for all \(k \in \mathbb {N} \), \(u^TH_ku \ge \zeta \Vert u\Vert _2^2\) for all \(u \in {{\,\textrm{Null}\,}}(\nabla c(X_k)^T)\).
Assumption 2.3 is common in the literature. It is well known that for stochastic-gradient-based methods, such an assumption can typically be relaxed without fundamentally changing the convergence properties that can be proved about the corresponding algorithm; see, e.g., [9], where a variety of convergence guarantees are shown for a stochastic-gradient-based method when the assumed inequality is relaxed by adding a factor times \(\Vert \nabla f(X_k)\Vert _2^2\) to the right-hand side. Since such relaxations are not our focus, we employ Assumption 2.3 here, which holds in various settings, especially those of real-world interest, such as when the objective function is a finite sum of Lipschitz functions. Under Assumptions 2.3 and 2.4, it follows that \(\{{{\mathcal {F}}}_k\}\) is a filtration for the probability space \((\varOmega , {{\mathcal {F}}}, \mathbb {P})\). In particular, the initial conditions and a realization of \(\{G_1,\dots ,G_{k-1}\}\) determine the realizations of \(\{(X_j, H_j, D_j^\textrm{true}, Y_j^\textrm{true})\}_{j=1}^k\) and \(\{(D_j,Y_j)\}_{j=1}^{k-1}\), i.e., for all \(k \in \mathbb {N} \), one has that \((X_k,H_k,D_k^\textrm{true},Y_k^\textrm{true})\) is \({{\mathcal {F}}}_k\)-measurable while \((G_k,D_k,Y_k)\) is \({{\mathcal {F}}}_{k+1}\)-measurable. Examples of symmetric matrices satisfying Assumption 2.4 for each \(k \in \mathbb {N} \) are \(H_k = I\) or \(H_k\) being (an approximation of) the Hessian of the Lagrangian at \((X_k,Y_{k-1})\) as long as safeguards are included to ensure that \(H_k\) is sufficiently positive definite in the null space of \(\nabla c(X_k)^T\), as required by Assumption 2.4.
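For instance, the reduced-space condition of Assumption 2.4 can be checked numerically, as in the following sketch of ours (an illustration only; `H` and `J` are assumed given, with `J` of full row rank):

```python
# Check u^T H u >= zeta * ||u||^2 over Null(grad c(x)^T) via the reduced Hessian.
import numpy as np

def reduced_curvature(H, J):
    m = J.shape[0]
    Z = np.linalg.svd(J)[2][m:].T          # orthonormal basis of Null(J)
    return np.linalg.eigvalsh(Z.T @ H @ Z).min()   # must be >= zeta > 0
```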
The following lemma characterizes decreases of the merit function \(\phi _\tau \) and the boundedness of \(\varDelta q_\tau \); it is adapted from Lemmas 3.7, 3.8, and 3.12 in [3].
Lemma 2.1
For all \(k \in \mathbb {N} \), it follows that \(\mathbb {E}[D_k | {{\mathcal {F}}}_k] = D_k^\textrm{true}\),
and
3 Convergence of Primal Iterates
One can derive from the analysis in [3] that, under Assumptions 2.1, 2.2, 2.3, and 2.4, Algorithm 1 yields (recall the stationarity conditions (7))
$$\begin{aligned} \liminf _{k \rightarrow \infty }\ \mathbb {E}[\Vert \nabla f(X_k) + \nabla c(X_k) Y_k^\textrm{true}\Vert _2^2 + \Vert c(X_k)\Vert _2] = 0; \end{aligned}$$
(16)
see [3, Corollary 3.14] for further details. This lower limit in (16) does not imply that the primal iterate sequence converges in any particular sense to a primal stationary point. In this section, we establish conditions under which almost-sure convergence of the primal iterate sequence can be guaranteed, which when coupled with the results of the next section (in particular, Corollary 4.3) provides a rich picture of the convergence behavior of Algorithm 1.
Before presenting the results of this section, let us state a common decomposition of the search direction that is used in our analysis. Observe by the Fundamental Theorem of Linear Algebra that \(D_k\) can be decomposed as \(D_k = U_k + V_k\), where \(U_k \in {{\,\textrm{Null}\,}}(\nabla c(X_k)^T)\) and \(V_k \in {{\,\textrm{Range}\,}}(\nabla c(X_k))\). Under Assumption 2.1, we can now introduce \(Z_k \in \mathbb {R}^{n \times (n-m)}\) as a matrix whose columns form a basis for \({{\,\textrm{Null}\,}}(\nabla c(X_k)^T)\). It follows that \(U_k = Z_k W_k\) for some \(W_k \in \mathbb {R}^{n-m}\). Now, from (9), it follows that
$$\begin{aligned} \nabla c(X_k)^T V_k = \nabla c(X_k)^T D_k = -c(X_k). \end{aligned}$$
Since \(\nabla c(X_k)^T\) has full row rank under Assumption 2.1, the unique \(V_k \in {{\,\textrm{Range}\,}}(\nabla c(X_k))\) that solves this linear system is given by
$$\begin{aligned} V_k = -(\nabla c(X_k)^\dagger )^T c(X_k). \end{aligned}$$
(17)
Hence, from (9) and (17), it follows under Assumption 2.4 that
$$\begin{aligned}&H_k (Z_k W_k + V_k) + \nabla c(X_k) Y_k = -G_k \\ \implies \ &Z_k^T H_k (Z_k W_k + V_k) + Z_k^T \nabla c(X_k) Y_k = -Z_k^T G_k \\ \implies \ &W_k = -(Z_k^T H_k Z_k)^{-1} Z_k^T (G_k + H_k V_k), \end{aligned}$$
(18)
where from the second line to the third, we use the facts that \(\nabla c(X_k)^T Z_k = 0\) and that \(Z_k^T H_k Z_k\) is invertible under Assumption 2.4.
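The decomposition can be verified numerically against a direct solve of (9), as in the following self-contained sketch (random illustrative data of ours; the simple choice \(H = I\) satisfies Assumption 2.4):

```python
# Verify d = Z w + v with v from (17) and w from (18) on random data.
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 2
J = rng.standard_normal((m, n))            # J = grad c(x)^T, full row rank
g, c = rng.standard_normal(n), rng.standard_normal(m)
H = np.eye(n)                              # scaling matrix

K = np.block([[H, J.T], [J, np.zeros((m, m))]])
d = np.linalg.solve(K, -np.concatenate([g, c]))[:n]       # direct solve of (9)

v = -np.linalg.pinv(J) @ c                 # (17): v = -(grad c ^dagger)^T c
Z = np.linalg.svd(J)[2][m:].T              # orthonormal basis of Null(J)
w = -np.linalg.solve(Z.T @ H @ Z, Z.T @ (g + H @ v))      # (18)
assert np.allclose(d, Z @ w + v)           # the two computations agree
```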
Our main result of this section is Theorem 3.1. Leading up to that result, we require two technical lemmas that are presented next. The first, Lemma 3.1 below, establishes a result about the asymptotic behavior of the sequence of merit function values, namely, \(\{\phi _\tau (X_k)\}\), and about the sequence of reductions in the model of the merit function corresponding to the true objective gradients and corresponding search directions. Note that the proof of the lemma uses the fact that the sequence \(\{\beta _k\}\) employed in Algorithm 1 is unsummable (i.e., \(\sum _{k=1}^\infty \beta _k = \infty \)), but square-summable (i.e., \(\sum _{k=1}^\infty \beta _k^2 < \infty \)).
Lemma 3.1
Suppose that Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Then, almost surely, the limit \(\displaystyle \lim _{k \rightarrow \infty } \phi _\tau (X_k)\) exists and is finite and
$$\begin{aligned} \liminf _{k \rightarrow \infty }\ \varDelta q_\tau (X_k, \nabla f(X_k), H_k, D_k^\textrm{true}) = 0. \end{aligned}$$
Proof
For all \(k \in \mathbb {N} \), define \(R_k {:}{=} \phi _\tau (X_k) - \tau f_{\inf }\), and observe from Assumption 2.1 that \(R_k \ge 0\) for all \(k \in \mathbb {N} \). Also, for arbitrary \(k \in \mathbb {N} \), one finds
where the first inequality follows from Lemma 2.1, the second follows from (15), and the third follows from (13) and the fact that \(\{\beta _k\} \subset (0,1]\). Now define
and observe that \(P_k \ge 0\) for all \(k \in \mathbb {N} \) follows by (13), \(Q_k \ge 0\) for all \(k \in \mathbb {N} \) follows since \(\{\beta _k\} \subset (0,1]\), and \(\sum _{k=1}^\infty Q_k < \infty \) since \(\sum _{k=1}^\infty \beta _k^2 < \infty \). Therefore, with (19), applying Lemma 1.2 shows that, almost surely,
and \(\lim _{k \rightarrow \infty } R_k = \lim _{k \rightarrow \infty } (\phi _\tau (X_k) - \tau f_{\inf })\) exists and is finite. The latter fact leads directly to the first desired conclusion, whereas the former fact along with \(\sum _{k=1}^\infty \beta _k = \infty \) (so \(\sum _{k=1}^\infty \alpha _k=\infty \)) yields the second. \(\square \)
The components of our second technical lemma can be derived from various results from [3], namely, Lemmas 2.10, 2.11, 2.12, and 3.4 in [3]. Nonetheless, we provide a proof of the lemma for the sake of completeness.
Lemma 3.2
Suppose that Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Then, with respect to the decomposition \(D_k^\textrm{true}= U_k^\textrm{true}+ V_k\) for all \(k \in \mathbb {N} \), where \(U_k^\textrm{true}\in {{\,\textrm{Null}\,}}(\nabla c(X_k)^T)\) and \(V_k \in {{\,\textrm{Range}\,}}(\nabla c(X_k))\), the following holds.
-
(a)
There exists \(\kappa _{uv} \in \mathbb {R} _{>0}\) such that, for all \(k \in \mathbb {N} \),
$$\begin{aligned} \Vert U_k^\textrm{true}\Vert _2^2 \ge \kappa _{uv} \Vert V_k\Vert _2^2 \implies (D_k^\textrm{true})^T H_k D_k^\textrm{true}\ge \tfrac{1}{2}\zeta \Vert U_k^\textrm{true}\Vert _2^2. \end{aligned}$$
In addition, with \(\kappa _{uv} \in \mathbb {R} _{>0}\) from part (a) and defining, for all \(k \in \mathbb {N} \),
$$\begin{aligned} \varPsi _k {:}{=} {\left\{ \begin{array}{ll} \Vert U_k^\textrm{true}\Vert _2^2 + \Vert c(X_k)\Vert _2 & \text {if}\ \Vert U_k^\textrm{true}\Vert _2^2 \ge \kappa _{uv} \Vert V_k\Vert _2^2 \\ \Vert c(X_k)\Vert _2 & \text {otherwise,} \end{array}\right. } \end{aligned}$$
the following statements hold.
-
(b)
There exists \(\kappa _\varPsi \in \mathbb {R} _{>0}\) such that, for all \(k \in \mathbb {N} \),
$$\begin{aligned} \Vert D_k^\textrm{true}\Vert _2^2 + \Vert c(X_k)\Vert _2 \le (\kappa _\varPsi + 1) \varPsi _k. \end{aligned}$$ -
(c)
There exists \(\kappa _q \in \mathbb {R} _{>0}\) such that, for all \(k \in \mathbb {N} \),
$$\begin{aligned} \varDelta q_\tau (X_k, \nabla f(X_k), H_k, D_k^\textrm{true}) \ge \kappa _q \varPsi _k. \end{aligned}$$
Proof
We prove each part in turn.
-
(a)
For arbitrary \(k \in \mathbb {N} \) and \(\kappa \in \mathbb {R} _{>0}\), one finds under Assumption 2.4 that \(\Vert U_k^\textrm{true}\Vert _2^2 \ge \kappa \Vert V_k\Vert _2^2\) implies
$$\begin{aligned} (D_k^\textrm{true})^T H_k D_k^\textrm{true}&= (U_k^\textrm{true})^T H_k U_k^\textrm{true}+ 2(U_k^\textrm{true})^T H_k V_k + V_k^T H_k V_k \\&\ge \zeta \Vert U_k^\textrm{true}\Vert _2^2 - 2\Vert U_k^\textrm{true}\Vert _2 \Vert H_k\Vert _2 \Vert V_k\Vert _2 - \Vert H_k\Vert _2 \Vert V_k\Vert _2^2 \\&\ge (\zeta - \tfrac{2\kappa _H}{\sqrt{\kappa }} - \tfrac{\kappa _H}{\kappa }) \Vert U_k^\textrm{true}\Vert _2^2. \end{aligned}$$Thus, the desired result holds for any \(\kappa _{uv} \in \mathbb {R} _{>0}\) such that \(\tfrac{2\kappa _H}{\sqrt{\kappa _{uv}}} + \tfrac{\kappa _H}{\kappa _{uv}} \le \tfrac{\zeta }{2}\).
-
(b)
If \(\Vert U_k^\textrm{true}\Vert _2^2 \ge \kappa _{uv} \Vert V_k\Vert _2^2\), it follows that
$$\begin{aligned} \Vert D_k^\textrm{true}\Vert _2^2 = \Vert U_k^\textrm{true}\Vert _2^2 + \Vert V_k\Vert _2^2&\le (1+\kappa _{uv}^{-1}) \Vert U_k^\textrm{true}\Vert _2^2 \\&\le (1+\kappa _{uv}^{-1}) (\Vert U_k^\textrm{true}\Vert _2^2 + \Vert c(X_k)\Vert _2). \end{aligned}$$Otherwise, with (17) and Assumption 2.1, it follows that
$$\begin{aligned} \Vert D_k^\textrm{true}\Vert _2^2 = \Vert U_k^\textrm{true}\Vert _2^2 + \Vert V_k\Vert _2^2&< (\kappa _{uv}+1) \Vert V_k\Vert _2^2 \\&= (\kappa _{uv}+1) \Vert (\nabla c(X_k)^\dagger )^T c(X_k)\Vert _2^2 \\&\le \kappa _c (\kappa _{uv}+1) r^{-2} \Vert c(X_k)\Vert _2. \end{aligned}$$Hence, \(\Vert D_k^\textrm{true}\Vert _2^2 \le \kappa _\varPsi \varPsi _k\) holds with \(\kappa _\varPsi {:}{=} \max \{1 + \kappa _{uv}^{-1}, \kappa _c (\kappa _{uv}+1) r^{-2}\}\). The desired conclusion then follows since, by definition, \(\varPsi _k \ge \Vert c(X_k)\Vert _2\).
-
(c)
By (13) and part (a), it follows that \(\Vert U_k^\textrm{true}\Vert _2^2 \ge \kappa _{uv} \Vert V_k\Vert _2^2\) implies
$$\begin{aligned} \varDelta q_\tau (X_k, \nabla f(X_k), H_k, D_k^\textrm{true}) \ge \tfrac{1}{4} \tau \zeta \Vert U_k^\textrm{true}\Vert _2^2 + \nu \Vert c(X_k)\Vert _1, \end{aligned}$$and otherwise one still finds \(\varDelta q_\tau (X_k, \nabla f(X_k), H_k, D_k^\textrm{true}) \ge \nu \Vert c(X_k)\Vert _1\). Hence, since \(\Vert \cdot \Vert _2 \le \Vert \cdot \Vert _1\), the result follows with \(\kappa _q {:}{=} \min \{\tfrac{1}{4} \tau \zeta , \nu \}\). \(\square \)
We are now prepared to prove our main theorem of this section. The theorem establishes conditions that guarantee the almost-sure convergence of the primal iterate sequence \(\{X_k\}\) to a primal stationary point \(x_\star \). (In fact, under the conditions of the theorem, \(x_\star \) is a local minimizer of problem (4).) This is done by showing that the conditions of the theorem guarantee the almost-sure convergence of \(\{\phi _\tau (X_k)\}\) to \(\phi _\tau (x_\star )\), where \(x_\star \) is a strict local minimizer of the merit function \(\phi _\tau \). After proving the result, we provide some additional commentary on the inequality (20) that is required for the theorem.
Theorem 3.1
Suppose that Assumptions 2.1, 2.2, 2.3, and 2.4 hold. In addition, suppose that there exists \(x_\star \in {{\mathcal {X}}}\) with \(c(x_\star ) = 0\), \(\mu \in \mathbb {R} _{>1}\), and \(\epsilon \in \mathbb {R} _{>0}\) such that for all \(x \in {{\mathcal {X}}}_{\epsilon ,x_\star } {:}{=} \{x \in {{\mathcal {X}}}: \Vert x - x_\star \Vert _2 \le \epsilon \}\) one finds
where for all \(x \in {{\mathcal {X}}}_{\epsilon ,x_\star }\) one defines \(Z(x) \in \mathbb {R}^{n \times (n-m)}\) as some orthonormal matrix whose columns form a basis for the null space of \(\nabla c(x)^T\). Then, if \(\displaystyle \limsup _{k\rightarrow \infty } \{\Vert X_k - x_\star \Vert _2\} \le \epsilon \) almost surely, it follows that
$$\begin{aligned} \{\phi _\tau (X_k)\} \xrightarrow {{a.s.}}\phi _\tau (x_\star ) \quad \text {and} \quad \{X_k\} \xrightarrow {{a.s.}}x_\star . \end{aligned}$$
Proof
As before, for all \(k \in \mathbb {N} \), let \(Z_k \in \mathbb {R}^{n \times (n-m)}\) denote an orthonormal matrix whose columns form a basis for \({{\,\textrm{Null}\,}}(\nabla c(X_k)^T)\), where if \(X_k \in {{\mathcal {X}}}_{\epsilon ,x_\star }\) then \(Z_k = Z(X_k)\) is one such that (20) holds at \(x = X_k\). For arbitrary \(k \in \mathbb {N} \), it follows from the first equation in (9) that \(H_k D_k^\textrm{true}+ \nabla c(X_k) Y_k^\textrm{true}= -\nabla f(X_k)\). Using the decomposition \(D_k^\textrm{true}= U_k^\textrm{true}+ V_k\) and the fact that \(Z_k^T\nabla c(X_k) = 0\), one finds by left-multiplying this equation by \(Z_k^T\) that
$$\begin{aligned} Z_k^T H_k (U_k^\textrm{true}+ V_k) = -Z_k^T \nabla f(X_k). \end{aligned}$$
It then follows with (17), (18), and Assumptions 2.1 and 2.4 that
Combining this with (20) and Lemma 3.2, one finds that \(X_k \in {{\mathcal {X}}}_{\epsilon ,x_\star }\) implies
Since by Lemma 3.1 one has \(\displaystyle \liminf _{k\rightarrow \infty } \varDelta q_\tau (X_k, \nabla f(X_k), H_k, D_k^\textrm{true}) = 0\) almost surely, it follows from above and the conditions of the theorem that the limit \(\displaystyle \lim _{k \rightarrow \infty } \phi _\tau (X_k)\) (which exists almost surely by Lemma 3.1) must be \(\phi _\tau (x_\star )\), as desired. The remaining conclusions follow from \(\{\phi _\tau (X_k)\} \xrightarrow {{a.s.}}\phi _\tau (x_\star )\) and the facts that, by (20), the point \(x_\star \) is the unique point in \({{\mathcal {X}}}_{\epsilon ,x_\star }\) whose merit function value equals \(\phi _\tau (x_\star )\) and that \(x_\star \) is a primal stationary point. \(\square \)
Observe that (20) can be viewed as a generalization of the well-known Polyak–Łojasiewicz condition from the unconstrained continuous optimization literature [31]. Indeed, if the constraints are affine and one considers x such that \(x - x_\star \in {{\,\textrm{Null}\,}}(\nabla c(x_\star )^T)\), then (20) says that the squared \(\ell _2\)-norm of the reduced gradient \(Z(x)^T\nabla f(x)\) is at least proportional to \(\phi _\tau (x) - \phi _\tau (x_\star )\). On the other hand, if the objective function remains constant along displacements in \({{\,\textrm{Range}\,}}(\nabla c(x))\), then (20) says that the norm of the constraint violation is at least proportional to this difference in merit function values. More generally, such as for nonlinear f and c, the condition (20) is a generalization of these special cases that says that the combination of the squared \(\ell _2\)-norm of the reduced gradient and constraint violation is at least proportional to the optimality gap in the merit function in a neighborhood of the point \(x_\star \).
4 Convergence of Lagrange Multipliers
In this section, we study convergence properties of the sequence \(\{Y_k\}\) generated by Algorithm 1. Let us begin by expressing the solution component \(Y_k\) of (9) for arbitrary \(k \in \mathbb {N} \) using the step decomposition stated in (17) and (18). Substituting (17) and (18) back into (9) gives
$$\begin{aligned} \nabla c(X_k) Y_k = -G_k - H_k(Z_k W_k + V_k) = -\big (I - H_k Z_k (Z_k^T H_k Z_k)^{-1} Z_k^T\big ) (G_k + H_k V_k), \end{aligned}$$
from which it follows under Assumption 2.1 that
$$\begin{aligned} Y_k = -\nabla c(X_k)^\dagger \big (I - H_k Z_k (Z_k^T H_k Z_k)^{-1} Z_k^T\big ) \big (G_k - H_k (\nabla c(X_k)^\dagger )^T c(X_k)\big ). \end{aligned}$$
(21)
For notational convenience, let us now define the matrix
$$\begin{aligned} M_k {:}{=} \nabla c(X_k)^\dagger \big (I - H_k Z_k (Z_k^T H_k Z_k)^{-1} Z_k^T\big ), \end{aligned}$$
(22)
which can be viewed as the product of a pseudoinverse and a projection matrix, so that we may succinctly write from (21) that
$$\begin{aligned} Y_k = M_k \big (H_k (\nabla c(X_k)^\dagger )^T c(X_k) - G_k\big ). \end{aligned}$$
(23)
(We remark in passing that an alternative to our subsequent discussions and analysis could be considered where, for all \(k \in \mathbb {N} \), the vector \(Y_k\) is not defined through (9), but rather is set as a so-called least squares multiplier, i.e., as \(Y_k = \arg \min _{Y\in \mathbb {R}^{m}} \Vert G_k + \nabla c(X_k)Y\Vert _2^2\). For one thing, this choice removes the dependence of \(Y_k\) on \(H_k\), which might have certain advantages. However, for our purposes, we focus on multipliers being computed through (9) since this is a popular approach in practice and does not require additional computation.)
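For comparison, the least-squares multiplier mentioned in the remark above can be computed directly from the gradient estimate and the constraint Jacobian, independently of \(H_k\). The following is a minimal sketch of ours (the function name is an illustrative assumption), written with \(J = \nabla c(X_k)^T\):

```python
# Least-squares multiplier: y = argmin_y || g + grad c(x) y ||_2.
import numpy as np

def least_squares_multiplier(g, J):
    y, *_ = np.linalg.lstsq(J.T, -g, rcond=None)   # J.T = grad c(x)
    return y
```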
Our goal in our analysis of properties of Lagrange multiplier estimators is to prove convergence of such estimators when the primal iterate approaches a primal stationary point for (4). After all, it is only when the primal iterate lies in such a neighborhood (or at least a neighborhood of the feasible region) that the Lagrange multiplier has meaning in terms of certifying stationarity. Consequently, a main focus in our results is how \(Y_k\) may be viewed through (23) as a function of the primal iterate \(X_k\). For our analysis here, we make the following assumption, where given \(x \in \mathbb {R}^{n}\) and \(\epsilon \in \mathbb {R} _{>0}\) we define (consistent with Theorem 3.1) the neighborhood \({{\mathcal {X}}}_{\epsilon ,x} {:}{=} \{\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\in \mathbb {R}^{n} : \Vert \hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}- x\Vert _2 \le \epsilon \}\).
Assumption 4.1
Given \(x_\star \in {{\mathcal {X}}}\) as a primal stationary point for (4), there exist \(\epsilon \in \mathbb {R} _{>0}\), \({{\mathcal {M}}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{m \times n}\), and \(L_{{\mathcal {M}}}\in \mathbb {R} _{>0}\) such that \(M_k = {{\mathcal {M}}}(X_k)\) when \(X_k \in {{\mathcal {X}}}_{\epsilon ,x_\star }\) and \(\Vert {{\mathcal {M}}}(x) - {{\mathcal {M}}}(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x})\Vert _2 \le L_{{\mathcal {M}}}\Vert x - \hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\Vert _2\) for all \((x,\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}) \in {{\mathcal {X}}}_{\epsilon ,x_\star } \times {{\mathcal {X}}}_{\epsilon ,x_\star }\).
A few important observations and justifications are in order.
-
Assumption 4.1 states that, in a neighborhood of a primal stationary point \(x_\star \), the matrix \(M_k\) is defined through a Lipschitz continuous function of \(X_k\). Since \(M_k\) in (22) is defined with respect to \(H_k\) (and \(\nabla c(X_k)\), as we discuss further below), it is reasonable to expect that Assumption 4.1 requires that \(H_k\) is set through a Lipschitz continuous function of \(X_k\) (at least in \({{\mathcal {X}}}_{\epsilon ,x_\star }\)). Indeed, there are multiple ways to set \(H_k\) through a Lipschitz continuous function of \(X_k\). Two examples are as follows. First, an easy choice is that the algorithm may choose \(H_k = H\) for all \(k \in \mathbb {N} \) for some prescribed \(H \in {\mathbb {S}}_{\succ 0}^n\) that guarantees that Assumption 2.4 holds for all \(k \in \mathbb {N} \). Second, if the second-order derivatives of f are Lipschitz continuous, then the algorithm can choose \(H_k\) as (an approximation of) the Hessian of the objective at \(X_k\), or even (an approximation of) the Hessian of the Lagrangian at \((X_k,\hspace{0.83328pt}\overline{\hspace{-0.83328pt}Y}_k)\) if the second-order derivatives of the components of c are also Lipschitz continuous and care is taken in the selection of \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}Y}_k\). In practice, one might consider \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}Y}_k = Y_{k-1}\) for all \(k \in \mathbb {N} \), but as one can see, this choice is influenced by noise in the stochastic gradient estimators, which can cause \(\{H_k\}\) to violate Assumption 2.4. An alternative choice that would not violate the assumption is to choose \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}Y}_k\) as a prescribed vector or at least one that remains fixed for all sufficiently large \(k \in \mathbb {N} \). In any case, one finds that there exist reasonable choices for the algorithm to set \(H_k\) through a Lipschitz continuous function of \(X_k\). (We remark in passing that deterministic Newton-based methods for solving constrained optimization problems can possess local convergence guarantees of the Lagrange multipliers under relatively strong assumptions when \(H_k\) is the Hessian of the Lagrangian at \((X_k,Y_{k-1})\), and this might also be achievable for a stochastic algorithm with highly accurate gradient estimates. However, since our algorithm of interest operates in a highly stochastic regime in which only Assumption 2.3 is assumed to hold, allowing unadulterated use of the Hessian of the Lagrangian at \((X_k,Y_{k-1})\) is not reasonable.)
-
Supposing that \(H_k\) is set through a Lipschitz continuous function of \(X_k\), and given Assumption 2.1 (which states that the constraint Jacobians have full row rank and the constraint Jacobian function is Lipschitz continuous over the set \({{\mathcal {X}}}\) containing the iterates), it might seem with Lemma 1.1 that Lipschitz continuity of \({{\mathcal {M}}}\) over \({{\mathcal {X}}}_{\epsilon ,x_\star }\) could follow as a consequence. However, one needs to be careful. To understand a potential issue when trying to draw such a conclusion, recall that a product of functions that are Lipschitz continuous on a bounded set is itself Lipschitz continuous on the set. Hence, one might attempt to justify our Lipschitz-continuity assumption about \({{\mathcal {M}}}\) by assuming that, for all \(k \in \mathbb {N} \), the null space basis \(Z_k\) can be viewed as being generated by a Lipschitz continuous function of \(X_k\). Note, however, that it has been shown in [11] that, even when \(\nabla c(\cdot )^T\) is continuous with values that have full row rank for all \(x \in \mathbb {R}^{n}\), there does not necessarily exist continuous \({{\mathcal {Z}}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{n \times (n-m)}\) such that for all \(x \in \mathbb {R}^{n}\) the columns of \({{\mathcal {Z}}}(x)\) form a basis for \({{\,\textrm{Null}\,}}(\nabla c(x)^T)\). That being said, as also shown in [11], it is possible to employ procedures such that the reduced Hessian (approximation), namely, \(Z_k^TH_kZ_k\), does not depend on the choice of \(Z_k\), and, given a point \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\in \mathbb {R}^{n}\) at which \(\nabla c(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x})^T\) has full row rank, there exists a neighborhood of \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}x}\) over which one can define a “null space basis function” that is continuous over the neighborhood. See also [10], which discusses two procedures that ensure that, in the neighborhood of a given point, a null space basis can be defined by a Lipschitz continuous null space basis function. Overall, due to these observations, we contend that assuming Lipschitz continuity of \({{\mathcal {M}}}\)—in a sufficiently small neighborhood of \(x_\star \), as is stated in Assumption 4.1—is justified for our analysis.
We are now prepared to prove that, near a primal stationary point, the expected error in the Lagrange multiplier estimator is bounded by the distance of the primal iterate to the primal stationary point plus an error due to the stochastic gradient estimator. The following theorem may be viewed as the main result of this section since the subsequent results follow from it.
Theorem 4.1
Suppose that Assumptions 2.1, 2.2, 2.3, 2.4, and 4.1 hold, \((x_\star ,y_\star )\) is a stationary point for problem (4), and \(\epsilon \in \mathbb {R} _{>0}\) is defined as in Assumption 4.1. Then, for any \(k \in \mathbb {N} \), one finds \(\Vert X_k - x_\star \Vert _2 \le \epsilon \) implies
$$\begin{aligned} \Vert Y_k - y_\star \Vert _2 \le \kappa _y \Vert X_k - x_\star \Vert _2 + r^{-1} \Vert G_k - \nabla f(X_k)\Vert _2, \end{aligned}$$
(24)
where \(\kappa _y {:}{=} \kappa _H L_c r^{-2} + L_{\nabla f} r^{-1} + \kappa _{\nabla f} L_{{\mathcal {M}}}\).
Proof
Under the conditions of the theorem, \(\Vert X_k - x_\star \Vert _2 \le \epsilon \) and (23) imply that \(Y_k = M_k(H_k(\nabla c(X_k)^\dagger )^Tc(X_k) - G_k)\), where \(M_k = {{\mathcal {M}}}(X_k)\), while one can similarly derive that \(y_\star = -M_\star \nabla f(x_\star )\), where \(M_\star = {{\mathcal {M}}}(x_\star )\). Hence, the difference \(Y_k - y_\star \) can be decomposed into three parts as follows:
$$\begin{aligned} Y_k - y_\star = M_k H_k (\nabla c(X_k)^\dagger )^T c(X_k) + M_k (\nabla f(x_\star ) - G_k) + (M_\star - M_k) \nabla f(x_\star ). \end{aligned}$$
(25)
Consequently, to bound \(\Vert Y_k-y_\star \Vert _2\) when \(\Vert X_k - x_\star \Vert _2 \le \epsilon \), one can employ the triangle inequality and bound the norms of the three terms on the right-hand side of (25) separately. First, one finds that
$$\begin{aligned} \Vert M_k H_k (\nabla c(X_k)^\dagger )^T c(X_k)\Vert _2&\le \Vert \nabla c(X_k)^\dagger \Vert _2 \Vert P_k\Vert _2 \Vert H_k\Vert _2 \Vert (\nabla c(X_k)^\dagger )^T\Vert _2 \Vert c(X_k)\Vert _2 \\&\le \kappa _H r^{-2} \Vert c(X_k) - c(x_\star )\Vert _2 \\&\le \kappa _H L_c r^{-2} \Vert X_k - x_\star \Vert _2, \end{aligned}$$
(26)
where the first inequality follows from (22) and properties of norms; the second follows from \(c(x_\star ) = 0\), Assumptions 2.1 and 2.4, and the fact that \(P_k {:}{=} I-H_kZ_k(Z_k^TH_kZ_k)^{-1}Z_k^T\) is a projection matrix, so \(\Vert P_k\Vert _2 \le 1\); and the last follows from Lipschitz continuity of c (see (6)). Second,
$$\begin{aligned} \Vert M_k (\nabla f(x_\star ) - G_k)\Vert _2&\le \Vert \nabla c(X_k)^\dagger \Vert _2 \Vert P_k\Vert _2 \Vert \nabla f(x_\star ) - G_k\Vert _2 \\&\le r^{-1} \Vert \nabla f(x_\star ) - G_k\Vert _2 \\&\le r^{-1} (\Vert \nabla f(x_\star ) - \nabla f(X_k)\Vert _2 + \Vert \nabla f(X_k) - G_k\Vert _2) \\&\le r^{-1} (L_{\nabla f} \Vert X_k - x_\star \Vert _2 + \Vert \nabla f(X_k) - G_k\Vert _2), \end{aligned}$$
(27)
where the first inequality follows from (22) and properties of norms, the second follows from Assumption 2.1 and \(\Vert P_k\Vert _2 \le 1\) (see above), the third follows from \(\nabla f(x_\star )-G_k = \nabla f(x_\star ) - \nabla f(X_k) + \nabla f(X_k) - G_k\) and the triangle inequality, and the last follows from Lipschitz continuity of \(\nabla f\) (see (6)). Third,
$$\begin{aligned} \Vert (M_\star - M_k) \nabla f(x_\star )\Vert _2 \le L_{{\mathcal {M}}}\Vert X_k - x_\star \Vert _2 \Vert \nabla f(x_\star )\Vert _2 \le \kappa _{\nabla f} L_{{\mathcal {M}}}\Vert X_k - x_\star \Vert _2 \end{aligned}$$
(28)
follows from properties of norms and Assumptions 2.1 and 4.1. Substituting (26), (27), and (28) into (25) yields the desired conclusion. \(\square \)
Theorem 4.1 shows that whenever \(X_k\) is sufficiently close to a primal stationary point \(x_\star \), the Lagrange multiplier error \(\Vert Y_k-y_\star \Vert _2\) is bounded by a constant times the primal iterate error \(\Vert X_k-x_\star \Vert _2\) plus a constant times the error in the stochastic gradient estimator. The latter term is inevitable since, no matter the proximity of \(X_k\) to \(x_\star \), the Lagrange multiplier estimator \(Y_k\) is influenced by \(G_k\). To emphasize this point, we present the following corollary showing that, under the same conditions as Theorem 4.1, the Lagrange multiplier estimator \(Y_k^\textrm{true}\)—which appears in the convergence guarantee [3, Corollary 3.14], even though it is a quantity that is not actually computed in the algorithm—satisfies the same bound as \(Y_k\), except without the additional noise.
Corollary 4.1
Suppose that Assumptions 2.1, 2.2, 2.3, 2.4, and 4.1 hold, \((x_\star ,y_\star )\) is a stationary point for problem (4), and \(\epsilon \in \mathbb {R} _{>0}\) is defined as in Assumption 4.1. Then, for any \(k \in \mathbb {N} \), one finds \(\Vert X_k - x_\star \Vert _2 \le \epsilon \) implies
$$\begin{aligned} \Vert Y_k^\textrm{true}- y_\star \Vert _2 \le \kappa _y \Vert X_k - x_\star \Vert _2, \end{aligned}$$
(29)
where \(\kappa _y \in \mathbb {R} _{>0}\) is defined as in Theorem 4.1.
Proof
Following the same steps as in the proof of Theorem 4.1, the difference \(Y_k^\textrm{true}- y_\star \) can now be decomposed as in (25) with \(G_k = \nabla f(X_k)\). Therefore, in place of (27), one instead obtains the bound
$$\begin{aligned} \Vert M_k (\nabla f(x_\star ) - \nabla f(X_k))\Vert _2 \le r^{-1} L_{\nabla f} \Vert X_k - x_\star \Vert _2, \end{aligned}$$
which when combined with (26), (28), and (25) gives the desired result. \(\square \)
The previous theorem and corollary lead to the following corollary, which considers the special case that there exists an iteration number beyond which the primal iterates remain within a neighborhood of a particular primal stationary point almost surely. In such a case, the expected error in the Lagrange multiplier is bounded by a constant times the distance from the primal iterate to the primal stationary point plus a term that is proportional to the bound on the noise of the stochastic gradient estimators. The corollary also adds that the “true” multiplier does not have this additional error term. (We remind the reader that, in Sect. 3, we have shown conditions under which one can conclude that \(\{X_k\} \xrightarrow {{a.s.}}x_\star \). While this convergence guarantee does not directly imply that \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}k}\in \mathbb {N} \) exists as in the following corollary, it motivates consideration of this special case as one of practical relevance.)
Corollary 4.2
Suppose that Assumptions 2.1, 2.2, 2.3, 2.4, and 4.1 hold, \((x_\star ,y_\star )\) is a stationary point for problem (4), \(\epsilon \in \mathbb {R} _{>0}\) is defined as in Assumption 4.1, and for some \(\hspace{0.83328pt}\overline{\hspace{-0.83328pt}k}\in \mathbb {N} \) one has almost surely that
$$\begin{aligned} \Vert X_k - x_\star \Vert _2 \le \epsilon \ \text { for all }\ k \ge \hspace{0.83328pt}\overline{\hspace{-0.83328pt}k}. \end{aligned}$$
Then, for any \(k \in \mathbb {N} \) with \(k \ge \hspace{0.83328pt}\overline{\hspace{-0.83328pt}k}\), one finds that
$$\begin{aligned} \mathbb {E}[\Vert Y_k - y_\star \Vert _2] \le \kappa _y \mathbb {E}[\Vert X_k - x_\star \Vert _2] + r^{-1} \sigma \ \ \text {and}\ \ \mathbb {E}[\Vert Y_k^\textrm{true}- y_\star \Vert _2] \le \kappa _y \mathbb {E}[\Vert X_k - x_\star \Vert _2], \end{aligned}$$
where \(\kappa _y \in \mathbb {R} _{>0}\) is defined as in Theorem 4.1.
Proof
Assumption 2.3 and Jensen's inequality imply for all \(k \in \mathbb {N} \) that
$$\begin{aligned} \mathbb {E}[\Vert G_k - \nabla f(X_k)\Vert _2 \,|\, {{\mathcal {F}}}_k] \le \sqrt{\mathbb {E}[\Vert G_k - \nabla f(X_k)\Vert _2^2 \,|\, {{\mathcal {F}}}_k]} \le \sigma . \end{aligned}$$
The desired conclusions follow from this fact, the conditions of the corollary, and the conclusions of Theorem 4.1 and Corollary 4.1. \(\square \)
At this point, we have determined useful properties of two multiplier estimator sequences, \(\{Y_k\}\) and \(\{Y_k^\textrm{true}\}\). In the neighborhood of a stationary point \((x_\star ,y_\star )\), each element \(Y_k^\textrm{true}\) of the latter sequence has error that is proportional to the distance from \(X_k\) to the primal stationary point \(x_\star \), which is as much as can be expected. However, this estimator is not realized by the algorithm, since it requires explicit knowledge of the true gradient \(\nabla f(X_k)\). Each element \(Y_k\) of the former sequence has the same error plus an additional error due to the stochastic gradient estimator. One can reduce this additional error by computing a more accurate gradient estimate, say, using a large mini-batch of gradient estimates. However, to avoid the need to compute a highly accurate gradient estimate in any given iteration, one might ask whether it is possible to obtain an estimator with lower error merely under Assumption 2.3. Using classical ideas from statistical estimation, specifically mean estimation, a natural idea to consider is whether a more accurate estimator can be defined by an average, say, \(Y_k^\textrm{avg}{:}{=} \tfrac{1}{k} \sum _{i=1}^k Y_i\). After all, if the stochastic process \(\{X_k\}\) converges in some sense to \(x_\star \), then with \(Y_k\) defined through a continuous function of \(X_k\) in a neighborhood of \(x_\star \) for all \(k \in \mathbb {N}\), one might expect \(\{Y_k^\textrm{avg}\}\) to converge in the same sense to \(y_\star \).
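As a computational aside, such an average need not be formed by storing all past multipliers. A minimal Matlab sketch of an incremental update is as follows, where compute_multiplier is a hypothetical placeholder for the multiplier computed by the SQP step at iteration k:

y_avg = zeros(m, 1);                      % m = number of equality constraints
for k = 1:maxit
    y_new = compute_multiplier(k);        % placeholder for the SQP multiplier Y_k
    y_avg = y_avg + (y_new - y_avg) / k;  % running mean; equals mean of Y_1,...,Y_k
end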
The main challenge in the analysis of \(\{Y_k^\textrm{avg}\}\) is that, in contrast to the classical setting of mean estimation using independent and identically distributed random variables, the multipliers \(\{Y_k\}\) are computed at points \(\{X_k\}\) at which the distributions of \(\{G_k\}\) (or of \(\{\nabla f(X_k)\}\), for that matter) are neither independent nor identically distributed. Hence, our analysis of the averaged sequence requires some additional assumptions that are introduced in the following lemma. We prove the lemma and a subsequent theorem, then discuss and justify these required assumptions in further detail.
Lemma 4.1
Suppose that Assumptions 2.1, 2.2, 2.3, 2.4, and 4.1 hold, and with \(M_k\) defined in (22) and \(\varDelta _k {:}{=} \nabla f(X_k) - G_k\) for all \(k \in \mathbb {N} \) one finds
Then, it follows that
Proof
For \((k,i) \in \mathbb {N} \times [k]\), let \({{\mathcal {F}}}_{k,i} {:}{=} {{\mathcal {F}}}_i\) and \(\eta _{k,i} {:}{=} \tfrac{1}{\sqrt{k}} M_i \varDelta _i\). Under Assumption 2.3, it follows that \(\mathbb {E}[\eta _{k,i} | {{\mathcal {F}}}_{k,i}] = 0\) for all \((k,i) \in \mathbb {N} \times [k]\) and that \(\{(\eta _{k,i}, {{\mathcal {F}}}_{k,i})\}_{k \in \mathbb {N} ,i \in [k]}\) is an n-dimensional martingale difference triangular array (see Lemma 1.3). Employing Lemma 1.3, one finds that
This fact, Lemma 1.4, and (30d) together yield
where \({\tilde{\sigma }} {:}{=} \mathbb {E}[\Vert X\Vert _2]\) for \(X \sim {{\mathcal {N}}}(0, \varSigma )\). Therefore,
which gives the desired result (31). \(\square \)
The previous lemma allows us to prove the following theorem. We remark upfront that the iteration index \(\bar{k} \in \mathbb {N}\) that is used in the definition of \(\{Y_k^\textrm{avg}\}_{k=\bar{k}}^\infty \) in the theorem is not necessarily known in advance by the algorithm. In particular, it is not necessarily known by the algorithm when the primal iterate sequence may be guaranteed to have entered a neighborhood of a primal stationary point such that the conditions of Assumption 4.1 hold with a certain \(\epsilon \in \mathbb {R} _{>0}\). A stronger assumption would be that \(\bar{k} = 1\), i.e., that the algorithm is initialized at a point within a sufficiently small neighborhood of \(x_\star \) in which it remains almost surely. However, we state the result for general \(\bar{k}\), since in practice it would be reasonable to consider a scheme that computes an averaged Lagrange multiplier estimate only over those recent multipliers whose corresponding primal iterates are within a prescribed neighborhood of the current iterate. We consider such a strategy in our experiments in Sect. 5.
Theorem 4.2
Suppose that Assumptions 2.1, 2.2, 2.3, 2.4, and 4.1 hold, \((x_\star ,y_\star )\) is a stationary point for problem (4), \(\epsilon \in \mathbb {R} _{>0}\) is defined as in Assumption 4.1, for some \(\bar{k} \in \mathbb {N}\) one has almost surely that
\(\{M_k\}\) and \(\{\varDelta _k\}\) are defined as in Lemma 4.1, and (30) holds. Then, with \(Y_k^\textrm{avg}{:}{=} \tfrac{1}{k-\bar{k}+1} \sum _{i=\bar{k}}^k Y_i\) for all \(k \in \mathbb {N}\) with \(k \ge \bar{k}\), one finds that
Proof
As in the proof of Theorem 4.1, one finds for any \(k \in \mathbb {N}\) with \(k \ge \bar{k}\) that
Thus, under the conditions of the theorem, it follows with Lemma 4.1 that
as desired. \(\square \)
The conditions in (30) in Lemma 4.1 are admittedly nontrivial, but upon closer inspection it is not surprising that conditions akin to these would be needed in order to draw any useful conclusions about an averaged Lagrange multiplier sequence. First, consider conditions (30a) and (30d), which essentially require that the expected value of the norm of \(M_k \varDelta _k\) is bounded uniformly over \(k \in \mathbb {N} \). Indeed, observe that if one were instead to assume that
then (30a) follows directly while one also finds
which yields (30d) for \(\varTheta =2\). It is not surprising that this expected value would be required to be bounded uniformly, since the elements of \(\{M_k\varDelta _k\}\) arise as the error terms in the proof of Theorem 4.2. Second, beyond mere boundedness, which would not be sufficient on its own, conditions (30b) and (30c) require convergence properties of \(\{M_k\varDelta _k\}\) that match those needed for the multidimensional martingale central limit theorem, which we have stated as Lemma 1.3. Notice that, under the conditions of Assumption 4.1, the elements of \(\{M_k\}\) are defined by a continuous (in fact, Lipschitz continuous) mapping \({{\mathcal {M}}}\) of the elements of \(\{X_k\}\); hence, (30b) and (30c) essentially require convergence in probability of averages of inner and outer products of the elements of \(\{{{\mathcal {M}}}(X_k) (\nabla f(X_k) - G_k)\}\), which is a reasonable assumption to consider when, say, \(\{X_k\} \xrightarrow {{a.s.}}x_\star \).
We close this section with the following corollary to Theorem 4.2.
Corollary 4.3
Suppose that the conditions of Theorem 4.2 hold and that the iterate sequence converges almost surely to \(x_\star \), i.e., \(\{X_k\} \xrightarrow {{a.s.}}x_\star \). Then,
Proof
The almost-sure convergence of \(\{X_k\}\) to \(x_\star \) implies that
In addition, from (29), one finds for all \(\omega \in \varOmega \) that
These facts combined with nonnegativity of norms yield
which again with nonnegativity of norms further implies
Consequently, \(\{Y_k^\textrm{true}\}_{k=\bar{k}}^\infty \xrightarrow {{a.s.}}y_\star \), so \(\{Y_k^\textrm{true}\} \xrightarrow {{a.s.}}y_\star \), as desired.
Let us now consider the averaged sequence. Recall that under the conditions of the corollary one has for all \(\omega \in \varOmega \) that
Thus, one obtains along with (30d) that
Thus, there exists a null (possibly empty) set \(\varOmega _0 \subset \varOmega \) such that
On the other hand, \(\{X_k\} \xrightarrow {{a.s.}}x_\star \) implies \(\{\Vert X_k-x_\star \Vert _2\} \xrightarrow {{a.s.}}0\). Thus, there exists a null (possibly empty) set \(\varOmega _1 \subset \varOmega \) such that
Then, recall from (32) that for all \(\omega \in \varOmega \) one finds
which along with the fact that \(\varOmega _0\) and \(\varOmega _1\) are null sets, (34), and (35) implies
Combined with nonnegativity of norms, it follows that
which further implies that \(\{Y_k^\textrm{avg}\}_{k=\bar{k}}^\infty \xrightarrow {{a.s.}}y_\star \), as desired. \(\square \)
5 Numerical Demonstrations
In this section, we provide the results of a small set of numerical experiments to demonstrate our theoretical analysis. Specifically, we show that, when solving each instance in a set of test problems, the primal iterates generated by an implementation of Algorithm 1 approach a primal stationary point (in fact, a global minimizer), and, as this occurs, the true and averaged Lagrange multiplier sequences converge even though the sequence of Lagrange multipliers itself does not (due to gradient errors). This behavior is representative of what we have witnessed when solving other problems as well (not shown due to space considerations). In terms of averaged Lagrange multipliers, we consider averages over all iterations as well as averages determined through practical strategies that attempt to ignore multipliers obtained at primal iterates that are far from a primal stationary point (or at least far from the current iterate generated by the algorithm).
For our numerical demonstrations, we consider constrained logistic regression; see also [2]. In particular, we consider problem instances of the form
where \(D = [d_1\ \cdots \ d_N] \in \mathbb {R}^{n \times N}\) is a feature matrix, \(\gamma \in \mathbb {R}^{N}\) is a label vector, \(A \in \mathbb {R}^{m\times n}\), and \(b \in \mathbb {R}^{m}\). This problem is nonconvex due to the nonlinear constraint \(\Vert x\Vert _2^2 = 1\), although we confirmed in all runs that the primal iterates approach the globally optimal solution; see below. For the feature matrices and label vectors, we considered four datasets from LIBSVM [13], namely, a9a (\((n, N) = (123, 32561)\)), australian (\((n, N) = (14, 690)\)), ionosphere (\((n, N) = (34, 351)\)), and splice (\((n, N) = (60, 1000)\)). These datasets were selected as ones that led to relatively large errors in the stochastic gradient estimates. (See below for further information about how the gradient estimates were computed in our experiments.) For each problem instance, we chose \(m=10\) and randomly generated the initial point \(x_1\) and constraint data (A, b), keeping these quantities fixed over all runs for a given problem instance. Specifically, each entry of these quantities was drawn from a standard normal distribution, and it was confirmed in each case that A had full row rank.
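Written out, the problem just described takes a form along the following lines (paraphrased from the formulation used in [2, 3]; the precise display in those references may differ in minor details):
\[ \min _{x \in \mathbb {R}^n}\ \frac{1}{N} \sum _{i=1}^N \log \bigl( 1 + e^{-\gamma _i (d_i^T x)} \bigr) \quad \text {subject to} \quad Ax = b, \ \ \Vert x\Vert _2^2 = 1. \]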
For the implementation of Algorithm 1, we used the Matlab implementation of [3] known as StochasticSQP. For simplicity, we set \(H_k\) to the identity matrix I for all iterations in all runs of the algorithm. We solved each linear system (9) with Matlab’s built-in mldivide (\(\backslash \)) function, which in particular meant that, as would be recommended in general for implementations of the algorithm, all Lagrange multipliers were computed directly from this linear system, not by computing a null space basis \(Z_k\). We chose \(\beta _k = {{\mathcal {O}}}(\tfrac{1}{k})\) with \(\beta _1 = 1\). For each problem instance, to obtain \((x_\star ,y_\star )\), we initially ran StochasticSQP with true gradients in all iterations until a stationary point was found to high accuracy, then verified that at this point both first- and second-order stationarity conditions held to high accuracy, thus verifying (approximate) global optimality of the final iterate. Then, for each instance, we ran StochasticSQP with mini-batch gradient estimates with a mini-batch size of 16 to generate \(\{(x_k,y_k,y_k^\textrm{true})\}\). (For each run, the sequence \(\{\beta _k\}\) was chosen as an unsummable, yet square-summable sequence that was tuned so that the primal iterates tended toward the globally optimal primal solution.) We verified that in all runs the adaptive parameter rules implemented in StochasticSQP kept the merit parameter sequence \(\{\tau _k\}\) constant at its initial value of 0.1 and kept the ratio parameter sequence \(\{\xi _k\}\) constant at its initial value of 1.0; hence, the runs respected the simplified variant of the algorithm from [3] that we have analyzed as Algorithm 1.
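For concreteness, the core linear-algebra operation in each iteration, namely, solving a system like (9) and taking a step of length \(\beta _k\) along the computed direction, can be sketched in Matlab as follows. This is a simplified illustration with \(H_k = I\), not the StochasticSQP code itself; the names g_k, c_k, and J_k (stochastic objective gradient, constraint values, and constraint Jacobian at x_k) are ours:

% One simplified stochastic SQP iteration with H_k = I (cf. system (9)):
%   [ I   J_k' ] [ d_k ]   [ -g_k ]
%   [ J_k  0   ] [ y_k ] = [ -c_k ]
n = numel(x_k);  m = numel(c_k);
K   = [eye(n), J_k.'; J_k, zeros(m)];   % KKT matrix
rhs = [-g_k; -c_k];
sol = K \ rhs;                          % mldivide, as in our experiments
d_k = sol(1:n);                         % primal search direction
y_k = sol(n+1:end);                     % Lagrange multiplier estimate
x_next = x_k + beta_k * d_k;            % step with beta_k = O(1/k)

This mirrors the point made above that the multipliers come directly from the linear system rather than from a null space basis.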
The results for each problem instance are shown in Figure 1. Each plot in the figure shows the results of a single run with a budget of \(k = 10^5\) iterations, although we verified that the results were qualitatively consistent over other runs (with different random seeds) as well. Each plot shows the sequences \(\{\Vert x_k - x_\star \Vert _2\}\), \(\{\Vert y_k - y_\star \Vert _2\}\), and \(\{\Vert y_k^\textrm{true}- y_\star \Vert _2\}\), where \(x_k\), \(y_k\), and \(y_k^\textrm{true}\) are defined as in the prior sections. Each plot also shows the sequences \(\{\Vert y_k^\textrm{avg}- y_\star \Vert _2\}\) and \(\{\Vert y_k^{\textrm{avg}_\epsilon } - y_\star \Vert _2\}\) for a few values of \(\epsilon \in \mathbb {R} _{>0}\), where \(y_k^\textrm{avg}\) is the Lagrange multiplier averaged over all iterations (from the initial point), whereas \(y_k^{\textrm{avg}_\epsilon }\) is the Lagrange multiplier averaged over iterations \(\{k',\dots ,k\}\), where for each \(k \in \mathbb {N}\) the index \(k'\) is the smallest value in [k] such that \(\Vert x_j - x_k\Vert _2 \le \epsilon \) for all \(j \in \{k',\dots ,k\}\). In this manner, for relatively large values of \(\epsilon \), the average is likely to be taken over more iterations (thus reducing noise), whereas for relatively small values of \(\epsilon \), the average is taken only over iterations in which the primal point is relatively close to the current primal iterate (to improve accuracy of the estimate as \(x_k\) nears \(x_\star \)).
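A minimal sketch of how the window index \(k'\) and the corresponding average can be computed is given below, assuming the primal iterates and multipliers are stored as columns of arrays X and Y, respectively (these array names are ours):

% Windowed multiplier average y_k^{avg_epsilon}: average Y(:,kp:k), where kp
% is the smallest index such that ||x_j - x_k|| <= epsilon for all j in {kp,...,k}.
kp = k;
while kp > 1 && norm(X(:, kp-1) - X(:, k)) <= epsilon
    kp = kp - 1;
end
y_avg_eps = mean(Y(:, kp:k), 2);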
Figure 1 demonstrates the results of our theoretical analysis. In each run, within the iterations performed before the computational budget was exceeded, the distance of the primal iterate to the primal solution typically decreases monotonically (due to the tuned step-size sequence, as mentioned). Thus, each run provides an instance in which the primal iterates appear to converge despite the use of stochastic gradient estimates. At the same time, the distance of the Lagrange multiplier estimate to the optimal Lagrange multiplier oscillates and remains relatively large due to the errors in the stochastic gradient estimates, whereas the distances of the true and averaged Lagrange multiplier estimates reach much smaller values. Since the true Lagrange multiplier estimates are unobtainable in practice, this shows that averaged Lagrange multipliers may be employed to obtain better estimates than the sequence \(\{y_k\}\) itself. In some cases, averaging over all iterations may be sufficient, but if desired one may choose \(\epsilon \in \mathbb {R} _{>0}\) (tuning the value, if computationally feasible) and consider a sequence such as \(\{y_k^{\textrm{avg}_\epsilon }\}\) instead.
6 Conclusion
We have presented new convergence analyses for a stochastic SQP method built on the stochastic-gradient methodology. In particular, for a simplified variant of the algorithm from [3], we have proved almost-sure convergence guarantees for the primal iterates, Lagrange multiplier estimates, and stationarity measures. It has been shown that the error in the Lagrange multipliers is bounded by the distance of the primal iterate to a primal stationary point plus the error due to the stochastic gradient estimate. Furthermore, under modest assumptions, the latter error can be shown to vanish by employing a running average of the Lagrange multipliers. The results of numerical experiments demonstrate our theoretical guarantees.
Data Availability
The datasets analyzed during the current study are available in the LIBSVM repository, https://www.csie.ntu.edu.tw/~cjlin/libsvm/.
References
Berahas, A.S., Bollapragada, R., Zhou, B.: An adaptive sampling sequential quadratic programming method for equality constrained stochastic optimization. arXiv e-prints, arXiv:2206.00712v2 (2023)
Berahas, A.S., Curtis, F.E., O’Neill, M.J., Robinson, D.P.: A stochastic sequential quadratic optimization algorithm for nonlinear equality constrained optimization with rank-deficient Jacobians. Math. Oper. Res., article moor.2021.0154 (2023)
Berahas, A.S., Curtis, F.E., Robinson, D.P., Zhou, B.: Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM J. Optim. 31(2), 1352–1379 (2021)
Berahas, A.S., Shi, J., Yi, Z., Zhou, B.: Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction. Comput. Optim. Appl. 86(1), 79–116 (2023)
Berahas, A.S., Xie, M., Zhou, B.: A sequential quadratic programming method with high probability complexity bounds for nonlinear equality constrained stochastic optimization. arXiv e-prints, arXiv:2301.00477 (2023)
Bertsekas, D.P.: Network Optimization: Continuous and Discrete Models. Optimization and Neural Computation Series. Athena Scientific (1998)
Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)
Betts, J.T.: Practical Methods for Optimal Control and Estimation Using Nonlinear Programming, 2nd edn. Advances in Design and Control. Society for Industrial and Applied Mathematics (2010)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review 60(2), 223–311 (2018)
Byrd, R.H., Nocedal, J.: An analysis of reduced Hessian methods for constrained optimization. Math. Program. 49, 285–323 (1990)
Byrd, R.H., Schnabel, R.B.: Continuity of the null space basis and constrained optimization. Math. Program. 35, 32–41 (1986)
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016)
Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)
Chung, K.L.: A Course in Probability Theory, 3rd edn. Academic Press (2001)
Cuomo, S., Schiano Di Cola, V., Giampaolo, F., Rozza, G., Raissi, M., Piccialli, F.: Scientific machine learning through physics-informed neural networks: where we are and what’s next. J. Sci. Comput. 92(3), 88 (2022)
Curtis, F.E., O’Neill, M.J., Robinson, D.P.: Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization. Math. Program. (2023). https://doi.org/10.1007/s10107-023-01981-1
Curtis, F.E., Robinson, D.P., Zhou, B.: Sequential quadratic optimization for stochastic optimization with deterministic nonlinear inequality and equality constraints. SIAM J. Optim. 34(4), 3592–3622 (2024)
Durrett, R.: Probability: Theory and Examples, 5th edn. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press (2019)
Fang, Y., Na, S., Mahoney, M.W., Kolar, M.: Fully stochastic trust-region sequential quadratic programming for equality-constrained optimization problems. SIAM J. Optim. 34(2), 2007–2037 (2024)
Hall, P., Heyde, C.C.: Martingale Limit Theory and Its Application. Academic Press, UK (2014)
Jiang, X., Vandenberghe, L.: Bregman three-operator splitting methods. J. Optim. Theory Appl. 196(3), 936–972 (2023)
Karniadakis, G.E., Kevrekidis, I.G., Lu, L., Perdikaris, P., Wang, S., Yang, L.: Physics-informed machine learning. Nat. Rev. Phys. 3(6), 422–440 (2021)
Kupfer, F.S., Sachs, E.W.: Numerical solution of a nonlinear parabolic control problem by a reduced SQP method. Comput. Optim. Appl. 1(1), 113–135 (1992)
Li, T., Xiao, T., Yang, G.: Revisiting the central limit theorems for the SGD-type methods. arXiv e-prints, arXiv:2207.11755 (2023)
McLeish, D.L.: Dependent central limit theorems and invariance principles. Ann. Probab. 2(4), 620–628 (1974)
Na, S., Anitescu, M., Kolar, M.: An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians. Math. Program. 199(1–2), 721–791 (2023)
Na, S., Anitescu, M., Kolar, M.: Inequality constrained stochastic nonlinear optimization via active-set sequential quadratic programming. Math. Program. (2023). https://doi.org/10.1007/s10107-023-01935-7
Na, S., Mahoney, M.W.: Asymptotic convergence rate and statistical inference for stochastic sequential quadratic programming. arXiv e-prints, arXiv:2205.13687 (2022)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Oztoprak, F., Byrd, R., Nocedal, J.: Constrained optimization in the presence of noise. SIAM J. Optim. 33(3), 2118–2136 (2023)
Polyak, B.T.: Gradient methods for minimization of functionals. USSR Comput. Math. Math. Phys. 3(3), 643–653 (1963)
Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 30(4), 838–855 (1992)
Qiu, S., Kungurtsev, V.: A sequential quadratic programming method for optimization with stochastic objective functions, deterministic inequality constraints and robust subproblems. arXiv e-prints, arXiv:2302.07947 (2023)
Rees, T., Dollar, H.S., Wathen, A.J.: Optimal solvers for PDE-constrained optimization. SIAM J. Sci. Comput. 32(1), 271–298 (2010)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Robbins, H., Siegmund, D.: A convergence theorem for non-negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Elsevier (1971)
Wedin, P.: Perturbation theory for pseudo-inverses. BIT Numer. Math. 13(2), 217–232 (1973)
Acknowledgements
The authors thank Albert S. Berahas for sharing his Matlab code for the constrained logistic regression problem used in the numerical demonstrations for this paper. They also thank the editors and two anonymous reviewers for their useful suggestions that have helped to improve the manuscript. This work is supported by the U.S. National Science Foundation (NSF) award IIS-1704458 and U.S. Office of Naval Research award N00014-21-1-2532.