Frank–Wolfe and friends: a journey into projection-free first-order optimization methods

Bomze, Immanuel. M.; Rinaldi, Francesco; Zeffiro, Damiano

doi:10.1007/s10479-024-06251-7

Frank–Wolfe and friends: a journey into projection-free first-order optimization methods

Original Research
Open access
Published: 16 September 2024

Volume 343, pages 607–638, (2024)
Cite this article

Download PDF

You have full access to this open access article

Annals of Operations Research Aims and scope Submit manuscript

Frank–Wolfe and friends: a journey into projection-free first-order optimization methods

Download PDF

824 Accesses
Explore all metrics

Abstract

Invented some 65 years ago in a seminal paper by Marguerite Straus-Frank and Philip Wolfe, the Frank–Wolfe method recently enjoys a remarkable revival, fuelled by the need of fast and reliable first-order optimization methods in Data Science and other relevant application areas. This review tries to explain the success of this approach by illustrating versatility and applicability in a wide range of contexts, combined with an account on recent progress in variants, both improving on the speed and efficiency of this surprisingly simple principle of first-order optimization.

Frank–Wolfe and friends: a journey into projection-free first-order optimization methods

Article Open access 06 September 2021

A Note on Open Problems and Challenges in Optimization Theory and Algorithms

Nonlinear Optimization: A Brief Overview

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

In their seminal work (Frank & Wolfe, 1956), Marguerite Straus-Frank and Philip Wolfe introduced a first-order algorithm for the minimization of convex quadratic objectives over polytopes, now known as Frank–Wolfe (FW) method. The main idea of the method is simple: to generate a sequence of feasible iterates by moving at every step towards a minimizer of a linearized objective, the so-called FW vertex. Subsequent works, partly motivated by applications in optimal control theory (see Dunn (1979) for references), generalized the method to smooth (possibly non-convex) optimization over closed subsets of Banach spaces admitting a linear minimization oracle (see Demyanov and Rubinov (1970), Dunn and Harshbarger (1978)).

Furthermore, while the ${{\mathcal {O}}}(1/k)$ rate in the original article was proved to be optimal when the solution lies on the boundary of the feasible set (Canon & Cullum, 1968), improved rates were given in a variety of different settings. In Levitin and Polyak (1966) and Demyanov and Rubinov (1970), a linear convergence rate was proved over strongly convex domains assuming a lower bound on the gradient norm, a result then extended in Dunn (1979) under more general gradient inequalities. In Guelat and Marcotte (1986), linear convergence of the method was proved for strongly convex objectives with the minimum obtained in the relative interior of the feasible set.

The slow convergence behaviour for objectives with solution on the boundary motivated the introduction of several variants, the most popular being Wolfe’s away step Wolfe 1970. Wolfe’s idea was to move away from bad vertices, in case a step of the FW method moving towards good vertices did not lead to sufficient improvement on the objective. This idea was successfully applied in several network equilibrium problems, where linear minimization can be achieved by solving a min-cost flow problem (see Fukushima (1984) and references therein). In Guelat and Marcotte (1986), some ideas already sketched by Wolfe were formalized to prove linear convergence of the Wolfe’s away step method and identification of the face containing the solution in finite time, under some suitable strict complementarity assumptions.

In recent years, the FW method has regained popularity thanks to its ability to handle the structured constraints appearing in machine learning and data science applications efficiently. Examples include LASSO, SVM training, matrix completion, minimum enclosing ball, density mixture estimation, cluster detection, to name just a few (see Sect. 3 for further details).

One of the main features of the FW algorithm is its ability to naturally identify sparse and structured (approximate) solutions. For instance, if the optimization domain is the simplex, then after k steps the cardinality of the support of the last iterate generated by the method is at most $k + 1$. Most importantly, in this setting every vertex added to the support at every iteration must be the best possible in some sense, a property that connects the method with many greedy optimization schemes (Clarkson, 2010). This makes the FW method pretty efficient on the abovementioned problem class. Indeed, the combination of structured solutions with often noisy data makes the sparse approximations found by the method possibly more desirable than high precision solutions generated by a faster converging approach. In some cases, like in cluster detection (see, e.g., Bomze (1997)), finding the support of the solution is actually enough to solve the problem independently from the precision achieved.

Another important feature is that the linear minimization used in the method is often cheaper than the projections required by projected-gradient methods. It is important to notice that, even when these two operations have the same complexity, constants defining the related bounds can differ significantly (see Combettes and Pokutta (2021) for some examples and tests). When dealing with large scale problems, the FW method hence has a much smaller per-iteration cost with respect to projected-gradient methods. For this reason, FW methods fall into the category of projection-free methods (Lan, 2020). Furthermore, the method can be used to approximately solve quadratic subproblems in accelerated schemes, an approach usually referred to as conditional gradient sliding (see, e.g., Lan and Zhou (2016)).

1.1 Organisation of the paper

The present review, which extends the work in Bomze et al. (2021), is not intended to provide an exhaustive literature survey, but rather as an advanced tutorial demonstrating versatility and power of this approach. The article is structured as follows: in Sect. 2, we introduce the classic FW method, together with a general scheme for all the methods we consider. In Sect. 3, we present applications from classic optimization to more recent machine learning problems. In Sect. 4, we review some important stepsizes for first order methods. In Sect. 5, we discuss the main theoretical results about the FW method and the most popular variants, including the ${{\mathcal {O}}}(1/k)$ convergence rate for convex objectives, affine invariance, the sparse approximation property, and support identification. In Sect. 6 we illustrate some recent improvements on the ${{\mathcal {O}}}(1/k)$ convergence rate.

In Sect. 7, we describe a generalization of the classic FW to the composite non-smooth optimization setting, and in particular its correspondence with mirror descent via Fenchel duality. In Sect. 8 we present recent FW variants fitting different optimization frameworks, in particular block coordinate, distributed, accelerated, and trace norm optimization. We highlight that all the proofs reported in the paper are either seminal, or simplified versions of proofs reported in published papers, and we believe they might give some useful technical insights to the interested reader.

1.2 Notation

For any integers a and b, denote by $[{a}\! : \! {b}] = \{ x \text{ integer }: a\le x\le b\}$ the integer range between them. For a set V, the power set $2^V$ denotes the system all subsets of V, whereas for any positive integer $s\in \mathbb {N}$ we set ${V\atopwithdelims ()s}:= \{ S\in 2^V: |S| = s\}$, with |S| denoting the number of elements in S. Matrices are denoted by capital sans-serif letters (e.g., the zero matrix ${\textsf{O}}$, or the $n\times n$ identity matrix ${\textsf{I}}_n$ with columns ${\textsf{e}}_i$ the length of which should be clear from the context). The all-ones vector is ${\textsf{e}}:=\sum _i {\textsf{e}}_i\in \mathbb {R}^n$. Generally, vectors are always denoted by boldface sans-serif letters ${\textsf{x}}$, and their transpose by ${\textsf{x}} ^{\intercal }$. The Euclidean norm of ${\textsf{x}}$ is then $\Vert {\textsf{x}} \Vert := \sqrt{{\textsf{x}} ^{\intercal }{\textsf{x}}}$ whereas the general p-norm is denoted by ${\Vert {\textsf{x}} \Vert }_p$ for any $p\ge 1$ (so ${\Vert {\textsf{x}} \Vert }_2=\Vert {\textsf{x}} \Vert $). By contrast, the so-called zero-norm simply counts the number of nonzero entries:

$$\begin{aligned} {\Vert {\textsf{x}} \Vert }_0:= |\{ i\in [{1}\! : \! {n}]: x_i\ne 0\}|\,. \end{aligned}$$

For a vector ${\textsf{d}}$ we denote as $\widehat{{\textsf{d}}}:=\frac{1}{\Vert {\textsf{d}} \Vert }\,{\textsf{d}}$ its normalization, with the convention $\widehat{{\textsf{d}}} = {\textsf{o}}$ if ${\textsf{d}}= {\textsf{o}}$. Here ${\textsf{o}}$ denotes the zero vector. In context of symmetric matrices, “$\mathop {\textrm{psd}}\limits $” abbreviates “positive-semidefinite”.

2 Problem and general scheme

We consider the following problem:

$$\begin{aligned} \min _{{\textsf{x}}\in C} f({\textsf{x}}) \end{aligned}$$

(1)

where, unless specified otherwise, $C$ is a convex and compact (i.e. bounded and closed) subset of $\mathbb {R}^n$ and f is a differentiable function having Lipschitz continuous gradient with constant $L>0$:

$$\begin{aligned} \Vert \nabla f({\textsf{x}}) - \nabla f({\textsf{y}}) \Vert \le L \Vert {\textsf{x}}- {\textsf{y}} \Vert \quad {\text{ for } \text{ all } \{{\textsf{x}},{\textsf{y}}\} \subset C\,.} \end{aligned}$$

This is a central property required in the analysis of first-order methods. Such a property indeed implies (and for a convex function is equivalent to) the so-called Descent Lemma (see, e.g., Bertsekas (2015), Proposition 6.1.2), which provides a quadratic upper approximation to the function f. Throughout the article, we denote by ${\textsf{x}}^*$ a (global) solution to (1) and use the symbol $f^*:=~f({\textsf{x}}^*)$ as a shorthand for the corresponding optimal value.

The general scheme of the first-order methods we consider for problem (1), reported in Algorithm 1, is based upon a set $F({\textsf{x}},{\textsf{g}})$ of directions feasible at ${\textsf{x}}$ using first-order local information on f around ${\textsf{x}}$, in the smooth case ${\textsf{g}}=\nabla f({\textsf{x}})$. From this set, a particular ${\textsf{d}}\in F({\textsf{x}},{\textsf{g}})$ is selected, with the maximal stepsize $\alpha ^{\max }$ possibly dependent from auxiliary information available to the method (at iteration k, we thus write $\alpha ^{\max }_k$), and not always equal to the maximal feasible stepsize.

2.1 The classical Frank–Wolfe method

The classical FW method for minimization of a smooth objective f generates a sequence of feasible points $\{{\textsf{x}}_k\}$ following the scheme of Algorithm 2. At the iteration k it moves toward a vertex i.e., an extreme point, of the feasible set minimizing the scalar product with the current gradient $\nabla f({\textsf{x}}_k)$. It therefore makes use of a linear minimization oracle (LMO) for the feasible set $C$

$$\begin{aligned} { \text {LMO}_{C}({\textsf{g}}) \in \mathop {\mathrm {arg\,min}}\limits _{{\textsf{z}}\in C} {\textsf{g}}^\intercal {\textsf{z}} }\,, \end{aligned}$$

(2)

defining the descent direction as

$$\begin{aligned} {\textsf{d}}_k = {\textsf{d}}_k^{FW}:= {\textsf{s}}_k - {\textsf{x}}_k, \ \ {\textsf{s}}_k \in \text {LMO}_{C}(\nabla f(x_k)) \,. \end{aligned}$$

(3)

In particular, the update at step 6 can be written as

$$\begin{aligned} {\textsf{x}}_{k + 1} = {\textsf{x}}_k + \alpha _k ({\textsf{s}}_k - {\textsf{x}}_k) = \alpha _k {\textsf{s}}_k + (1 - \alpha _k) {\textsf{x}}_k \end{aligned}$$

(4)

Since $\alpha _k \in [0, 1]$, by induction ${\textsf{x}}_{k + 1}$ can be written as a convex combination of elements in the set $S_{k + 1}:= \{{\textsf{x}}_0\} \cup \{{\textsf{s}}_i\}_{0 \le i \le k}$. When $C = \mathop {\textrm{conv}}\limits (A)$ for a set A of points with some common property, usually called "elementary atoms", if $x_0 \in A$ then $x_{k}$ can be written as a convex combination of $k + 1$ elements in A. Note that due to Caratheodory’s theorem, we can even limit the number of occurring atoms to $\min \{k,n\}+1$. In the rest of the paper the primal gap at iteration k is defined as $h_k=f({\textsf{x}}_k)-f^*$.

3 Examples

FW methods and variants are a natural choice for constrained optimization on convex sets admitting a linear minimization oracle significantly faster than computing a projection. We present here in particular the traffic assignment problem, submodular optimization, LASSO problem, matrix completion, adversarial attacks, minimum enclosing ball, SVM training, maximal clique search in graphs, sparse optimization.

3.1 Traffic assignment

Finding a traffic pattern satisfying the equilibrium conditions in a transportation network is a classic problem in optimization that dates back to Wardrop’s paper (Wardrop, 1952). Let $\mathcal {G}$ be a network with set of nodes $[{1}\! : \! {n}]$. Let $\{D(i, j)\}_{i \ne j}$ be demand coefficients, modeling the amount of goods with destination j and origin i. For any i, j with $i\ne j$ let furthermore $f_{ij}: \mathbb {R}\rightarrow \mathbb {R}$ be the non-linear (convex) cost functions, and $x_{ij}^s$ be the flow on link (i, j) with destination s. The traffic assignment problem can be modeled as the following non-linear multicommodity network problem (Fukushima, 1984):

$$\begin{aligned} \min \left\{ \sum _{i, j} f_{ij}\left( \sum _s x_{ij}^s\right) : \sum _{i} x_{\ell i}^s - \sum _{j} x^s_{j\ell } = D(\ell , s) \,, \text{ all } \ell \ne s, \, \, x_{ij}^s \ge 0 \right\} \,. \end{aligned}$$

(5)

Then the linearized optimization subproblem necessary to compute the FW vertex takes the form

$$\begin{aligned} \min \left\{ \sum _s\sum _{i,j}c_{ij}x_{ij}^s: \sum _{i} x_{\ell i}^s - \sum _{j} x^s_{j\ell } = D(\ell , s), \, \ell \ne s, \, \, x_{ij}^s \ge 0 \right\} \end{aligned}$$

(6)

and can be split in n shortest paths subproblems, each of the form

$$\begin{aligned} \min \left\{ \sum _{i,j}c_{ij}x_{ij}^s: \sum _{i} x_{\ell i}^s - \sum _{j} x^s_{j\ell } = D(\ell , s), \, \ell \ne s, \, \, x_{ij}^s \ge 0 \right\} \end{aligned}$$

(7)

for a fixed $s \in [{1}\! : \! {n}]$, with $c_{ij}$ the first-order derivative of $f_{ij}$ (see Fukushima (1984) for further details). A number of FW variants were proposed in the literature for efficiently handling this kind of problems (see, e.g., Bertsekas (2015), Fukushima (1984), LeBlanc et al. (1975), Mitradjieva and Lindberg (2013), Weintraub et al. (1985) and references therein for further details). Some of those variants represent a good (if not the best) choice when low or medium precision is required in the solution of the problem (Perederieieva et al., 2015).

In the more recent work (Joulin et al., 2014) a FW variant also solving a shortest path subproblem at each iteration was applied to image and video co-localization.

3.2 Submodular optimization

Given a finite set V, a function $r: 2^V \rightarrow \mathbb {R}$ is said to be submodular if for every $A, B \subset V$

$$\begin{aligned} r(A) + r(B) \ge r(A \cup B) + r(A \cap B) \,. \end{aligned}$$

(8)

As is common practice in the optimization literature (see e.g. Bach (2013), Section 2.1), here we always assume $s(\emptyset ) = 0$. A number of machine learning problems, including image segmentation and sensor placement, can be cast as minimization of a submodular function (see, e.g., Bach (2013), Chakrabarty et al. (2014) and references therein for further details):

$$\begin{aligned} \min _{A \subseteq V} r(A)\,. \end{aligned}$$

(9)

Submodular optimization can also be seen as a more general way to relate combinatorial problems to convexity, for example for structured sparsity (Bach, 2013; Jaggi, 2013). By a theorem from Fujishige (1980), problem (9) can be in turn reduced to an minimum norm point problem over the base polytope

$$\begin{aligned} B(G) = \{{\textsf{s}}\in \mathbb {R}^V: \sum _{a \in A} s_a \le r(A) \text{ for } \text{ all } A \subseteq V\,, \, \sum _{a \in V} s_a = r(V) \} \,. \end{aligned}$$

(10)

For this polytope, linear optimization can be achieved with a simple greedy algorithm. More precisely, consider the LP

$$\begin{aligned} \max _{{\textsf{s}}\in B(F)} {\textsf{w}} ^{\intercal }{\textsf{s}}\,. \end{aligned}$$

Then if the objective vecor ${\textsf{w}}$ has a negative component, the problem is clearly unbounded. Otherwise, a solution to the LP can be obtained by ordering ${\textsf{w}}$ in decreasing manner as $w_{j_1} \ge w_{j_2} \ge ... \ge w_{j_n}$, and setting

$$\begin{aligned} s_{j_k}:= r(\{j_{1},..., j_{k} \}) - r(\{j_1,..., j_{k - 1}\}) \,, \end{aligned}$$

(11)

for $k \in [{1}\! : \! {n}]$. We thus have a LMO with a $\mathcal {O}(n\log n)$ cost. This is the reason why FW variants are widely used in the context of submodular optimization; further details can be found in, e.g., (Bach 2013; Jaggi 2013).

3.3 LASSO problem

The LASSO, proposed by Tibshirani (1996), is a popular tool for sparse linear regression. Given the training set

$$\begin{aligned} T=\{({\textsf{r}}_i,b_i) \in \mathbb {R}^n\times \mathbb {R}: i\in [{1}\! : \! {m}]\}\,, \end{aligned}$$

where ${\textsf{r}}_i ^{\intercal }$ are the rows of an $m\times n$ matrix ${\textsf{A}}$, the goal is finding a sparse linear model (i.e., a model with a small number of non-zero parameters) describing the data. This problem is strictly connected with the Basis Pursuit Denoising (BPD) problem in signal analysis (see, e.g., Chen et al. (2001)). In this case, given a discrete-time input signal b, and a dictionary

$$\begin{aligned} \{{\textsf{a}}_j\in \mathbb {R}^m \, \ j\in [{1}\! : \! {n}] \} \end{aligned}$$

of elementary discrete-time signals, usually called atoms (here ${\textsf{a}}_j$ are the columns of a matrix ${\textsf{A}}$), the goal is finding a sparse linear combination of the atoms that approximate the real signal. From a purely formal point of view, LASSO and BPD problems are equivalent, and both can be formulated as follows:

$$\begin{aligned} \begin{array}{ll} \displaystyle {\min _{{\textsf{x}}\in \mathbb {R}^n}}& f({\textsf{x}}):=\Vert {\textsf{A}}{\textsf{x}}-{\textsf{b}}\Vert _2^2\\ s.t. & \Vert {\textsf{x}}\Vert _1\le \tau \,, \end{array} \end{aligned}$$

(12)

where the parameter $\tau $ controls the amount of shrinkage that is applied to the model (related to sparsity, i.e., the number of nonzero components in ${\textsf{x}}$). The feasible set is

$$\begin{aligned} C=\{{\textsf{x}}\in \mathbb {R}^n: \Vert {\textsf{x}}\Vert _1\le \tau \}=\mathop {\textrm{conv}}\limits \{\pm \tau {\textsf{e}}_i: \ i\in [{1}\! : \! {n}] \}\,. \end{aligned}$$

Thus we have the following LMO in this case:

$$\begin{aligned} \text {LMO}_C(\nabla f({\textsf{x}}_k))=\mathop {\textrm{sign}}\limits (-\nabla _{i_k} f({\textsf{x}}_k))\cdot \tau {\textsf{e}}_{i_k}\,, \end{aligned}$$

with $i_k \in \displaystyle \mathop {\mathrm {arg\,max}}\limits _{i} |\nabla _i f({\textsf{x}}_k)|$. It is easy to see that the FW per-iteration cost is then $\mathcal {O}(n)$. The peculiar structure of the problem makes FW variants well-suited for its solution. This is the reason why LASSO/BPD problems were considered in a number of FW-related papers (see, e.g., Jaggi (2011), Jaggi (2013), Lacoste-Julien and Jaggi (2015), Locatello et al. (2017)).

3.4 Matrix completion

Matrix completion is a widely studied problem that comes up in many areas of science and engineering, including collaborative filtering, machine learning, control, remote sensing, and computer vision (just to name a few; see also Candès and Recht (2009) and references therein). The goal is to retrieve a low rank matrix ${\textsf{X}}\in \mathbb {R}^{n_1 \times n_2}$ from a sparse set of observed matrix entries $\{U_{ij}\}_{(i,j) \in J}$ with $J \subset [{1}\! : \! {n_1}] \times [{1}\! : \! {n_2}]$. Thus the problem can be formulated as follows Freund et al. (2017):

$$\begin{aligned} \begin{array}{ll} \displaystyle {\min _{{\textsf{X}}\in \mathbb {R}^{n_1 \times n_2}}}& f({\textsf{X}}):= \displaystyle \sum _{(i,j)\in J} (X_{ij} - U_{ij})^2\\ \quad s.t. & \mathop {\textrm{rank}}\limits ({\textsf{X}})\le \delta , \end{array} \end{aligned}$$

(13)

where the function f is given by the squared loss over the observed entries of the matrix and $\delta >0$ is a parameter representing the assumed belief about the rank of the reconstructed matrix we want to get in the end. In practice, the low rank constraint is relaxed with a nuclear norm ball constraint, where we recall that the nuclear norm ${\Vert {\textsf{X}} \Vert }_*$ of a matrix ${\textsf{X}}$ is equal the sum of its singular values. Thus we get the following convex optimization problem:

$$\begin{aligned} \begin{array}{ll} \displaystyle {\min _{{\textsf{X}}\in \mathbb {R}^{n_1 \times n_2}}}& \displaystyle \sum _{(i,j)\in J}(X_{ij} - U_{ij})^2\\ \quad s.t. & {\Vert {\textsf{X}} \Vert }_*\le \delta \,. \end{array} \end{aligned}$$

(14)

The feasible set is the convex hull of rank-one matrices:

$$\begin{aligned} \begin{array}{rcl} C & = & \{{\textsf{X}}\in \mathbb {R}^{n_1 \times n_2}: {\Vert {\textsf{X}}\Vert }_*\le \delta \}\\[0.3em] & =& \mathop {\textrm{conv}}\limits \{\delta {\textsf{u}}{\textsf{v}} ^{\intercal }:{\textsf{u}}\in \mathbb {R}^{n_1},{\textsf{v}}\in \mathbb {R}^{n_2},\ \Vert {\textsf{u}}\Vert = \Vert {\textsf{v}}\Vert =1 \} \,.\end{array}\end{aligned}$$

If we indicate with ${\textsf{A}}_J$ the matrix that coincides with ${\textsf{A}}$ on the indices J and is zero otherwise, then we can write $\nabla f({\textsf{X}})={2}\,({\textsf{X}}-{\textsf{U}})_J$. Thus we have the following LMO in this case:

$$\begin{aligned} \text {LMO}_C(\nabla f({\textsf{X}}_k)) \in \mathop {\mathrm {arg\,min}}\limits \{{{\,\textrm{tr}\,}}(\nabla f({\textsf{X}}_k) ^{\intercal }{\textsf{X}}): {\Vert {\textsf{X}} \Vert }_* \le \delta \}\,, \end{aligned}$$

(15)

which boils down to computing the gradient, and the rank-one matrix $\delta {\textsf{u}}_1 {\textsf{v}}_1 ^{\intercal }$, with ${\textsf{u}}_1, {\textsf{v}}_1$ right and left singular vectors corresponding to the top singular value of $-\nabla f({\textsf{X}}_k)$. Consequently, the FW method at a given iteration approximately reconstructs the target matrix as a sparse combination of rank-1 matrices. Furthermore, as the gradient matrix is sparse (it only has |J| non-zero entries) storage and approximate singular vector computations can be performed much more efficiently than for dense matrices.^{Footnote 1} A number of FW variants has hence been proposed in the literature for solving this problem (see, e.g., Freund et al. (2017), Jaggi (2011), Jaggi (2013)).

3.5 Adversarial attacks in machine learning

Adversarial examples are maliciously perturbed inputs designed to mislead a properly trained learning machine at test time. An adversarial attack hence consists in taking a correctly classified data point $x_0$ and slightly modifying it to create a new data point that leads the considered model to misclassification (see, e.g., Carlini and Wagner (2017), Chen et al. (2017), Goodfellow et al. (2014) for further details). A possible formulation of the problem (see, e.g., Chen et al. (2020), Goodfellow et al. (2014)) is given by the so called maximum allowable $\ell _p$-norm attack that is,

$$\begin{aligned} \begin{aligned}&\min _{{\textsf{x}}\in \mathbb {R}^n} \, f({\textsf{x}}_0+{\textsf{x}}) \\&s.t. \quad {\Vert {\textsf{x}}\Vert }_p \le \varepsilon \,, \end{aligned} \end{aligned}$$

(16)

where f is a suitably chosen attack loss function, ${\textsf{x}}_0$ is a correctly classified data point, ${\textsf{x}}$ represents the additive noise/perturbation, $\varepsilon > 0$ denotes the magnitude of the attack, and $p\ge 1$. It is easy to see that the LMO has a cost $\mathcal {O}(n)$. If ${\textsf{x}}_0$ is a feature vector of a dog image correctly classified by our learning machine, our adversarial attack hence suitably perturbs the feature vector (using the noise vector ${\textsf{x}}$), thus getting a new feature vector ${\textsf{x}}_0+{\textsf{x}}$ classified, e.g., as a cat. In case a target adversarial class is specified by the attacker, we have a targeted attack. In some scenarios, the goal may not be to push ${\textsf{x}}_0$ to a specific target class, but rather push it away from its original class. In this case we have a so called untargeted attack. The attack function f will hence be chosen depending on the kind of attack we aim to perform over the considered model. Due to its specific structure, problem (16) can be nicely handled by means of tailored FW variants. Some FW frameworks for adversarial attacks were recently described in, e.g., (Chen et al. 2020; Kazemi et al. 2021; Sahu and Kar 2020).

3.6 Minimum enclosing ball

Given a set of points $P = \{{\textsf{p}}_1,\ldots , {\textsf{p}}_n\}\subset \mathbb {R}^d$, the minimum enclosing ball problem (MEB, see, e.g., Clarkson 2010; Yıldırım 2008) consists in finding the smallest ball containing P. Such a problem models numerous important applications in clustering, nearest neighbor search, data classification, machine learning, facility location, collision detection, and computer graphics, to name just a few. We refer the reader to Kumar et al. (2003) and the references therein for further details. Denoting by ${\textsf{c}}\in \mathbb {R}^d$ the center and by $\sqrt{\gamma }$ (with $\gamma \ge 0$) the radius of the ball, a convex quadratic formulation for this problem is

$$\begin{aligned}&\min _{({\textsf{c}},\gamma ) \in \mathbb {R}^d\times \mathbb {R}} & \, \gamma \end{aligned}$$

(17)

$$\begin{aligned}&\quad \ s.t. & \, \Vert {\textsf{p}}_i - {\textsf{c}} \Vert ^2 \le \gamma \, , \; \text { all } i \in [{1}\! : \! {n}]\, . \end{aligned}$$

(18)

This problem can be formulated via Lagrangian duality as a convex Standard Quadratic Optimization Problem (StQP, see, e.g. Bomze and de Klerk (2002))

$$\begin{aligned} { \min \left\{ {\textsf{x}}^{\intercal }{\textsf{A}}^{\intercal }{\textsf{A}}{\textsf{x}}- {\textsf{b}} ^{\intercal }{\textsf{x}}: {\textsf{x}}\in \Delta _{n - 1} \right\} } \end{aligned}$$

(19)

with ${\textsf{A}}= [{\textsf{p}}_1,..., {\textsf{p}}_n]$ and ${\textsf{b}} ^{\intercal }= [{\textsf{p}}_1^{\intercal }{\textsf{p}}_1, \ldots , {\textsf{p}}_n^{\intercal }{\textsf{p}}_n]$. The feasible set is the standard simplex

$$\begin{aligned} \Delta _{n - 1}:=\{{\textsf{x}}\in \mathbb {R}^n_+: {\textsf{e}}^\intercal {\textsf{x}}=1\}=\mathop {\textrm{conv}}\limits \{{\textsf{e}}_i: i\in [{1}\! : \! {n}] \}\,, \end{aligned}$$

and the LMO is defined as follows:

$$\begin{aligned} \text {LMO}_{\Delta _{n - 1}}(\nabla f({\textsf{x}}_k))={\textsf{e}}_{i_k}, \end{aligned}$$

with $i_k \in \mathop {\mathrm {arg\,min}}\limits _{i} \nabla _i f({\textsf{x}}_k)$. It is easy to see that cost per iteration is $\mathcal {O}(n)$. When applied to (19), the FW method can find an $\varepsilon $-cluster in ${{\mathcal {O}}}(\frac{1}{\varepsilon })$, where an $\varepsilon $-cluster is a subset $P'$ of P such that the MEB of $P'$ dilated by $1 + \varepsilon $ contains P (Clarkson, 2010). The set $P'$ is given by the atoms in P selected by the LMO in the first ${{\mathcal {O}}}(\frac{1}{\varepsilon })$ iterations. Further details related to the connections between FW methods and MEB problems can be found in, e.g., (Ahipaşaoğlu et al., 2008; Ahipaşaoğlu & Todd, 2013; Clarkson, 2010) and references therein.

3.7 Training linear Support Vector Machines

Support Vector Machines (SVMs) represent a very important class of machine learning tools (see, e.g., Vapnik (2013) for further details). Given a labeled set of data points, usually called training set:

$$\begin{aligned} TS=\{({\textsf{p}}_i, y_i),\ {\textsf{p}}_i \in \mathbb {R}^d,\ y_i \in \{-1, 1\},\ i=1,\dots ,n \}, \end{aligned}$$

the linear SVM training problem consists in finding a linear classifier ${\textsf{w}}\in \mathbb {R}^d$ such that the label $y_i$ can be deduced with the "highest possible confidence" from ${\textsf{w}}^{\intercal }{\textsf{p}}_i$. A convex quadratic formulation for this problem is the following (Clarkson, 2010):

$$\begin{aligned} \begin{array}{cl} \displaystyle {\min _{{\textsf{w}}\in \mathbb {R}^d, \rho \in \mathbb {R}}}& \rho + \frac{\Vert {\textsf{w}} \Vert ^2}{2}\\ s.t. & \rho + y_i\, {\textsf{w}}^\intercal {\textsf{p}}_i \ge 0\,, \quad \text {all }i\in [{1}\! : \! {n}]\,, \end{array} \end{aligned}$$

(20)

where the slack variable $\rho $ stands for the negative margin and we can have $\rho < 0$ if and only if there exists an exact linear classifier, i.e. ${\textsf{w}}$ such that ${\textsf{w}}^\intercal {\textsf{p}}_i = \mathop {\textrm{sign}}\limits (y_i)$. The dual of (20) is again an StQP:

$$\begin{aligned} { \min \left\{ {\textsf{x}}^{\intercal }{\textsf{A}}^{\intercal }{\textsf{A}}{\textsf{x}}: {\textsf{x}}\in \Delta _{n - 1} \right\} } \end{aligned}$$

(21)

with ${\textsf{A}}= [y_1{\textsf{p}}_1,..., y_n{\textsf{p}}_n]$. Notice that problem (21) is equivalent to an MNP problem on $ \mathop {\textrm{conv}}\limits \{ y_i{\textsf{p}}_i: i\in [{1}\! : \! {n}]\}$, see Sect. 8.3 below. Some FW variants (like, e.g., the Pairwise Frank–Wolfe) are closely related to classical working set algorithms, such as the SMO algorithm used to train SVMs (Lacoste-Julien & Jaggi, 2015). Further details on FW methods for SVM training problems can be found in, e.g., Clarkson (2010), Jaggi (2011).

3.8 Finding maximal cliques in graphs

In the context of network analysis the clique model, dating back at least to the work of Luce and Perry (1949) about social networks, refers to subsets with every two elements in a direct relationship. The problem of finding maximal cliques has numerous applications in domains including telecommunication networks, biochemistry, financial networks, and scheduling (see, e.g., Bomze et al. (1999), Wu and Hao (2015)). Let $G=(V,E)$ be a simple undirected graph with V and E set of vertices and edges, respectively. A clique in G is a subset $C\subseteq V$ such that $(i,j)\in E$ for each $(i,j)\in C$, with $i\ne j$. The goal in finding a clique C such that |C| is maximal (i.e., it is not contained in any strictly larger clique). This corresponds to find a local minimum for the following equivalent (this time non-convex) StQP (see, e.g., Bomze (1997), Bomze et al. (1999), Hungerford and Rinaldi (2019) for further details):

$$\begin{aligned} \max \left\{ {\textsf{x}}^{\intercal }{\textsf{A}}_G {\textsf{x}}+\frac{1}{2}\Vert {\textsf{x}}\Vert ^2: {\textsf{x}}\in \Delta _{n - 1} \right\} \end{aligned}$$

(22)

where ${\textsf{A}}_G$ is the adjacency matrix of G. Due to the peculiar structure of the problem, FW methods can be fruitfully used to find maximal cliques (see, e.g., Hungerford and Rinaldi (2019)).

3.9 Finding sparse points in a set

Given a non-empty polyhedron $P\subset \mathbb {R}^n$, the goal is finding a sparse point ${\textsf{x}}\in P$ (i.e., a point with as many zero components as possible). This sparse optimization problem can be used to model a number of real-world applications in fields like, e.g., machine learning, pattern recognition and signal processing (see Rinaldi et al. (2010) and references therein). Ideally, what we would like to get is an optimal solution for the following problem:

$$\begin{aligned} \min \left\{ \Vert {\textsf{x}}\Vert _0: {\textsf{x}}\in P\right\} . \end{aligned}$$

(23)

Since the zero norm is non-smooth, a standard procedure is to replace the original formulation (23) with an equivalent concave optimization problem of the form:

$$\begin{aligned} \min \left\{ \displaystyle \sum _{i=1}^n \phi (y_i): {\textsf{x}}\in P,\, \, -{\textsf{y}}\le {\textsf{x}}\le {\textsf{y}}\right\} , \end{aligned}$$

(24)

where $\phi :\left[ 0\right. \!, +\infty \left[ \right. \rightarrow \mathbb {R}$ is a suitably chosen smooth concave univariate function bounded from below, like, e.g.,

$$\begin{aligned} \phi (t)=(1-e^{-\alpha t})\,, \end{aligned}$$

with $\alpha $ a large enough positive parameter (see, e.g., Mangasarian (1996), Rinaldi et al. (2010) for further details). The LMO in this case gives a vertex solution for the linear programming problem:

$$\begin{aligned} \min \left\{ \displaystyle {{\textsf{c}}}_k^\intercal {\textsf{y}}: {\textsf{x}}\in P,\, \, -{\textsf{y}}\le {\textsf{x}}\le {\textsf{y}}\right\} , \end{aligned}$$

with $({{\textsf{c}}}_k)_i$ the first-order derivative of $\phi $ calculated in $({{\textsf{y}}}_k)_i$. Variants of the unit-stepsize FW method have been proposed in the literature (see, e.g., Mangasarian (1996), Rinaldi et al. (2010)) to tackle the smooth equivalent formulation (24).

4 Stepsizes

Popular rules for determining the stepsize are:

unit stepsize:
$$\begin{aligned} \alpha _k=1, \end{aligned}$$
mainly used when the problem has a concave objective function. Finite convergence can be proved, under suitable assumptions, both for the unit-stepsize FW and some of its variants described in the literature (see, e.g., Rinaldi et al. (2010) for further details).
diminishing stepsize:
$$\begin{aligned} \alpha _k = \frac{2}{ k + 2} \,, \end{aligned}$$
(25)
mainly used for the classic FW (see, e.g., Freund and Grigas (2016), Jaggi (2013)).
exact line search:
$$\begin{aligned} \alpha _k = \min {\mathop {\mathrm {arg\,min}}\limits _{\alpha \in [0, \alpha _k^{\max }]} \varphi (\alpha )} \quad {\text{ with } \varphi (\alpha ):=f({\textsf{x}}_k + \alpha \, {\textsf{d}}_k) }\,, \end{aligned}$$
(26)
where we pick the smallest minimizer of the function $\varphi $ for the sake of being well-defined even in rare cases of ties (see, e.g., Bomze et al. (2020), Lacoste-Julien and Jaggi (2015)).
Armijo line search: the method iteratively shrinks the step size in order to guarantee a sufficient reduction of the objective function. It represents a good way to replace exact line search in cases when it gets too costly. In practice, we fix parameters $\delta \in (0,1)$ and $\gamma \in (0,\frac{1}{2})$, then try steps $\alpha =\delta ^{m}\alpha _k^{\max }$ with $m\in \{ 0,1,2,\dots \}$ until the sufficient decrease inequality
$$\begin{aligned} f({\textsf{x}}_k+\alpha \,{\textsf{d}}_k)\le f({\textsf{x}}_k)+\gamma \alpha \, \nabla f({\textsf{x}}_k)^{\intercal } {\textsf{d}}_k \end{aligned}$$
(27)
holds, and set $\alpha _k=\alpha $ (see, e.g., Bomze et al. (2019) and references therein).
Lipschitz constant dependent step size:
$$\begin{aligned} \alpha _k = \alpha _k(L):= \min \left\{ -\, \frac{ \nabla f({\textsf{x}}_k) ^{\intercal }{\textsf{d}}_k}{L\Vert {\textsf{d}}_k \Vert ^2}, \alpha _k^{max} \right\} \,, \end{aligned}$$
(28)
with L the Lipschitz constant of $\nabla f$ (see, e.g., Bomze et al. (2020), Pedregosa et al. (2020)).

The Lipschitz constant dependent step size can be seen as the minimizer of the quadratic model $m_k(\cdot ; L)$ overestimating f along the line ${\textsf{x}}_k+\alpha \, {\textsf{d}}_k$:

$$\begin{aligned} m_k( \alpha ; L) = f({\textsf{x}}_k) + \alpha \, \nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k + \frac{L\alpha ^2}{2}\, \Vert {\textsf{d}}_k \Vert ^2 \ge f({\textsf{x}}_k + \alpha \, {\textsf{d}}_k)\,, \end{aligned}$$

(29)

where the inequality follows by the standard Descent Lemma.

In case L is unknown, it is even possible to approximate L using a backtracking line search (see, e.g., Kerdreux et al. (2021), Pedregosa et al. (2020)).

We now report a lower bound for the improvement on the objective obtained with the stepsize (28), often used in the convergence analysis.

Lemma 1

If $\alpha _k$ is given by (28) and $\alpha _k < \alpha _k^{\max }$ then

$$\begin{aligned} f({\textsf{x}}_{k + 1}) \le f({\textsf{x}}_k) - \frac{1}{2L}(\nabla f({\textsf{x}}_k)^\intercal \widehat{{\textsf{d}}}_k)^2 \,. \end{aligned}$$

(30)

Proof

We have

$$\begin{aligned} \begin{array}{rcl} f({\textsf{x}}_k + \alpha _k \, {\textsf{d}}_k) & \le & f({\textsf{x}}_k) + \alpha _k \nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k + \frac{L\alpha _k^2}{2}\, \Vert {\textsf{d}}_k \Vert ^2 \\[0.3em] & = & f({\textsf{x}}_k) - \frac{(\nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k)^2}{2L\Vert {\textsf{d}}_k \Vert ^2} = f({\textsf{x}}_k) - \frac{1}{2L}(\nabla f({\textsf{x}}_k)^\intercal \widehat{{\textsf{d}}_k})^2 \,, \end{array} \end{aligned}$$

(31)

where we used the standard Descent Lemma in the inequality. $\square $

5 Properties of the FW method and its variants

5.1 The FW gap

A key parameter often used as a measure of convergence is the FW gap

$$\begin{aligned} G({\textsf{x}}) = \max _{{\textsf{s}}\in C} - \nabla f({\textsf{x}})^\intercal ({\textsf{s}}- {\textsf{x}}) \,, \end{aligned}$$

(32)

which is always nonnegative and equal to 0 only in first order stationary points. This gap is, by definition, readily available during the algorithm. If f is convex, using that $\nabla f({\textsf{x}})$ is a subgradient we obtain

$$\begin{aligned} G({\textsf{x}}) \ge - \nabla f({\textsf{x}})^\intercal ({\textsf{x}}^* - {\textsf{x}}) \ge f({\textsf{x}}) - f^*\,, \end{aligned}$$

(33)

so that $G({\textsf{x}})$ is an upper bound on the optimality gap at ${\textsf{x}}$. Furthermore, $G({\textsf{x}})$ is a special case of the Fenchel duality gap (Lacoste-Julien et al., 2013).

If $C=\Delta _{n - 1}$ is the simplex, then G is related to the Wolfe dual as defined in Clarkson (2010). Indeed, this variant of Wolfe’s dual reads

$$\begin{aligned} \begin{aligned} \max \ &f({\textsf{x}}) + \lambda ({\textsf{e}}^{\intercal }{\textsf{x}}-1)- {\textsf{u}}^\intercal {\textsf{x}} \\ \text {s.t.}\;\;&\nabla _i f({\textsf{x}}) - u_i + {\lambda } = 0 \,,\quad i \in [{1}\! : \! {n}] \,, \\&({\textsf{x}}, {\textsf{u}}, {\lambda }) \in \mathbb {R}^n \times \mathbb {R}^n_+ \times \mathbb {R}\ \end{aligned} \end{aligned}$$

(34)

and for a fixed ${\textsf{x}}\in \mathbb {R}^n$, the optimal values of $({\textsf{u}}, {\lambda })$ are

$$\begin{aligned} {\lambda }_{\textsf{x}}= - \min _{j} \nabla _j f({\textsf{x}})\,, \, \ \ u_i({\textsf{x}}):= \nabla _i f({\textsf{x}}) - \min _j \nabla _j f({\textsf{x}}) \ge 0 \,. \end{aligned}$$

Performing maximization in problem (34) iteratively, first for $({\textsf{u}},\lambda )$ and then for ${\textsf{x}}$, this implies that (34) is equivalent to

$$\begin{aligned} \begin{array}{rcl} & & \max _{{\textsf{x}}\in \mathbb {R}^n}\left[ f({\textsf{x}}) + \lambda _{\textsf{x}}({\textsf{e}} ^{\intercal }{\textsf{x}}-1) - {\textsf{u}}({\textsf{x}}) ^{\intercal }{\textsf{x}}\right] \\[0.3em] & = & \max _{{\textsf{x}}\in \mathbb {R}^n}\left[ f({\textsf{x}})- \max _j ({\textsf{e}}_j-{\textsf{x}}) ^{\intercal }\nabla f({\textsf{x}}) \right] = \max _{{\textsf{x}}\in \mathbb {R}^n}\left[ f({\textsf{x}}) - G({\textsf{x}})\right] \,.\end{array} \end{aligned}$$

(35)

Furthermore, since Slater’s condition is satisfied, strong duality holds by Slater’s theorem (Boyd & Vandenberghe, 2004), resulting in $ G({\textsf{x}}^*) = 0$ for every solution ${\textsf{x}}^*$ of the primal problem.

The FW gap is related to several other measures of convergence (see e.g. Lan 2020, Section 7.5.1). First, consider the projected gradient

$$\begin{aligned} \widetilde{{\textsf{g}}}_k:= \pi _C({\textsf{x}}_k - \nabla f({\textsf{x}}_k)) - {\textsf{x}}_k \,. \end{aligned}$$

(36)

with $\pi _{B}$ the projection on a convex and closed subset $B\subseteq \mathbb {R}^n$. We have $\Vert \widetilde{{\textsf{g}}}_k \Vert = 0$ if and only if ${\textsf{x}}_k$ is stationary, with

$$\begin{aligned} \begin{array}{rcl} \Vert \widetilde{{\textsf{g}}}_k \Vert ^2 & = & \widetilde{{\textsf{g}}}_k^\intercal \widetilde{{\textsf{g}}}_k \,\le \, \widetilde{{\textsf{g}}}_k^\intercal [({\textsf{x}}_k - \nabla f({\textsf{x}}_k) ) - \pi _C({\textsf{x}}_k - \nabla f({\textsf{x}}_k))] + \widetilde{{\textsf{g}}}_k^\intercal \widetilde{{\textsf{g}}}_k \\[0.4em] & = & -\widetilde{{\textsf{g}}}_k^\intercal \nabla f({\textsf{x}}_k) \, = \, -(\pi _C({\textsf{x}}_k - \nabla f({\textsf{x}}_k)) - {\textsf{x}}_k) ^\intercal \nabla f({\textsf{x}}_k) \\[0.4em] & \le & \max \limits _{{\textsf{y}}\in C} - ({\textsf{y}}- {\textsf{x}}_k)^\intercal \nabla f({\textsf{x}}_k) \,=\, G({\textsf{x}}_k) \,, \end{array} \end{aligned}$$

(37)

where we used $[{\textsf{y}}- \pi _{C}({\textsf{x}})]^\intercal [{\textsf{x}}- \pi _{C}({\textsf{x}})] \le 0$ in the first inequality, with ${\textsf{x}}= {\textsf{x}}_k - \nabla f({\textsf{x}}_k)$ and ${\textsf{y}}= {\textsf{x}}_k$.

Let now $N_{C}(x)$ denote the normal cone to $C$ at a point ${\textsf{x}}\in C$:

$$\begin{aligned} N_{C}({\textsf{x}}):= \{{\textsf{r}}\in \mathbb {R}^n: {\textsf{r}}^\intercal ({\textsf{y}}- {\textsf{x}}) \le 0 \; \text{ for } \text{ all } {\textsf{y}}\in C\} \,. \end{aligned}$$

(38)

First-order stationarity conditions are equivalent to $ - \nabla f({\textsf{x}}) \in N_{C}({\textsf{x}})$, or

$$\begin{aligned} {{\,\textrm{dist}\,}}(N_{C}({\textsf{x}}), - \nabla f({\textsf{x}})) = \Vert - \nabla f({\textsf{x}}) - \pi _{N_{C}({\textsf{x}})}(-\nabla f({\textsf{x}})) \Vert = 0 \,. \end{aligned}$$

The FW gap provides a lower bound on the distance from the normal cone ${{\,\textrm{dist}\,}}(N_{C}({\textsf{x}}), - \nabla f({\textsf{x}}))$, inflated by the diameter $D>0$ of $C$, as follows:

$$\begin{aligned} \begin{array}{rcl} G({\textsf{x}}_k) & = & -({\textsf{s}}_k - {\textsf{x}}_k)^\intercal \nabla f({\textsf{x}}_k) \\[0.4em] & = & ({\textsf{s}}_k - {\textsf{x}}_k)^\intercal [\pi _{N_{C}({\textsf{x}}_k)}(-\nabla f({\textsf{x}}_k)) - (\pi _{N_{C}({\textsf{x}}_k)}(-\nabla f({\textsf{x}}_k)) + \nabla f({\textsf{x}}_k))] \\[0.4em] & \le & \Vert {\textsf{s}}_k - {\textsf{x}}_k \Vert \, \Vert \pi _{N_{C}({\textsf{x}}_k)}(-\nabla f({\textsf{x}}_k)) + \nabla f({\textsf{x}}_k) \Vert \\[0.4em] & \le & D\, {{\,\textrm{dist}\,}}(N_{C}({\textsf{x}}_k), -\nabla f({\textsf{x}}_k)) \,, \end{array} \end{aligned}$$

(39)

where in the first inequality we used $({\textsf{s}}_k - {\textsf{x}}_k)^\intercal [\pi _{N_{C}({\textsf{x}}_k)}(-\nabla f({\textsf{x}}_k))] \le 0$ together with the Cauchy-Schwarz inequality, and $\Vert {\textsf{s}}_k - {\textsf{x}}_k \Vert \le D$ in the second.

5.2 ${{\mathcal {O}}}(1/k)$ rate for convex objectives

If f is non-convex, it is possible to prove a ${{\mathcal {O}}}(1/\sqrt{k})$ rate for $\min _{i \in [1:k]} G(x_i)$ (see, e.g., Lacoste-Julien 2016). On the other hand, if f is convex, we have an ${{\mathcal {O}}}(1/k)$ rate on the optimality gap (see, e.g., Frank and Wolfe (1956), Levitin and Polyak (1966)) for all the stepsizes discussed in Sect. 4. Here we include a proof for the Lipschitz constant dependent stepsize $\alpha _k$ given by (28).

Theorem 1

If f is convex and the stepsize is given by (28), then for every $k \ge 1$

$$\begin{aligned} f({\textsf{x}}_k) - f^* \le \frac{2LD^2}{k + 2} \,. \end{aligned}$$

(40)

Before proving the theorem we prove a lemma concerning the decrease of the objective in the case of a full FW step, that is a step with ${\textsf{d}}_k = {\textsf{d}}_k^{FW}$ and with $\alpha _k$ equal to 1, the maximal feasible stepsize.

Lemma 2

If $\alpha _k = 1$ and ${\textsf{d}}_k = {\textsf{d}}_k^{FW}$ then

$$\begin{aligned} f({\textsf{x}}_{k + 1}) - f^* \le \frac{1}{2} \, \min \left\{ L\Vert {\textsf{d}}_k \Vert ^2, f({\textsf{x}}_k) - f^* \right\} \,. \end{aligned}$$

(41)

Proof

If $\alpha _k = 1 = \alpha _k^{\max }$ then by Definitions (3) and (32)

$$\begin{aligned} G({\textsf{x}}_k) = -\nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k \ge L \Vert {\textsf{d}}_k \Vert ^2 \,, \end{aligned}$$

(42)

the last inequality following by Definition (28) and the assumption that $\alpha _k = 1$. By the standard Descent Lemma it also follows

$$\begin{aligned} f({\textsf{x}}_{k + 1}) - f^* = f({\textsf{x}}_k + {\textsf{d}}_k) - f^* \le f({\textsf{x}}_k) - f^* + \nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k + \frac{L}{2}\,\Vert {\textsf{d}}_k \Vert ^2 \,. \end{aligned}$$

(43)

Considering the definition of ${\textsf{d}}_k$ and convexity of f, we get

$$\begin{aligned} f({\textsf{x}}_k) - f^* + \nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k \le f({\textsf{x}}_k) - f^* + \nabla f({\textsf{x}}_k)^\intercal ({\textsf{x}}^*-{\textsf{x}}_k) \le 0\,, \end{aligned}$$

so that (43) entails $ f({\textsf{x}}_{k + 1}) - f^* \le \frac{L}{2}\,\Vert {\textsf{d}}_k \Vert ^2 $. To conclude, it suffices to apply to the RHS of (43) the inequality

$$\begin{aligned} f({\textsf{x}}_k) - f^* + \nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k + {\textstyle {\frac{L}{2}}}\,\Vert {\textsf{d}}_k \Vert ^2 \le f({\textsf{x}}_k) - f^* -{\textstyle {\frac{1}{2}}}\,G({\textsf{x}}_k) \le {\textstyle {\frac{f({\textsf{x}}_k) - f^*}{2}}}\, \end{aligned}$$

(44)

where we used (42) in the first inequality and $G({\textsf{x}}_k) \ge f({\textsf{x}}_k) - f^*$ in the second. $\square $

We can now proceed with the proof of the main result.

Proof

(Theorem 1)

For $k = 0$ and $\alpha _0 = 1$ then by Lemma 2

$$\begin{aligned} f({\textsf{x}}_1) - f^* \le \frac{L\Vert {\textsf{d}}_0 \Vert ^2}{2} \le \frac{ L D^2}{2} \,. \end{aligned}$$

(45)

If $\alpha _0 < 1$ then

$$\begin{aligned} f({\textsf{x}}_0) - f^* \le G({\textsf{x}}_0) < L\Vert {\textsf{d}}_0 \Vert ^2 \le LD^2 \,. \end{aligned}$$

(46)

Therefore in both cases (30) holds for $k = 0$.

Reasoning by induction, if (40) holds for k with $\alpha _k = 1$, then the claim is clear by (41). On the other hand, if $\alpha _k <\alpha _k^{\max }= 1 $ then by Lemma 1, we have

$$\begin{aligned} \begin{array}{rcl} f({\textsf{x}}_{k + 1}) - f^* & \le & f({\textsf{x}}_k) - f^* - \frac{1}{2L}\, (\nabla f(x_k)^\intercal \widehat{{\textsf{d}}}_k)^2\\[0.4em] & \le & f({\textsf{x}}_k) - f^* - \frac{(\nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k)^2}{2LD^2} \\[0.4em] & \le & f({\textsf{x}}_k) - f^* - \frac{(f({\textsf{x}}_k) - f^*)^2}{2LD^2}\\[0.4em] & = & (f({\textsf{x}}_k) - f^*)(1 - \frac{f({\textsf{x}}_k) - f^*}{2LD^2}) \, \le \, \frac{2LD^2}{k + 3} \,, \end{array} \end{aligned}$$

(47)

where we used $\Vert {\textsf{d}}_k \Vert \le D$ in the second inequality, $\nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}_k = G({\textsf{x}}_k) \ge f({\textsf{x}}_k) - f^*$ in the third inequality; and the last inequality follows by induction hypothesis. $\square $

As can be easily seen from above argument, the convergence rate of ${{\mathcal {O}}}(1/k)$ is true also in more abstract normed spaces than $\mathbb {R}^n$, e.g. when $C$ is a convex and weakly compact subset of a Banach space (see, e.g., Demyanov and Rubinov (1970), Dunn and Harshbarger (1978)). A generalization for some unbounded sets is given in Ferreira and Sosa (2021). The bound is tight due to a zigzagging behaviour of the method near solutions on the boundary, leading to a rate of $\Omega (1/k^{1 + \delta })$ for every $\delta > 0$ (see Canon and Cullum (1968) for further details), when the objective is a strictly convex quadratic function and the domain is a polytope.

Also the minimum FW gap $\min _{i \in [0: k]} G({\textsf{x}}_i) $ converges at a rate of ${{\mathcal {O}}}(1/k)$ (see Jaggi (2013); Freund and Grigas (2016)). In Freund and Grigas (2016), a broad class of stepsizes is examined, including $\alpha _k= \frac{1}{k + 1}$ and $\alpha _k = \bar{\alpha }$ constant. For these stepsizes a convergence rate of ${{\mathcal {O}}}\left( \frac{\ln (k)}{k}\right) $ is proved.

5.3 Variants

Active set FW variants mostly aim to improve over the ${{\mathcal {O}}}(1/k)$ rate and also ensure support identification in finite time. They generate a sequence of active sets $\{A_k\}$, such that ${\textsf{x}}_k \in \mathop {\textrm{conv}}\limits (A_k)$, and define alternative directions making use of these active sets.

For the pairwise FW (PFW) and the away step FW (AFW) (see Clarkson (2010); Lacoste-Julien and Jaggi (2015)) we have that $A_k$ must always be a subset of $S_k$, with ${\textsf{x}}_k$ a convex combination of the elements in $A_k$. The away vertex ${\textsf{v}}_k$ is then defined by

$$\begin{aligned} {\textsf{v}}_k \in \mathop {\mathrm {arg\,max}}\limits _{{\textsf{y}}\in A_k} \nabla f({\textsf{x}}_k)^\intercal {\textsf{y}} \,. \end{aligned}$$

(48)

The AFW direction, introduced in Wolfe (1970), is hence given by

$$\begin{aligned} \begin{array}{ll} {\textsf{d}}^{AS}_k & = {\textsf{x}}_k - {\textsf{v}}_k \\ {\textsf{d}}_k & \in \mathop {\mathrm {arg\,max}}\limits \{-\nabla f({\textsf{x}}_k)^\intercal {\textsf{d}}: {\textsf{d}}\in \{{\textsf{d}}_k^{AS}, {\textsf{d}}_k^{FW}\} \} \,, \end{array} \end{aligned}$$

(49)

while the PFW direction, as defined in Lacoste-Julien and Jaggi (2015) and inspired by the early work (Mitchell et al., 1974), is

$$\begin{aligned} {\textsf{d}}^{PFW}_k ={\textsf{d}}_k^{FW}+{\textsf{d}}^{AS}_k= {\textsf{s}}_k - {\textsf{v}}_k \,, \end{aligned}$$

(50)

with ${\textsf{s}}_k$ defined in (3).

The FW method with in-face directions (FDFW) (see Freund et al. (2017), Guelat and Marcotte (1986)), also known as Decomposition invariant Conditional Gradient (DiCG) when applied to polytopes (Bashiri & Zhang, 2017), is defined exactly as the AFW, but with the minimal face $\mathcal {F}({\textsf{x}}_k)$ of $C$ containing ${\textsf{x}}_k$ as the active set. The extended FW (EFW) was introduced in Holloway (1974) and is also known as simplicial decomposition (Von Hohenbalken, 1977). At every iteration the method minimizes the objective in the current active set $A_{k + 1}$

$$\begin{aligned} {\textsf{x}}_{k + 1} \in \mathop {\mathrm {arg\,min}}\limits _{{\textsf{y}}\in \mathop {\textrm{conv}}\limits (A_{k + 1})} f({\textsf{y}})\,, \end{aligned}$$

(51)

where $A_{k + 1} \subseteq A_k \cup \{s_k\}$ (see, e.g., Clarkson 2010, Algorithm 4.2). A more general version of the EFW, only approximately minimizing on the current active set, was introduced in Lacoste-Julien and Jaggi (2015) under the name of fully corrective FW. In Table 1, we report the main features of the classic FW and of the variants under analysis.

Table 1 FW method and variants covered in this review

Full size table

5.4 Sparse approximation properties

As discussed in the previous section, for the classic FW method and the AFW, PFW, EFW variants ${\textsf{x}}_k$ can always be written as a convex combination of elements in $A_k \subset S_k$, with $|A_k| \le k + 1$. Even for the FDFW we still have the weaker property that ${\textsf{x}}_k$ must be an affine combination of elements in $A_k \subset A$ with $ |A_k| \le k + 1$. It turns out that the convergence rate of methods with this property is $\Omega (\frac{1}{k})$ in high dimension. More precisely, if $C= \mathop {\textrm{conv}}\limits (A)$ with A compact, the ${{\mathcal {O}}}(1/k)$ rate of the classic FW method is worst case optimal given the sparsity constraint

$$\begin{aligned} x_k \in {{\,\textrm{aff}\,}}(A_k) \ \text {with }A_k \subset A, \ |A_k| \le k + 1 \,. \end{aligned}$$

(52)

An example where the ${{\mathcal {O}}}(1/k)$ rate is tight was presented in Jaggi (2013). Let $C= \Delta _{n - 1}$ and $f({\textsf{x}}) = \Vert {\textsf{x}}- \frac{1}{n}\, {\textsf{e}} \Vert ^2$. Clearly, $f^* = 0$ with ${\textsf{x}}^* = \frac{1}{n}\, {\textsf{e}}$. Then it is easy to see that $\min \{f({\textsf{x}}) - f^*: {\Vert {\textsf{x}}\Vert }_0 \le k + 1 \} \ge \frac{1}{k + 1} - \frac{1}{n}$ for every $k \in \mathbb {N}$, so that in particular under (52) with $A_k = \{e_i: i \in [{1}\! : \! {n}]\}$, the rate of any FW variant must be $\Omega (\frac{1}{k})$.

5.5 Affine invariance

The FW method and the AFW, PFW, EFW are affine invariant (Jaggi, 2013). More precisely, let ${\textsf{P}}$ be a linear transformation, $\hat{f}$ be such that $\hat{f}({\textsf{P}}{\textsf{x}}) = f({\textsf{x}})$ and $\hat{C} = {\textsf{P}}(C)$. Then for every sequence $\{{\textsf{x}}_k\}$ generated by the methods applied to $(f, C)$, the sequence $\{{\textsf{y}}_k\}:= \{{\textsf{P}}{\textsf{x}}_k\}$ can be generated by the FW method with the same stepsizes applied to $(\hat{f}, \hat{C})$. As a corollary, considering the special case where ${\textsf{P}}$ is the matrix collecting the elements of A as columns, one can prove results on $C= \Delta _{|A| - 1}$ and generalize them to $\hat{C}:= \mathop {\textrm{conv}}\limits (A)$ by affine invariance.

An affine invariant convergence rate bound for convex objectives can be given using the curvature constant

$$\begin{aligned} \kappa _{f, C}:= \sup \left\{ 2{\textstyle { \frac{f( \alpha {\textsf{y}}+(1-\alpha ){\textsf{x}}) - f({\textsf{x}}) - \alpha \nabla f({\textsf{x}})^\intercal ({\textsf{y}}-{\textsf{x}})}{\alpha ^2}}}: \{{\textsf{x}},{\textsf{y}}\}\subset C, \, \alpha \in (0, 1] \right\} \,. \end{aligned}$$

(53)

It is easy to prove that $\kappa _{f, C} \le LD^2$ if D is the diameter of $C$. In the special case where $C= \Delta _{n - 1}$ and $f({\textsf{x}}) = {\textsf{x}}^\intercal \tilde{{\textsf{A}}}^\intercal \tilde{{\textsf{A}}} {\textsf{x}}+ {\textsf{b}} ^{\intercal }{\textsf{x}}$, then $\kappa _{f, C} \le {{\,\textrm{diam}\,}}({\textsf{A}}\Delta _{n - 1})^2$ for ${\textsf{A}} ^{\intercal }= [\tilde{{\textsf{A}}} ^{\intercal }, {\textsf{b}}]$; see (Clarkson, 2010).

When the method uses the stepsize sequence (25), it is possible to give the following affine invariant convergence rate bounds (see Freund and Grigas 2016):

$$\begin{aligned} \begin{aligned} f({\textsf{x}}_k) - f^*&\le \frac{2\kappa _{f, C}}{k + 4} \,, \\ \min _{i \in ]0:k]} G({\textsf{x}}_i)&\le \frac{9\kappa _{f, C}}{2k} \,, \end{aligned} \end{aligned}$$

(54)

thus in particular slightly improving the rate we gave in Theorem 1 since we have that $\kappa _{f, C} \le LD^2$.

5.6 Support identification for the AFW

It is a classic result that the AFW under some strict complementarity conditions and for strongly convex objectives identifies in finite time the face containing the solution (Guelat & Marcotte, 1986). Here we report some explicit bounds for this property proved in Bomze et al. (2020). We first assume that $C= \Delta _{n - 1}$, and introduce the multiplier functions

$$\begin{aligned} \lambda _i({\textsf{x}}) = \nabla f({\textsf{x}})^\intercal ({\textsf{e}}_i - {\textsf{x}}) \end{aligned}$$

(55)

for $i \in [{1}\! : \! {n}]$. Let ${\textsf{x}}^*$ be a stationary point for f, with the objective f not necessarily convex. It is easy to check that $\{\lambda _i({\textsf{x}}^*)\}_{i \in [1:n]}$ coincide with the Lagrangian multipliers. Furthermore, by complementarity conditions we have $x^*_i \lambda _i({\textsf{x}}^*) = 0$ for every $i \in [{1}\! : \! {n}]$. It follows that the set

$$\begin{aligned} I({\textsf{x}}^*):= \{i \in [{1}\! : \! {n}]: \lambda _{i}({\textsf{x}}^*) = 0\} \end{aligned}$$

contains the support of ${\textsf{x}}^*$,

$$\begin{aligned} {{\,\textrm{supp}\,}}({\textsf{x}}^*):=\{ i\in [{1}\! : \! {n}]: x_i^*>0\}\,. \end{aligned}$$

The next lemma uses $\lambda _i$, and the Lipschitz constant L of $\nabla f$, to give a lower bound of the so-called active set radius $r_*$, defining a neighborhood of ${\textsf{x}}^*$. Starting the algorithm in this neighbourhood, the active set (the minimal face of $C$ containing ${\textsf{x}}^*$) is identified in a limited number of iterations.

Lemma 3

Let ${\textsf{x}}^*$ be a stationary point for f on the boundary of $\Delta _{n - 1}$, $\delta _{\min } = \min _{i: \lambda _{i}({\textsf{x}}^*) > 0} \lambda _i({\textsf{x}}^*)$ and

$$\begin{aligned} r_* = \frac{\delta _{\min }}{\delta _{\min } + 2L} \,. \end{aligned}$$

(56)

Let. Assume that for every k for which ${\textsf{d}}_k = {\textsf{d}}_k^{\mathcal {A}}$ holds, the step size $\alpha _k$ is not smaller than the stepsize given by (28), $\alpha _k(L) \le \alpha _k$. If ${\Vert {\textsf{x}}_k - {\textsf{x}}^*\Vert }_1 < r_*$, then for some

$$\begin{aligned} j \le \min \{ n - |I({\textsf{x}}^*)|, |{{\,\textrm{supp}\,}}({\textsf{x}}_k)| - 1\} \end{aligned}$$

we have ${{\,\textrm{supp}\,}}({\textsf{x}}_{k + j}) \subseteq I({\textsf{x}}^*)$ and $\Vert {\textsf{x}}_{k + j} - {\textsf{x}}^* \Vert _1 < r_*$.

Proof

Follows from (Bomze et al. (2020), Theorem 3.3), since under the assumptions the AFW sets one variable in ${{\,\textrm{supp}\,}}(x_k)\setminus I({\textsf{x}}^*)$ to zero at every step without increasing the 1-norm distance from ${\textsf{x}}^*$. $\square $

The above lemma does not require convexity and was applied in Bomze et al. (2020) to derive active set identification bounds in several convex and non-convex settings. Here we focus on the case where the domain $C= \mathop {\textrm{conv}}\limits (A)$ with $|A| < + \infty $ is a generic polytope, and where f is $\mu $-strongly convex for some $\mu >0$, i.e.

$$\begin{aligned} f({\textsf{y}}) \ge f({\textsf{x}}) + \nabla f({\textsf{x}})^\intercal ({\textsf{y}}- {\textsf{x}}) + \frac{\mu }{2} \Vert {\textsf{x}}- {\textsf{y}} \Vert ^2\quad \text{ for } \text{ all } \{{\textsf{x}}, {\textsf{y}}\} \subset C\,. \end{aligned}$$

(57)

Let $E_{C}({\textsf{x}}^*)$ be the face of $C$ exposed by $\nabla f(x^*)$:

$$\begin{aligned} E_{C}({\textsf{x}}^*):= \mathop {\mathrm {arg\,min}}\limits _{{\textsf{x}}\in C} \nabla f({\textsf{x}}^*)^\intercal {\textsf{x}} \,, \end{aligned}$$

(58)

Let then $\theta _{A}$ be the Hoffman constant (see Beck and Shtern 2017) related to $[\bar{{\textsf{A}}} ^{\intercal }, {\textsf{I}}_n, {\textsf{e}}, -{\textsf{e}}] ^{\intercal }$, with $\bar{{\textsf{A}}}$ the matrix having as columns the elements in A. Finally, consider the function $ {f}_A({\textsf{y}}):= f(\bar{{\textsf{A}}}{\textsf{y}})$ on $ \Delta _{|A| - 1}$, and let $L_{A}$ be the Lipschitz constant of ${\nabla } {f_A}$ as well as

$$\begin{aligned} { \delta _{\min }:= \min _{{\textsf{a}}\in A \setminus E_{C}({\textsf{x}}^*)}\nabla f({\textsf{x}}^*)^\intercal ({\textsf{a}}- {\textsf{x}}^*)\quad \text{ and }\quad r_*({\textsf{x}}^*):= \frac{\delta _{\min }}{\delta _{\min } + 2L_A}\,.} \end{aligned}$$

Using linearity of AFW convergence for strongly convex objectives (see Sect. 6.1), we have the following result:

Theorem 2

The sequence $\{{\textsf{x}}_k\}$ generated by the AFW with ${\textsf{x}}_0 \in A$ enters $E_{C}({\textsf{x}}^*)$ for

$$\begin{aligned} k \ge \max \left\{ 2\frac{\ln (h_0) - \ln (\mu _A r_*({\textsf{x}}^*)^2/2)}{\ln (1/q)}, 0\right\} \,, \end{aligned}$$

(59)

where $\mu _A = \frac{\mu }{n\theta _{A}^2}$ and $q\in (0,1)$ is the constant related to the linear convergence rate of the AFW, i.e. $h_k\le q^k h_0$ for all k.

Proof

(sketch) We present an argument in the case $C= \Delta _{n - 1}$, $A = \{e_i\}_{i \in [1:n]}$ which can be easily extended by affine invariance to the general case (see Bomze et al. (2020) for details). In this case $\theta _{A} \ge 1$ and we can define ${\bar{\mu }:= {\mu }/n }\ge \mu _{A} $.

To start with, the number of steps needed to reach the condition

$$\begin{aligned} h_k \le \frac{\mu }{2n}r_*({\textsf{x}}^*)^2 = \frac{\bar{\mu }}{2} r_*({\textsf{x}}^*)^2 \end{aligned}$$

(60)

is at most

$$\begin{aligned} \bar{k} = \max \left\{ \Bigg \lceil \frac{\ln (h_0) - \ln (\bar{\mu } r_*(x^*)^2/2)}{\ln (1/q)}\Bigg \rceil , 0\right\} . \end{aligned}$$

Now we combine $n\Vert \cdot \Vert \ge {\Vert \cdot \Vert }_1$ with strong convexity and relation (60) to obtain ${\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }_1 \le r_*({\textsf{x}}^*)$, hence in particular ${\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }_1 \le r_*({\textsf{x}}^*)$ for every $k \ge \bar{k}$. Since ${\textsf{x}}_0$ is a vertex of the simplex, and at every step at most one coordinate is added to the support of the current iterate, $|{{\,\textrm{supp}\,}}({\textsf{x}}_{\bar{k}})| \le \bar{k} + 1$. The claim follows by applying Lemma 3. $\square $

Additional bounds under a quadratic growth condition weaker than strong convexity and strict complementarity are reported in Garber (2020).

Convergence and finite time identification for the PFW and the AFW are proved in Bomze et al. (2019) for a specific class of non-convex minimization problems over the standard simplex, under the additional assumption that the sequence generated has a finite set of limit points. In another line of work, active set identification strategies combined with FW variants have been proposed in Cristofari et al. (2020) and Sun (2020).

5.7 Inexact linear oracle

In many real-world applications, linear subproblems can only be solved approximately. This is the reason why the convergence of FW variants is often analyzed under some error term for the linear minimization oracle (see, e.g., Braun et al. (2019), Braun et al. (2017), Freund and Grigas (2016), Jaggi (2013), Konnov (2018)). A common assumption, relaxing the FW vertex exact minimization property, is to have access to a point (usually a vertex) $\tilde{{\textsf{s}}}_k$ such that

$$\begin{aligned} \nabla f({\textsf{x}}_k)^\intercal (\tilde{{\textsf{s}}}_k-{\textsf{x}}_k) \le \min _{{\textsf{s}}\in C} \nabla f({\textsf{x}}_k)^\intercal ({\textsf{s}}-{\textsf{x}}_k) + \delta _k \,, \end{aligned}$$

(61)

for a sequence $\{\delta _k\}$ of non negative approximation errors.

If the sequence $\{\delta _k\}$ is constant and equal to some $\delta > 0$, then trivially the lowest possible approximation error achieved by the FW method is $\delta $. At the same time, (Freund and Grigas (2016), Theorem 5.1) implies a rate of ${{\mathcal {O}}}(\frac{1}{k} + \delta )$ if the stepsize $\alpha _k= \frac{2}{k + 2}$ is used.

The ${{\mathcal {O}}}(1/k)$ rate can be instead retrieved by assuming that $\{\delta _k\}$ converges to 0 quickly enough, and in particular if

$$\begin{aligned} \delta _k = \frac{\delta \kappa _{f, C}}{k + 2} \end{aligned}$$

(62)

for a constant $\delta > 0 $. Under (62), in Jaggi (2013) a convergence rate of

$$\begin{aligned} f(x_k) - f^* \le \frac{2\kappa _{f, C}}{k + 2}(1 + \delta ) \end{aligned}$$

(63)

was proved for the FW method with $\alpha _k$ given by exact line search or equal to $\frac{2}{k + 2}$, as well as for the EFW.

A linearly convergent variant making use of an approximated linear oracle recycling previous solutions to the linear minimization subproblem is studied in Braun et al. (2019). In Freund and Grigas (2016), Hogan (1971), the analysis of the classic FW method is extended to the case of inexact gradient information. In particular in Freund and Grigas (2016), assuming the availability of the $(\delta , L)$ oracle introduced in Devolder et al. (2014), a convergence rate of ${{\mathcal {O}}}(1/k + \delta k)$ is proved.

6 Improved rates for strongly convex objectives

See Table 2

Table 2 Known convergence rates for the FW method and the variants covered in this review

Full size table

6.1 Linear convergence under an angle condition

In the rest of this section we assume that f is $\mu $-strongly convex (57). We also assume that the stepsize is given by exact linesearch or by (28).

Under this assumption, an asymptotic linear convergence rate for the FDFW on polytopes was given in the early work (Guelat & Marcotte, 1986). Furthermore, in Garber and Hazan (2016) a linearly convergent variant was proposed, making use however of an additional local linear minimization oracle.

Recent works obtain linear convergence rates by proving the angle condition

$$\begin{aligned} { - \nabla f({\textsf{x}}_k)^\intercal \widehat{{\textsf{d}}}_k \ge \frac{\tau }{{\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }} \, \nabla f({\textsf{x}}_k)^\intercal ( {\textsf{x}}_k - {\textsf{x}}^*)} \end{aligned}$$

(64)

for some $\tau >0$ and some ${\textsf{x}}^* \in \mathop {\mathrm {arg\,min}}\limits _{{\textsf{x}}\in C} f({\textsf{x}})$. As we shall see in the next lemma, under (64) it is not difficult to prove linear convergence rates in the number of good steps. These are FW steps with $\alpha _k = 1$ and steps in any descent direction with $\alpha _k < 1$.

Lemma 4

If the step k is a good step and (64) holds, then

$$\begin{aligned} { h_{k+1} \le \max \left\{ {\textstyle { \frac{1}{2}}, 1 - \frac{\tau ^2\mu }{L}} \right\} h_k \,.} \end{aligned}$$

(65)

Proof

If the step k is a full FW step then Lemma 2 entails $h_{k+1}\le \frac{1}{2}\, h_k$. In the remaining case, first observe that by strong convexity

$$\begin{aligned} \begin{array}{rcl} f^* & = & f({\textsf{x}}^*) \ge f({\textsf{x}}_k) + \nabla f({\textsf{x}}_k)^\intercal ({\textsf{x}}^* - {\textsf{x}}_k) + \frac{\mu }{2}\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert ^2 \\[0.5em] & \ge & \min \limits _{\alpha \in \mathbb {R}}\left[ f({\textsf{x}}_k) + \alpha \nabla f({\textsf{x}}_k)^\intercal ({\textsf{x}}^* - {\textsf{x}}_k) + \frac{\alpha ^2\mu }{2}\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert ^2 \right] \\[0.5em] & = & f({\textsf{x}}_k) - \frac{1}{2\mu {\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }^2} \left[ \nabla f({\textsf{x}}_k)^\intercal ( {\textsf{x}}_k - {\textsf{x}}^*)\right] ^2 \,, \end{array} \end{aligned}$$

(66)

which means

$$\begin{aligned} h_k \le \frac{1}{2\mu {\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }^2}\left[ \nabla f({\textsf{x}}_k)^\intercal ({\textsf{x}}_k - {\textsf{x}}^*)\right] ^2 \,.\end{aligned}$$

(67)

We can then proceed using the bound (30) from Lemma 1 in the following way:

$$\begin{aligned} \begin{array}{rcl} h_{k+1} & = & f({\textsf{x}}_{k + 1}) - f^* \le f({\textsf{x}}_k) - f^* - \frac{1}{2L}\left[ \nabla f({\textsf{x}}_k)^\intercal \widehat{{\textsf{d}}}_k\right] ^2 \\[0.5em] & \le & h_k - \frac{\tau ^2}{2L{\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }^2} \left[ \nabla f({\textsf{x}}_k)^\intercal ( {\textsf{x}}_k - {\textsf{x}}^* ) \right] ^2 \\[0.5em] & \le & h_k \left( 1 - \frac{ \tau ^2\mu }{L}\right) \,, \end{array} \end{aligned}$$

(68)

where we used (64) in the second inequality and (67) in the third one. $\square $

As a corollary, under (64) we have the rate

$$\begin{aligned} f({\textsf{x}}_{k}) - f^* {= h_k} \le \max \left\{ {\textstyle { \frac{1}{2}}, 1 - \frac{\tau ^2\mu }{L}}\right\} ^{\gamma (k)} {h_0} \end{aligned}$$

(69)

for any method with non increasing $\{f({\textsf{x}}_k)\}$ and following Algorithm 1, with $\gamma (k) \le k$ an integer denoting the number of good steps until step k. It turns out that for all the variants we introduced in this review we have $\gamma (k) \ge Tk$ for some constant $T > 0$. When ${\textsf{x}}^*$ is in the relative interior of $C$, the FW method satisfies (64) and we have the following result (see Guelat and Marcotte (1986), Lacoste-Julien and Jaggi (2015)):

Theorem 3

If ${\textsf{x}}^* \in {{\,\textrm{ri}\,}}(C)$, then

$$\begin{aligned} f({\textsf{x}}_{k}) - f^* \le \left[ 1 - \frac{\mu }{L} \left( \frac{{{\,\textrm{dist}\,}}({\textsf{x}}^*, \partial C)}{D}\right) ^2\right] ^k (f({\textsf{x}}_0) - f^*) \,. \end{aligned}$$

(70)

Proof

We can assume for simplicity ${{\,\textrm{int}\,}}(C) \ne \emptyset $, since otherwise we can restrict ourselves to the affine hull of $C$. Let $\delta ={{\,\textrm{dist}\,}}({\textsf{x}}^*, \partial C)$ and ${\textsf{g}}= -\nabla f({\textsf{x}}_k)$. First, by assumption we have ${\textsf{x}}^* + \delta \widehat{{\textsf{g}}} \in C$. Therefore

$$\begin{aligned} {\textsf{g}}^\intercal {\textsf{d}}_k^{FW} \ge {\textsf{g}}^\intercal (({\textsf{x}}^*+\delta \widehat{{\textsf{g}}} ) - {\textsf{x}}) = \delta {\textsf{g}}^\intercal \widehat{{\textsf{g}}} + {\textsf{g}}^\intercal ({\textsf{x}}^* - {\textsf{x}}) \ge \delta \Vert {\textsf{g}} \Vert + f({\textsf{x}}) - f^* \ge \delta \Vert {\textsf{g}} \Vert \,, \end{aligned}$$

(71)

where we used ${\textsf{x}}^*+\delta \widehat{{\textsf{g}}} \in C$ in the first inequality and convexity in the second. We can conclude

$$\begin{aligned} {\textsf{g}}^\intercal \frac{{\textsf{d}}_k^{FW}}{\Vert {\textsf{d}}_k^{FW} \Vert } \ge {\textsf{g}}^\intercal \frac{{\textsf{d}}_k^{FW}}{D} \ge \frac{\delta }{D} \Vert {\textsf{g}} \Vert \ge \frac{\delta }{D} {\textsf{g}}^\intercal \left( \frac{{\textsf{x}}_k - {\textsf{x}}^*}{\Vert {\textsf{x}}_k - {\textsf{x}}^* \Vert }\right) \,. \end{aligned}$$

(72)

The thesis follows by Lemma 4, noticing that for $\tau = \frac{{{\,\textrm{dist}\,}}({\textsf{x}}^*, \partial C)}{D} \le \frac{1}{2}$ we have $1 - \tau ^2\frac{\mu }{L} > \frac{1}{2}$. $\square $

In Lacoste-Julien and Jaggi (2015), the authors proved that directions generated by the AFW and the PFW on polytopes satisfy condition (64), with $\tau = {{\,\textrm{PWidth}\,}}(A)/D$ and ${{\,\textrm{PWidth}\,}}(A)$, pyramidal width of A. While ${{\,\textrm{PWidth}\,}}(A)$ was originally defined with a rather complex minmax expression, in Peña and Rodriguez (2018) it was then proved

$$\begin{aligned} {{\,\textrm{PWidth}\,}}(A) = \min _{F \in {{\,\textrm{faces}\,}}(C)} {{\,\textrm{dist}\,}}(F, \mathop {\textrm{conv}}\limits (A \setminus F)) \,. \end{aligned}$$

(73)

This quantity can be explicitly computed in a few special cases. For $A = \{0, 1\}^n$ we have ${{\,\textrm{PWidth}\,}}(A) = 1/\sqrt{n}$, while for $A = \{e_i\}_{i \in [1:n]}$ (so that $C$ is the $n - 1$ dimensional simplex)

$$\begin{aligned} {{\,\textrm{PWidth}\,}}(A) = {\left\{ \begin{array}{ll} \frac{2}{\sqrt{n}} & \text { if }n \text { is even} \\ \frac{2}{\sqrt{n - 1/n}} & \text { if }n \text { is odd.} \end{array}\right. } \end{aligned}$$

(74)

Angle conditions like (64) with $\tau $ dependent on the number of vertices used to represent $x_k$ as a convex combination were given in Bashiri and Zhang (2017) and Beck and Shtern (2017) for the FDFW and the PFW respectively. In particular, in Beck and Shtern (2017) a geometric constant $\Omega _{C}$ called vertex-facet distance was defined as

$$\begin{aligned} \Omega _{C} = \min \{{{\,\textrm{dist}\,}}({\textsf{v}}, H): {\textsf{v}}\in V(C) \,, H \in {{\mathcal {H}}}(C), \, {\textsf{v}}\notin H \} \,, \end{aligned}$$

(75)

with $V(C)$ the set of vertices of $C$, and ${{\mathcal {H}}}(C)$ the set of supporting hyperplanes of $C$ (containing a facet of $C$). Then condition (64) is satisfied for $\tau = \Omega _{C}/s$, with ${\textsf{d}}_k$ the PFW direction and s the number of points used in the active set $A_k$.

In Bashiri and Zhang (2017), a geometric constant $H_s$ was defined depending on the minimum number s of vertices needed to represent the current point $x_k$, as well as on the proper^{Footnote 2} inequalities ${\textsf{q}}_i ^{\intercal }{\textsf{x}}\le b_i$, $i \in [{1}\! : \! {m}]$, appearing in a description of $C$. For each of these inequalities the second gap $g_i$ was defined as

$$\begin{aligned} g_i = \max _{{\textsf{v}}\in V(C)} {\textsf{q}}_i^\intercal {\textsf{v}} - \mathop {\textrm{secondmax}}\limits _{{\textsf{v}}\in V(C)} {\textsf{q}}_i^\intercal {\textsf{v}} \,,\quad i\in [{1}\! : \! {m}]\,, \end{aligned}$$

(76)

with the secondmax function giving the second largest value achieved by the argument. Then $H_s$ is defined as

$$\begin{aligned} H_s:= {\max {\textstyle {\left\{ \sum \limits _{j = 1}^n \left( \sum \limits _{i \in S} \frac{a_{ij}}{g_i} \right) ^2: S \in { {[1:m] }\atopwithdelims ()s} \right\} }}}\,. \end{aligned}$$

(77)

The arguments used in the paper imply that (64) holds with $\tau = \frac{1}{{2}D\sqrt{H_s}}$ if ${\textsf{d}}_k$ is a FDFW direction and ${\textsf{x}}_k$ the convex combination of at most s vertices. We refer the reader to Peña and Rodriguez (2018) and Rademacher and Shu (2022) for additional results on these and related constants.

The linear convergence results for strongly convex objectives are extended to compositions of strongly convex objectives with affine transformations in Beck and Shtern (2017), Lacoste-Julien and Jaggi (2015), Peña and Rodriguez (2018). In Gutman and Pena (2020), the linear convergence results for the AFW and the FW method with minimum in the interior are extended with respect to a generalized condition number $L_{f, C, D}/\mu _{f, C, D}$, with D a distance function on $C$.

For the AFW, the PFW and the FDFW, linear rates with no bad steps ($\gamma (k) = k$) are given in Rinaldi and Zeffiro (2023) for non-convex objectives satisfying a Kurdyka-Łojasiewicz inequality. In Rinaldi and Zeffiro (2020), condition (64) was proved for the FW direction and orthographic retractions on some convex sets with smooth boundary. Extensions to block-coordinate variants are reported in Bomze et al. (2024), Bomze et al. (2022). The work (Combettes & Pokutta, 2020) introduces a new FW variant using a subroutine to align the descent direction with the projection on the tangent cone of the negative gradient, thus implicitly maximizing $\tau $ in (64).

6.2 Strongly convex domains

When $C$ is strongly convex we have a ${{\mathcal {O}}}(1/k^2)$ rate (see, e.g., Garber and Hazan (2015), Kerdreux et al. (2021)) for the classic FW method. Furthermore, when $C$ is $\beta _{C}$-strongly convex and $\Vert \nabla f({\textsf{x}}) \Vert \ge c > 0$, then we have the linear convergence rate (see Demyanov and Rubinov (1970), Dunn (1979), Kerdreux et al. (2021), Levitin and Polyak (1966))

$$\begin{aligned} {h_{k+1}} \le \max \left\{ {\textstyle {\frac{1}{2}, 1 - \frac{L}{2c\beta _{C}} }}\right\} {h_k} \,. \end{aligned}$$

(78)

Finally, it is possible to interpolate between the ${{\mathcal {O}}}(1/k^2)$ rate of the strongly convex setting and the ${{\mathcal {O}}}(1/k)$ rate of the general convex one by relaxing strong convexity of the objective with Hölderian error bounds Xu and Yang (2018) and also by relaxing strong convexity of the domain with uniform convexity (Kerdreux et al., 2021).

7 Generalized FW for composite non-smooth optimization

Consider the following composite non-smooth optimization problem, generalizing (1):

$$\begin{aligned} \min _{{\textsf{x}}\in {\mathbb {R}}^n} \left[ f({\textsf{x}}) + g({\textsf{x}})\right] \end{aligned}$$

(79)

with $\nabla f$ being L-Lipschitz and g convex l.s.c. with compact domain $C$. Notice that problem (1) corresponds to the special case where g is the indicator function of $C$. In the work (Bredies et al., 2009), the following generalization of the FW direction was proposed for problem (79) (originally referred to as "generalized conditional gradient"; the reader is referred to (Beck (2017), Chapter 13) for a thorough treatment of the subject):

$$\begin{aligned} {\textsf{d}}_k^{GFW} = {\textsf{s}}_k - {\textsf{x}}_k \text { with } {\textsf{s}}_k \in \mathop {\mathrm {arg\,min}}\limits \limits _{{\textsf{s}}\in {\mathbb {R}}^n} \left[ ({\textsf{s}}- {\textsf{x}}_k)^\intercal \nabla f({\textsf{x}}_k) + g({\textsf{s}}) \right] \,. \end{aligned}$$

(80)

Of course when $g({\textsf{x}})$ is the indicator function of $C$ we retrieve the classic FW direction. The FW gap can be also extended as follows:

$$\begin{aligned} \begin{array}{rcl} G({\textsf{x}}) & = & g({\textsf{x}}) + \max \limits _{{\textsf{s}}\in {\mathbb {R}}^n} \left[ - \nabla f({\textsf{x}})^\intercal ({\textsf{s}}- {\textsf{x}}) - g({\textsf{s}}) \right] \\ & = & g({\textsf{x}}) + {\textsf{x}}^\intercal \nabla f({\textsf{x}})+ \max \limits _{{\textsf{s}}\in {\mathbb {R}}^n} \left[ - \nabla f({\textsf{x}})^\intercal {\textsf{s}} - g({\textsf{s}}) \right] ,\end{array} \end{aligned}$$

(81)

maintaining its optimality measure properties (see Beck (2017), Theorem 13.6).

We focus here on the case where f is convex, and discuss two important properties of the generalized conditional gradient method connected with the Fenchel dual of problem (79)

$$\begin{aligned} \max _{{\textsf{y}}\in \mathbb {R}^n} - f^*({\textsf{y}}) - g^*(-{\textsf{y}}) \,, \end{aligned}$$

(82)

with $f^*$ and $g^*$ convex conjugates of f and g.

Let $h({\textsf{x}}) = f({\textsf{x}}) + g({\textsf{x}})$ be the primal objective and $h^*({\textsf{y}}) = f^*({\textsf{y}}) + g^*(-{\textsf{y}})$ be the (negative) dual objective. The first property is that the gap G turns out to be equal to the Fenchel duality gap $h({\textsf{x}}) + h^*(y)$ when taking ${\textsf{y}}= \nabla f({\textsf{x}})$ as dual variable. This is a direct extension of the analogous property for the classic FW gap mentioned in Sect. 5. It can be deduced directly from, e.g., (Beck (2017), Lemma 13.5). We report here a short and self-contained argument.

Theorem 4

If f, g are convex and f is differentiable then

$$\begin{aligned} G({\textsf{x}}) = h({\textsf{x}}) + h^*(\nabla f({\textsf{x}})) \,. \end{aligned}$$

(83)

Proof

We have

$$\begin{aligned} f^*(\nabla f({\textsf{x}})) = \max _{{\textsf{z}}\in {\mathbb {R}}^n} \left[ {\textsf{z}}^\intercal \nabla f({\textsf{x}}) - f({\textsf{z}}) \right] = {\textsf{x}}^\intercal \nabla f({\textsf{x}}) - f({\textsf{x}}) \end{aligned}$$

(84)

and therefore, using (81) and (84),

$$\begin{aligned} \begin{array}{rcl} h^{*}(\nabla f({\textsf{x}})) & = & f^*(\nabla f({\textsf{x}})) + \max \limits _{{\textsf{s}}\in \mathbb {R}^n} \left[ -{\textsf{s}}^\intercal \nabla f({\textsf{x}}) - g({\textsf{s}})\right] \\ & = & {\textsf{x}}^\intercal \nabla f({\textsf{x}}) - f({\textsf{x}}) + G({\textsf{x}})-g({\textsf{x}}) - {\textsf{x}}^\intercal \nabla f({\textsf{x}}) \\ & = & G({\textsf{x}}) - h({\textsf{x}}) \,, \end{array} \end{aligned}$$

(85)

which establishes the claim. $\square $

The above result also extends to non-differentiable objectives f, but in this case the gap also depends on the particular subgradient considered in $\partial f({\textsf{x}})$.

The second property, described below in detail and proved in Bach (2015), is that the generalized FW is equivalent to the mirror gradient descent applied to the Fenchel dual, when considering the gradient as dual variable. Recall that the mirror descent method applied to the composite optimization problem $\min _{{\textsf{y}}\in \mathbb {R}^n}\left[ h_1({\textsf{y}}) + q({\textsf{y}})\right] $ has updates of the form

$$\begin{aligned} {\textsf{y}}_{k + 1} \in \mathop {\mathrm {arg\,min}}\limits _{{\textsf{y}}\in {\mathbb {R}}^n} \nabla h_1({\textsf{y}}_k)^\intercal ({\textsf{y}}- {\textsf{y}}_k) + \frac{1}{\alpha _k}D_q({\textsf{y}}_k, {\textsf{y}}) \end{aligned}$$

(86)

with $D_q({\textsf{x}}, {\textsf{y}})$ the Bregman divergence associated to q:

$$\begin{aligned} D_q({\textsf{x}}, {\textsf{y}}) = q({\textsf{y}}) - q({\textsf{x}}) - \nabla q({\textsf{x}})^\intercal ({\textsf{y}}- {\textsf{x}}) \,. \end{aligned}$$

(87)

The result proved in Bach (2015) states that if

$$\begin{aligned} {\textsf{x}}_{k + 1} = {\textsf{x}}_k + \alpha _k {\textsf{d}}_k^{GFW} \,, \end{aligned}$$

then a mirror gradient descent update as given in (86) on the Fenchel dual starting with ${\textsf{y}}_k = \nabla f({\textsf{x}}_k)$ gives

$$\begin{aligned} {\textsf{y}}_{k + 1} = \nabla f({\textsf{x}}_{k + 1}) \,. \end{aligned}$$

As a corollary, a convergence rate of ${{\mathcal {O}}}(1/k)$ under the stepsize $\alpha _k = 2/(k + 2)$ was also derived in Bach (2015) for the generalized FW method.

We finish this section mentioning briefly some additional results. In the general nonconvex case and with stepsize $\alpha _k = 2/(k + 2)$, it was proved that the method converges with rate ${{\mathcal {O}}}(1/\sqrt{k})$ (see, e.g., Beck (2017)). The analysis of the generalized FW was extended to the case of inexact oracles in Yu et al. (2017), Yurtsever et al. (2018). We refer the reader to Yu et al. (2017) for a more abstract analysis in Banach spaces, as well as for additional results for g equal to a gauge function (see Yu et al. (2017), Section 2). In Nesterov (2018) parametrized convergence rates were given under the Hölder condition for the gradient, and a trust region version was introduced. In the special case where g is a norm, an adaption of the generalized FW was descrbied in Harchaoui et al. (2015) (see also Pierucci et al. (2014) for a version using smoothing techniques).

8 Extensions

8.1 Block coordinate Frank–Wolfe method

The block coordinate FW (BCFW) was introduced in Lacoste-Julien et al. (2013) for block product domains of the form $C= C^{(1)} \times ... \times C^{(m)} \subseteq \mathbb {R}^{n_1 +... + n_m} $, and applied to structured SVM training. The algorithm operates by selecting a random block and performing a FW step in that block. Formally, for ${\textsf{s}}\in \mathbb {R}^{m_i}$ let ${\textsf{s}}^{(i)} \in \mathbb {R}^n$ be the vector with all blocks equal to ${\textsf{o}}$ except for the i-th block equal to ${\textsf{s}}$. We can write the direction of the BCFW as

$$\begin{aligned} \begin{aligned} {\textsf{d}}_k&= {\textsf{s}}_k^{(i)} - {\textsf{x}}_k \\ {\textsf{s}}_k&\in \mathop {\mathrm {arg\,min}}\limits _{{\textsf{s}}\in C^{(i)}} \nabla f({\textsf{x}}_k)^\intercal {\textsf{s}}^{(i)} \end{aligned} \end{aligned}$$

(88)

for a random index $i \in [{1}\! : \! {n}]$.

In Lacoste-Julien et al. (2013), a convergence rate of

$$\begin{aligned} \mathbb {E}[f(x_k)] - f^* \le \frac{{2Km}}{k + 2m} \end{aligned}$$

(89)

is proved, for $K = h_0 + \kappa _f^{\otimes }$, with $\kappa _f^{\otimes }$ the product domain curvature constant, defined as $\kappa _f^{\otimes } = \sum \kappa _f^{\otimes , i}$ where $\kappa _f^{\otimes , i}$ are the curvature constants of the objective fixing the blocks outside $C^{(i)}$:

$$\begin{aligned} \kappa _f^{\otimes , i}:= {\sup \left\{ 2{\textstyle { \frac{f( {\textsf{x}}+\alpha {\textsf{d}}^{(i)}) - f({\textsf{x}}) - \alpha \nabla f({\textsf{x}})^\intercal {\textsf{d}}^{(i)}}{\alpha ^2}}}: {\textsf{d}}\in C-{\textsf{x}},\, {\textsf{x}}\in C, \, \alpha \in (0, 1]\right\} \,.} \end{aligned}$$

(90)

An asynchronous and parallel generalization for this method was given in Wang et al. (2016). This version assumes that a cloud oracle is available, modeling a set of worker nodes each sending information to a server at different times. This information consists of an index i and the following LMO on $C^{(i)}$:

$$\begin{aligned} {\textsf{s}}_{(i)} \in \mathop {\mathrm {arg\,min}}\limits \limits _{{\textsf{s}}\in C^{(i)}} \nabla f({\textsf{x}}_{\widetilde{k}})^\intercal {\textsf{s}}^{(i)} \,. \end{aligned}$$

(91)

The algorithm is called asynchronous because $\widetilde{k}$ can be smaller than k, modeling a delay in the information sent by the node. Once the server has collected a minibatch S of $\tau $ distinct indexes (overwriting repetitions), the descent direction is defined as

$$\begin{aligned} {\textsf{d}}_k = \sum _{i \in S} {\textsf{s}}^{(i)}_{{(i)}} \,, \end{aligned}$$

(92)

If the indices sent by the nodes are i.i.d., then under suitable assumptions on the delay, a convergence rate of

$$\begin{aligned} \mathbb {E}[f({\textsf{x}}_k)] - f^* \le \frac{2mK_{\tau }}{\tau ^2k + 2m} \end{aligned}$$

(93)

can be proved, where $K_{\tau } = m\kappa _{f, \tau }^{\otimes }(1 + \delta ) + h_0$ for $\delta $ depending on the delay error, with $\kappa _{f, \tau }^{\otimes }$ the average curvature constant in a minibatch keeping all the components not in the minibatch fixed.

In Osokin et al. (2016), several improvements are proposed for the BCFW, including an adaptive criterion to prioritize blocks based on their FW gap, and block coordinate versions of the AFW and the PFW variants.

In Shah et al. (2015), a multi plane BCFW approach is proposed in the specific case of the structured SVM, based on caching supporting planes in the primal, corresponding to block linear minimizers in the dual. In Berrada et al. (2019), the duality for structured SVM between BCFW and stochastic subgradient descent is exploited to define a learning rate schedule for neural networks based only on one hyper parameter. The block coordinate approach is extended to the generalized FW in Beck et al. (2015), with coordinates however picked in a cyclic order.

8.2 Conditional gradient sliding

The conditional gradient sliding (CGS) method, introduced in Lan and Zhou (2016), is based on the application of the classic FW method as a subroutine to compute projections in the accelerated projected gradient (APG) method. More precisely, let ${\textsf{g}}$ be the current negative gradient, in general different from $-\nabla f({\textsf{x}}_k)$ for the APG method. Since for a stepsize $\eta > 0$ we have

$$\begin{aligned} \pi _{C}({\textsf{x}}_k + \eta {\textsf{g}}) = \mathop {\mathrm {arg\,min}}\limits _{{\textsf{y}}\in C} \left[ {\textsf{g}}^\intercal ({\textsf{x}}_k - {\textsf{y}}) + \frac{1}{2\eta }\Vert {\textsf{y}}- {\textsf{x}}_k \Vert ^2\right] \,, \end{aligned}$$

(94)

the FW method can be applied to

$$\begin{aligned} \bar{V}_{{\textsf{g}}, {\textsf{x}}_k, \eta }({\textsf{y}}):= {\textsf{g}}^\intercal ({\textsf{x}}_k - {\textsf{y}}) + \frac{1}{2\eta }\Vert {\textsf{y}}- {\textsf{x}}_k \Vert ^2 \,, \end{aligned}$$

(95)

generating an auxiliary sequence $\{{\textsf{u}}_t^{(k)}\}$ with starting point ${\textsf{u}}_0^{(k)} = {\textsf{x}}_k$. The stopping criterion is based on the FW gap for $\bar{V}$:

$$\begin{aligned} \max _{{\textsf{x}}\in C} \, ( \beta ({\textsf{x}}_k - {\textsf{u}}_t^{(k)})+{\textsf{g}})^\intercal ({\textsf{x}}- {\textsf{u}}_t^{(k)}) \le \varepsilon _k \,, \end{aligned}$$

(96)

with $\varepsilon _k$ a decreasing sequence converging to 0. Notice how both the gradient and exact linesearch are readily available for the projection objective $\bar{V}$, so that the main costs of the projection subroutine come from the extra linear minimization oracle calls. In Lan and Zhou (2016), convergence rates of ${{\mathcal {O}}}(1/k^2)$ and ${{\mathcal {O}}}(1/k)$ in the number of gradient computations and LMO calls respectively were proved for convex objectives, thus achieving both the optimal ${{\mathcal {O}}}(1/k^2)$ rate for gradient descent methods (see e.g. Nesterov 1998) and the optimal ${{\mathcal {O}}}(1/k)$ in FW steps. The parameters used to achieve this rate were $\eta _k = \frac{3\,L}{k + 1}$ and $\varepsilon _k = \frac{LD^2}{k(k + 1)}$. In the strongly convex case, a rate linear in the number of gradient computations can be proved.

We refer the reader to (Lan (2020), sections 7.2, 7.3) for a comprehensive study of CGS, including also stochastic and finite sum objectives. The method was also studied in the composite setting in Cheung and Lou (2015) and in the smooth non-convex setting in Qu et al. (2018) and (with no acceleration) in Thekumparampil et al. (2020). A variant for unbounded domains was given in Gonçalves et al. (2020). In a different line of work, in Diakonikolas et al. (2020) a strategy switching between the AFW and accelerated gradient descent (roughly speaking, after support identification) was proved to achieve the optimal asymptotic convergence rate of ${{\mathcal {O}}}(\sqrt{\frac{L}{\mu }} \ln (1/\varepsilon ))$ in the strongly convex case.

8.3 Variants for the min-norm point problem

Consider the min-norm point (MNP) problem

$$\begin{aligned} \min _{{\textsf{x}}\in C} {\Vert {\textsf{x}}\Vert }_{*} \,, \end{aligned}$$

(97)

with $C$ a closed convex subset of $\mathbb {R}^n$ and ${\Vert \cdot \Vert }_{*}$ a norm on $\mathbb {R}^n$. In Wolfe (1976), a FW variant is introduced to solve the problem when $C$ is a polytope and ${\Vert \cdot \Vert }_*$ is the standard Euclidean norm $\Vert \cdot \Vert $. Similarly to the variants introduced in Sect. 5.3, it generates a sequence of active sets $\{A_k\}$ with $ {\textsf{s}}_k\in A_{k + 1}$. At the step k the norm is minimized on the affine hull ${{\,\textrm{aff}\,}}(A_k)$ of the current active set $A_k$, that is

$$\begin{aligned} {\textsf{v}}_k = \mathop {\mathrm {arg\,min}}\limits _{{\textsf{y}}\in {{\,\textrm{aff}\,}}(A_k)} \Vert {\textsf{y}} \Vert \,. \end{aligned}$$

(98)

The descent direction ${\textsf{d}}_k$ is then defined as

$$\begin{aligned} {\textsf{d}}_k = {\textsf{v}}_k - {\textsf{x}}_k\,, \end{aligned}$$

(99)

and the stepsize is given by a tailored linesearch that allows to remove some of the atoms in the set $A_k$ (see, e.g. Lacoste-Julien and Jaggi (2015), Wolfe (1976)). Whenever ${\textsf{x}}_{k + 1}$ is in the relative interior of $\mathop {\textrm{conv}}\limits (A_k)$, the FW vertex is added to the active set (that is, ${\textsf{s}}_k \in A_{k + 1}$). Otherwise, at least one of the vertices not appearing in a convex representation of ${\textsf{x}}_k$ is removed. This scheme converges linearly when applied to generic smooth strongly convex objectives (see, e.g., Lacoste-Julien and Jaggi 2015).

In Harchaoui et al. (2015), a FW variant is proposed for minimum norm problems of the form

$$\begin{aligned} \min \{{\Vert {\textsf{x}} \Vert }_*: f({\textsf{x}}) \le 0, \, {\textsf{x}}\in K \} \end{aligned}$$

(100)

with K a convex cone, f convex with L-Lipschitz gradient. In particular, the optimization domain is $C= \{{\textsf{x}}\in \mathbb {R}^n: f({\textsf{x}}) \le 0\} \cap K$. The technique proposed in the article applies the standard FW method to the problems

$$\begin{aligned} \min \{f({\textsf{x}}): {\Vert {\textsf{x}} \Vert }_* \le \delta _k, \, {\textsf{x}}\in K\} \,, \end{aligned}$$

with $\{\delta _k\}$ an increasing sequence convergent to the optimal value $\bar{\delta }$ of the problem (100). Let $C(\delta ) = \{{\textsf{x}}\in \mathbb {R}^n: {\Vert {\textsf{x}} \Vert }_* \le \delta \} \cap K $ for $\delta \ge 0$, and let

$$\begin{aligned} {\text {LM}}({\textsf{r}}) \in \mathop {\mathrm {arg\,min}}\limits \limits _{{{\textsf{x}}\in C(1)}} {\textsf{r}}^\intercal {\textsf{x}} \,, \end{aligned}$$

so that by homogeneity for every k the linear minimization oracle on $C(\delta _k)$ is given by

$$\begin{aligned} \text {LMO}_{C(\delta _k)}({\textsf{r}}) = \delta _k {\text {LM}}({\textsf{r}}) \,. \end{aligned}$$

(101)

For every k, applying the FW method with suitable stopping conditions an approximate minimizer ${\textsf{x}}_k$ of $f( {\textsf{x}})$ over $C(\delta _k)$ is generated, with an associated lower bound on the objective, an affine function in ${\textsf{y}}$:

$$\begin{aligned} f_{k}({\textsf{y}}):= f({\textsf{x}}_k) + \nabla f({\textsf{x}}_k)^\intercal ({\textsf{y}}- {\textsf{x}}_k) \,. \end{aligned}$$

(102)

Then the function

$$\begin{aligned} \ell _{k}(\delta ):= {\min _{{\textsf{y}}\in C(\delta )} f_k({\textsf{y}}) } = f_{k}(\delta {\text {LM}({\textsf{g}}_k)) \quad \text {with } {\textsf{g}}_k= \nabla f({\textsf{x}}_k)} \end{aligned}$$

(103)

is decreasing and affine in $\delta $ and satisfies

$$\begin{aligned} \ell _{k}(\delta ) = \min _{{\textsf{y}}\in C(\delta )} f_k({\textsf{y}}) \le F(\delta ): =\min _{{\textsf{y}}\in C(\delta )} f({\textsf{y}}) \,. \end{aligned}$$

(104)

Therefore, for

$$\begin{aligned} \bar{\ell }_k(\delta ) = \max _{i \in [1:k]} \ell _i(\delta ) \le F(\delta ) \end{aligned}$$

the quantity $\delta _{k + 1}$ can be defined as $\min \{\delta \ge 0: {\bar{\ell }_{k}(\delta )} \le 0 \}$, hence $F(\delta _{k + 1}) \ge 0$. A complexity bound of ${{\mathcal {O}}}(\frac{1}{\varepsilon } \ln (\frac{1}{\varepsilon }))$ was given to achieve precision $\varepsilon $ applying this method, with ${{\mathcal {O}}}(1/\varepsilon )$ iterations per subproblem and length of the sequence $\{\delta _k\}$ at most ${{\mathcal {O}}}(\ln (1/\varepsilon ))$ (see Harchaoui et al. (2015), Theorem 2 for details).

8.4 Variants for optimization over the trace norm ball

The FW method has found many applications for optimization problems over the trace norm ball. In this case, as explained in Example 3.4, linear optimization can be obtained by computing the top left and right singular vectors of the matrix $-\nabla f({\textsf{X}}_k)$, an operation referred to as 1-SVD (see Allen-Zhu et al. (2017)).

In the work (Freund et al., 2017), the FDFW is applied to the matrix completion problem (13), thus generating a sequence of matrices $\{{\textsf{X}}_k\}$ with ${\Vert {\textsf{X}}_k \Vert }_* \le \delta $ for every k. The method can be implemented efficiently exploiting the fact that for ${\textsf{X}}$ on the boundary of the nuclear norm ball, there is a simple expression for the face $\mathcal {F}({\textsf{X}})$. For ${\textsf{X}}\in \mathbb {R}^{m \times n}$ with $\mathop {\textrm{rank}}\limits ({\textsf{X}}) = k$ let ${\textsf{U}}{\textsf{D}}{\textsf{V}}^{\intercal }$ be the thin SVD of ${\textsf{X}}$, so that ${\textsf{D}}\in \mathbb {R}^{k \times k}$ is the diagonal matrix of non zero singolar values for ${\textsf{X}}$, with corresponding left and right singular vectors in the columns of ${\textsf{U}}\in \mathbb {R}^{m \times k}$ and ${\textsf{V}}\in \mathbb {R}^{n \times k}$ respectively. If ${\Vert {\textsf{X}} \Vert }_* = \delta $ then the minimal face of the domain containing ${\textsf{X}}$ is the set

$$\begin{aligned} \mathcal {F}({\textsf{X}}) = \{{\textsf{X}}\in \mathbb {R}^{m \times n}: {\textsf{X}}= {\textsf{U}}{\textsf{M}}{\textsf{V}}^{\intercal } \text { for } {{\textsf{M}}={\textsf{M}} ^{\intercal }\ \mathop {\textrm{psd}}\limits \text { with }} {\Vert {\textsf{M}} \Vert }_{*} = \delta \} \,. \end{aligned}$$

(105)

It is not difficult to see that we have $\mathop {\textrm{rank}}\limits ({\textsf{X}}_k) \le k + 1$ for every $k \in \mathbb {N}$, as well. Furthermore, the thin SVD of the current iterate ${\textsf{X}}_k$ can be updated efficiently both after FW steps and after in face steps. The convergence rate of the FDFW in this setting is still ${{\mathcal {O}}}(1/k)$.

In the recent work (Wang et al., 2022), an unbounded variant of the FW method is applied to solve a generalized version of the trace norm ball optimization problem:

$$\begin{aligned} \min _{{\textsf{X}}\in \mathbb {R}^{m \times n}}\{f({\textsf{X}}): {\Vert {\textsf{P}}{\textsf{X}}{\textsf{Q}} \Vert }_* \le \delta \} \end{aligned}$$

(106)

with ${\textsf{P}}, {\textsf{Q}}$ singular matrices. The main idea of the method is to decompose the domain in the sum $S + T$ between the kernel T of the linear function $\varphi _{{\textsf{P}}, {\textsf{Q}}}({\textsf{X}})= {\textsf{P}}{\textsf{X}}{\textsf{Q}}$and a bounded set $S \subset T^{\perp }$. Then gradient descent steps in the unbounded component T are alternated to FW steps in the bounded component S. The authors apply this approach to the generalized LASSO as well, using the AFW for the bounded component.

In Allen-Zhu et al. (2017), a variant of the classic FW using k-SVD (computing the top k left and right singular vectors for the SVD) is introduced, and it is proved that it converges linearly for strongly convex objectives when the solution has rank at most k. In Mu et al. (2016), the FW step is combined with a proximal gradient step for a quadratic problem on the product of the nuclear norm ball with the $\ell _1$ ball. Approaches using an equivalent formulation on the spectrahedron introduced in Jaggi and Sulovský (2010) are analyzed in Ding et al. (2020), Garber (2023).

9 Conclusions

While the concept of the FW method is quite easy to understand, its advantages, witnessed by a multitude of related work, may not be apparent to someone not closely familiar with the subject. Therefore we considered, in Sect. 3, several motivating applications, ranging from classic optimization to more recent machine learning problems. As in any line search-based method, the proper choice of stepsize is an important ingredient to achieve satisfactory performance. In Sect. 4, we review several options for stepsizes in first order methods, which are closely related both to the theoretical analysis as well as to practical implementation issues, guaranteeing fast convergence. This scope was investigated in more detail in Sect. 5 covering main results about the FW method and its most popular variants, including the ${{\mathcal {O}}}(1/k)$ convergence rate for convex objectives, affine invariance, the sparse approximation property, and support identification. The account is complemented by a report on recent progress in improving on the ${{\mathcal {O}}}(1/k)$ convergence rate in Sect. 6. Versatility and efficiency of this approach was also illustrated in the final Sect. 8 describing present recent FW variants fitting different optimization frameworks and computational environments, in particular block coordinate, distributed, accelerated, and trace norm optimization. For sure many other interesting and relevant aspects of FW and friends could not find their way into this review because of space and time limitations, but the authors hope to have convinced readers that FW merits a consideration even by non-experts in first-order optimization.

Notes

Details related to the LMO cost can be found in, e.g., Jaggi (2013).
i.e., those inequalities strictly satisfied for some ${\textsf{x}}\in C$.

References

Ahipaşaoğlu, S. D., Sun, P., & Todd, M. J. (2008). Linear convergence of a modified Frank–Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimisation Methods and Software, 23(1), 5–19.
Article Google Scholar
Ahipaşaoğlu, S. D., & Todd, M. J. (2013). A modified Frank–Wolfe algorithm for computing minimum-area enclosing ellipsoidal cylinders: Theory and algorithms. Computational Geometry, 46(5), 494–519.
Article Google Scholar
Allen-Zhu, Z., Hazan, E., Hu, W., & Li, Y. (2017). Linear convergence of a Frank–Wolfe type algorithm over trace-norm balls. Advances in Neural Information Processing Systems, 2017, 6192–6201.
Google Scholar
Bach, F. (2013). Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2–3), 145–373.
Article Google Scholar
Bach, F. (2015). Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1), 115–129.
Article Google Scholar
Bashiri, M.A., & Zhang, X. (2017). Decomposition-invariant conditional gradient for general polytopes with line search. Advances in Neural Information Processing Systems, pp. 2690–2700
Beck, A. (2017). First-order methods in optimization. Philadelphia: SIAM.
Book Google Scholar
Beck, A., Pauwels, E., & Sabach, S. (2015). The cyclic block conditional gradient method for convex optimization problems. SIAM Journal on Optimization, 25(4), 2024–2049.
Article Google Scholar
Beck, A., & Shtern, S. (2017). Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164(1–2), 1–27.
Article Google Scholar
Berrada, L., Zisserman, A., & Kumar, M.P. (2019). Deep Frank-Wolfe for neural network optimization. In: International conference on learning representations
Bertsekas, D. P. (2015). Convex optimization algorithms. Athena Scientific
Bomze, I. M. (1997). Evolution towards the maximum clique. Journal of Global Optimization, 10(2), 143–164.
Article Google Scholar
Bomze, I. M., Budinich, M., Pardalos, P. M., Pelillo, M. (1999). The maximum clique problem. In: Handbook of combinatorial optimization, pp. 1–74. Springer
Bomze, I. M., & de Klerk, E. (2002). Solving standard quadratic optimization problems via linear, semidefinite and copositive programming. Journal of Global Optimization, 24(2), 163–185.
Article Google Scholar
Bomze, I. M., Rinaldi, F., & Rota Bulò, S. (2019). First-order methods for the impatient: Support identification in finite time with convergent Frank–Wolfe variants. SIAM Journal on Optimization, 29(3), 2211–2226.
Article Google Scholar
Bomze, I. M., Rinaldi, F., & Zeffiro, D. (2020). Active set complexity of the away-step Frank–Wolfe algorithm. SIAM Journal on Optimization, 30(3), 2470–2500.
Article Google Scholar
Bomze, I. M., Rinaldi, F., Zeffiro, D. (2021). Frank-Wolfe and friends: A journey into projection-free first-order optimization methods. 4OR 19(3), 313–345
Bomze, I. M., Rinaldi, F., & Zeffiro, D. (2022). Fast cluster detection in networks by first order optimization. SIAM Journal on Mathematics of Data Science, 4(1), 285–305.
Article Google Scholar
Bomze, I. M., Rinaldi, F., & Zeffiro, D. (2024). Projection free methods on product domains. Computational Optimization and Applications. https://doi.org/10.1007/s10589-024-00585-5
Article Google Scholar
Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Book Google Scholar
Braun, G., Pokutta, S., Tu, D., & Wright, S. (2019). Blended conditonal gradients. In: International conference on machine learning, pp. 735–743. PMLR
Braun, G., Pokutta, S., & Zink, D. (2017). Lazifying conditional gradient algorithms. In: ICML, pp. 566–575
Bredies, K., Lorenz, D. A., & Maass, P. (2009). A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42, 173–193.
Article Google Scholar
Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Article Google Scholar
Canon, M. D., & Cullum, C. D. (1968). A tight upper bound on the rate of convergence of Frank–Wolfe algorithm. SIAM Journal on Control, 6(4), 509–516.
Article Google Scholar
Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In: 2017 IEEE symposium on security and privacy (sp), pp. 39–57. IEEE
Chakrabarty, D., Jain, P., & Kothari, P. (2014). Provable submodular minimization using Wolfe’s algorithm. Advances in Neural Information Processing Systems, 27, 802–809.
Google Scholar
Chen, J., Zhou, D., Yi, J., & Gu, Q. (2020). A Frank–Wolfe framework for efficient and effective adversarial attacks. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 34 no. 04, pp. 3486–3494
Chen, P. Y., Zhang, H., Sharma, Y., Yi, J., & Hsieh, C. J. (2017). ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proceedings of the 10th ACM workshop on artificial intelligence and security, pp. 15–26
Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43(1), 129–159.
Article Google Scholar
Cheung, Y., & Lou, J. (2015). Efficient generalized conditional gradient with gradient sliding for composite optimization. In: Twenty-fourth international joint conference on artificial intelligence
Clarkson, K. L. (2010). Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm. ACM Transactions on Algorithms, 6(4), 1–30.
Article Google Scholar
Combettes, C., & Pokutta, S. (2020) Boosting frank-wolfe by chasing gradients. In: International conference on machine learning, pp. 2111–2121. PMLR
Combettes, C. W., & Pokutta, S. (2021). Complexity of linear minimization and projection on some sets. Operations Research Letters, 49(4), 565–571.
Article Google Scholar
Cristofari, A., De Santis, M., Lucidi, S., & Rinaldi, F. (2020). An active-set algorithmic framework for non-convex optimization problems over the simplex. Computational Optimization and Applications, 77, 57–89.
Article Google Scholar
Demyanov, V. F., & Rubinov, A. M. (1970). Approximate methods in optimization problems. American Elsevier
Devolder, O., Glineur, F., & Nesterov, Y. (2014). First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1), 37–75.
Article Google Scholar
Diakonikolas, J., Carderera, A., & Pokutta, S. (2020). Locally accelerated conditional gradients. In: International conference on artificial intelligence and statistics, pp. 1737–1747. PMLR
Ding, L., Fei, Y., Xu, Q., & Yang, C. (2020). Spectral Frank-Wolfe algorithm: Strict complementarity and linear convergence. In: International conference on machine learning, pp. 2535–2544. PMLR
Dunn, J. C. (1979). Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17(2), 187–211.
Article Google Scholar
Dunn, J. C., & Harshbarger, S. (1978). Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2), 432–444.
Article Google Scholar
Ferreira, O., & Sosa, W. (2021). On the Frank–Wolfe algorithm for non-compact constrained optimization problems. Optimization pp. 1–15
Frank, M., & Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2), 95–110.
Article Google Scholar
Freund, R. M., & Grigas, P. (2016). New analysis and results for the Frank–Wolfe method. Mathematical Programming, 155(1–2), 199–230.
Article Google Scholar
Freund, R. M., Grigas, P., & Mazumder, R. (2017). An extended Frank–Wolfe method with in-face directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1), 319–346.
Article Google Scholar
Fujishige, S. (1980). Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research, 5(2), 186–196.
Article Google Scholar
Fukushima, M. (1984). A modified Frank–Wolfe algorithm for solving the traffic assignment problem. Transportation Research Part B: Methodological, 18(2), 169–177.
Article Google Scholar
Garber, D. (2020). Revisiting Frank–Wolfe for polytopes: Strict complementarity and sparsity. Advances in Neural Information Processing Systems 33
Garber, D. (2023). Linear convergence of Frank–Wolfe for rank-one matrix recovery without strong convexity. Mathematical Programming, 199(1), 87–121.
Article Google Scholar
Garber, D., & Hazan, E. (2015). Faster rates for the Frank–Wolfe method over strongly-convex sets. ICML, 15, 541–549.
Google Scholar
Garber, D., & Hazan, E. (2016). A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3), 1493–1528.
Article Google Scholar
Gonçalves, M. L., Melo, J. G., & Monteiro, R. D. (2020) Projection-free accelerated method for convex optimization. Optimization Methods and Software pp. 1–27
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, pp. 2672–2680
Guelat, J., & Marcotte, P. (1986). Some comments on Wolfe’s away step. Mathematical Programming, 35(1), 110–119.
Article Google Scholar
Gutman, D. H., & Pena, J. F. (2020). The condition number of a function relative to a set. Mathematical Programming pp. 1–40
Harchaoui, Z., Juditsky, A., & Nemirovski, A. (2015). Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1), 75–112.
Article Google Scholar
Hogan, W. W. (1971). Convergence results for some extensions of the Frank-Wolfe method. Tech. rep.: California Univ Los Angeles Western Management Science Inst.
Holloway, C. A. (1974). An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6(1), 14–27.
Article Google Scholar
Hungerford, J. T., & Rinaldi, F. (2019). A general regularized continuous formulation for the maximum clique problem. Mathematics of Operations Research, 44(4), 1161–1173.
Article Google Scholar
Jaggi, M. (2011). Sparse convex optimization methods for machine learning. Ph.D. thesis, ETH Zurich
Jaggi, M. (2013). Revisiting Frank–Wolfe: Projection-free sparse convex optimization. ICML, 1, 427–435.
Google Scholar
Jaggi, M., Sulovský, M. (2010). A simple algorithm for nuclear norm regularized problems. In: ICML, pp. 471–478
Joulin, A., Tang, K., & Fei-Fei, L. (2014). Efficient image and video co-localization with Frank-Wolfe algorithm. In: European conference on computer vision, pp. 253–268. Springer
Kazemi, E., Kerdreux, T., & Wang, L. (2021). Generating structured adversarial attacks using Frank–Wolfe method. arXiv:2102.07360
Kerdreux, T., d’Aspremont, A., & Pokutta, S. (2021). Projection-free optimization on uniformly convex sets. In: International conference on artificial intelligence and statistics, pp. 19–27. PMLR
Kerdreux, T., Liu, L., Lacoste-Julien, S., & Scieur, D. (2021). Affine invariant analysis of Frank–Wolfe on strongly convex sets. International conference on machine learning, pp. 5398–5408
Konnov, I. (2018). Simplified versions of the conditional gradient method. Optimization, 67(12), 2275–2290.
Article Google Scholar
Kumar, P., Mitchell, J. S., & Yıldırım, E. A. (2003). Approximate minimum enclosing balls in high dimensions using core-sets. Journal of Experimental Algorithmics, 8, 1–1.
Article Google Scholar
Lacoste-Julien, S. (2016). Convergence rate of Frank–Wolfe for non-convex objectives. arXiv:1607.00345
Lacoste-Julien, S., Jaggi, M. (2015). On the global linear convergence of Frank–Wolfe optimization variants. In: Advances in neural information processing systems, pp. 496–504
Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P. (2013). Block-coordinate Frank-Wolfe optimization for structural SVMs. In: S. Dasgupta, D. McAllester (eds.) Proceedings of the 30th international conference on machine learning, Proceedings of Machine Learning Research, Vol. 28, pp. 53–61. PMLR, Atlanta, Georgia, USA
Lan, G. (2020). First-order and stochastic optimization methods for machine learning. Springer
Lan, G., & Zhou, Y. (2016). Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2), 1379–1409.
Article Google Scholar
LeBlanc, L. J., Morlok, E. K., & Pierskalla, W. P. (1975). An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation Research, 9(5), 309–318.
Article Google Scholar
Levitin, E. S., & Polyak, B. T. (1966). Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5), 1–50.
Article Google Scholar
Locatello, F., Khanna, R., Tschannen, M., & Jaggi, M. (2017). A unified optimization view on generalized matching pursuit and Frank–Wolfe. In: Artificial intelligence and statistics, pp. 860–868. PMLR
Luce, R. D., & Perry, A. D. (1949). A method of matrix analysis of group structure. Psychometrika, 14(2), 95–116.
Article Google Scholar
Mangasarian, O.: Machine learning via polyhedral concave minimization. In: Applied mathematics and parallel computing, pp. 175–188. Springer
Mitchell, B., Demyanov, V. F., & Malozemov, V. (1974). Finding the point of a polyhedron closest to the origin. SIAM Journal on Control, 12(1), 19–26.
Article Google Scholar
Mitradjieva, M., & Lindberg, P. O. (2013). The stiff is moving—conjugate direction Frank–Wolfe methods with applications to traffic assignment. Transportation Science, 47(2), 280–293.
Article Google Scholar
Mu, C., Zhang, Y., Wright, J., & Goldfarb, D. (2016). Scalable robust matrix recovery: Frank–Wolfe meets proximal methods. SIAM Journal on Scientific Computing, 38(5), A3291–A3317.
Article Google Scholar
Nesterov, Y. (1998). Introductory lectures on convex [p]rogramming Volume I: Basic course. Lecture notes
Nesterov, Y. (2018). Complexity bounds for primal-dual methods minimizing the model of objective function. Mathematical Programming, 171(1), 311–330.
Article Google Scholar
Osokin, A., Alayrac, J. B., Lukasewitz, I., Dokania, P., & Lacoste-Julien, S. (2016). Minding the gaps for block Frank–Wolfe optimization of structured svms. In: International conference on machine learning, pp. 593–602. PMLR
Peña, J., & Rodriguez, D. (2018). Polytope conditioning and linear convergence of the Frank–Wolfe algorithm. Mathematics of Operartions Research, 44(1), 1–18.
Google Scholar
Pedregosa, F., Negiar, G., Askari, A., & Jaggi, M. (2020). Linearly convergent Frank–Wolfe with backtracking line-search. In: International conference on artificial intelligence and statistics, pp. 1–10. PMLR
Perederieieva, O., Ehrgott, M., Raith, A., & Wang, J. Y. (2015). A framework for and empirical study of algorithms for traffic assignment. Computers & Operations Research, 54, 90–107.
Article Google Scholar
Pierucci, F., Harchaoui, Z., & Malick, J. (2014). A smoothing approach for composite conditional gradient with nonsmooth loss. Tech. rep., RR-8662, INRIA Grenoble
Qu, C., Li, Y., & Xu, H. (2018). Non-convex conditional gradient sliding. In: International conference on machine learning, pp. 4208–4217. PMLR
Rademacher, L., & Shu, C. (2022). The smoothed complexity of Frank–Wolfe methods via conditioning of random matrices and polytopes. Mathematical Statistics and Learning, 5(3), 273–310.
Article Google Scholar
Rinaldi, F., Schoen, F., & Sciandrone, M. (2010). Concave programming for minimizing the zero-norm over polyhedral sets. Computational Optimization and Applications, 46(3), 467–486.
Article Google Scholar
Rinaldi, F., & Zeffiro, D. (2020). A unifying framework for the analysis of projection-free first-order methods under a sufficient slope condition. arXiv:2008.09781
Rinaldi, F., & Zeffiro, D. (2023). Avoiding bad steps in Frank–Wolfe variants. Computational Optimization and Applications, 84(1), 225–264.
Article Google Scholar
Sahu, A. K., & Kar, S. (2020). Decentralized zeroth-order constrained stochastic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks. Proceedings of the IEEE, 108(11), 1890–1905.
Article Google Scholar
Shah, N., Kolmogorov, V., & Lampert, C. H. (2015). A multi-plane block-coordinate Frank–Wolfe algorithm for training structural svms with a costly max-oracle. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2737–2745
Sun, Y. (2020). Safe screening for the generalized conditional gradient method. Image 1, 2
Thekumparampil, K. K., Jain, P., Netrapalli, P., Oh, S. (2020). Projection efficient subgradient method and optimal nonsmooth Frank–Wolfe method. In: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (eds.) Advances in neural information processing systems, vol. 33, pp. 12211–12224. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/8f468c873a32bb0619eaeb2050ba45d1-Paper.pdf
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Article Google Scholar
Vapnik, V. (2013). The Nature of Statistical Learning Theory. Springer
Von Hohenbalken, B. (1977). Simplicial decomposition in nonlinear programming algorithms. Mathematical Programming, 13(1), 49–68.
Article Google Scholar
Wang, H., Lu, H., & Mazumder, R. (2022). Frank-Wolfe methods with an unbounded feasible region and applications to structured learning. SIAM Journal on Optimization, 32(4), 2938–2968.
Article Google Scholar
Wang, Y. X., Sadhanala, V., Dai, W., Neiswanger, W., Sra, S., Xing, E. (2016). Parallel and distributed block-coordinate Frank–Wolfe algorithms. In: International conference on machine learning, pp. 1548–1557. PMLR
Wardrop, J. G. (1952). Road paper some theoretical aspects of road traffic research. Proceedings of the Institution of Civil Engineers, 1(3), 325–362.
Article Google Scholar
Weintraub, A., Ortiz, C., & González, J. (1985). Accelerating convergence of the Frank-Wolfe algorithm. Transportation Research Part B: Methodological, 19(2), 113–122.
Article Google Scholar
Wolfe, P. (1970). Convergence theory in nonlinear programming. In: J. Abadie (ed.) Integer and nonlinear programming, pp. 1–36. North Holland
Wolfe, P. (1976). Finding the nearest point in a polytope. Mathematical Programming, 11(1), 128–149.
Article Google Scholar
Wu, Q., & Hao, J. K. (2015). A review on algorithms for maximum clique problems. European Journal of Operational Research, 242(3), 693–709.
Article Google Scholar
Xu, Y., Yang, T. (2018). Frank-Wolfe method is automatically adaptive to error bound condition. arXiv:1810.04765
Yıldırım, E. A. (2008). Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization, 19(3), 1368–1391.
Article Google Scholar
Yu, Y., Zhang, X., & Schuurmans, D. (2017). Generalized conditional gradient for sparse estimation. Journal of Machine Learning Research, 18(144), 1–46.
Google Scholar
Yurtsever, A., Fercoq, O., Locatello, F., Cevher, V. (2018). A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In: International conference on machine learning, pp. 5727–5736. PMLR

Download references

Funding

Open access funding provided by University of Vienna.

Author information

Authors and Affiliations

Faculty of Mathematics & ds:UniVie, Universität Wien, Wien, Austria
Immanuel. M. Bomze
Dipartimento di Matematica “Tullio Levi-Civita”, Università di Padova, Padova, Italy
Francesco Rinaldi & Damiano Zeffiro

Authors

Immanuel. M. Bomze
View author publications
You can also search for this author inPubMed Google Scholar
Francesco Rinaldi
View author publications
You can also search for this author inPubMed Google Scholar
Damiano Zeffiro
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Immanuel. M. Bomze.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is an updated version of the paper that appeared in 4OR, Vol. 19, pp. 313–345 (2021).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Bomze, I.M., Rinaldi, F. & Zeffiro, D. Frank–Wolfe and friends: a journey into projection-free first-order optimization methods. Ann Oper Res 343, 607–638 (2024). https://doi.org/10.1007/s10479-024-06251-7

Download citation

Received: 09 April 2024
Accepted: 30 July 2024
Published: 16 September 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s10479-024-06251-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Frank–Wolfe and friends: a journey into projection-free first-order optimization methods

Abstract

Similar content being viewed by others

Frank–Wolfe and friends: a journey into projection-free first-order optimization methods

A Note on Open Problems and Challenges in Optimization Theory and Algorithms

Nonlinear Optimization: A Brief Overview

1 Introduction

1.1 Organisation of the paper

1.2 Notation

2 Problem and general scheme

2.1 The classical Frank–Wolfe method

3 Examples

3.1 Traffic assignment

3.2 Submodular optimization

3.3 LASSO problem

3.4 Matrix completion

3.5 Adversarial attacks in machine learning

3.6 Minimum enclosing ball

3.7 Training linear Support Vector Machines

3.8 Finding maximal cliques in graphs

3.9 Finding sparse points in a set

4 Stepsizes

Lemma 1

Proof

5 Properties of the FW method and its variants

5.1 The FW gap

5.2 \({{\mathcal {O}}}(1/k)\) rate for convex objectives

Theorem 1

Lemma 2

Proof

Proof

5.3 Variants

5.4 Sparse approximation properties

5.5 Affine invariance

5.6 Support identification for the AFW

Lemma 3

Proof

Theorem 2

Proof

5.7 Inexact linear oracle

6 Improved rates for strongly convex objectives

6.1 Linear convergence under an angle condition

Lemma 4

Proof

Theorem 3

Proof

6.2 Strongly convex domains

7 Generalized FW for composite non-smooth optimization

Theorem 4

Proof

8 Extensions

8.1 Block coordinate Frank–Wolfe method

8.2 Conditional gradient sliding

8.3 Variants for the min-norm point problem

8.4 Variants for optimization over the trace norm ball

9 Conclusions

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords