Abstract
The majority of first-order methods for large-scale convex–concave saddle point problems and variational inequalities with monotone operators are proximal algorithms. To make such an algorithm practical, the problem's domain should be proximal-friendly, i.e., admit a strongly convex function with easy-to-minimize linear perturbations. As a by-product, such a domain admits a computationally cheap linear minimization oracle (LMO) capable of minimizing linear forms. There are, however, important situations where a cheap LMO is indeed available, but the problem domain is not proximal-friendly, which motivates the search for algorithms based solely on an LMO. For smooth convex minimization, there exists a classical LMO-based algorithm, the conditional gradient method. In contrast, the similar techniques known to us for other problems with convex structure (nonsmooth convex minimization, convex–concave saddle point problems, even ones as simple as bilinear, and variational inequalities with monotone operators, even ones as simple as affine) are quite recent and utilize a common approach based on Fenchel-type representations of the associated objectives/vector fields. The goal of this paper is to develop alternative (and seemingly much simpler) LMO-based decomposition techniques for bilinear saddle point problems and for variational inequalities with affine monotone operators.
Notes
Note that the saddle point frontier depends on the order of blocks in the x- and the y-variables, and this order will always be clear from the context.
The construction to follow can be easily extended from “knapsack-generated” matrices to more general “Dynamic Programming-generated” ones, see Sect. 1 in the “Appendix.”
For implementation details, see Sect. 1.
“a primal” instead of “the primal” reflects the fact that \(\varPsi \) is not uniquely defined by \(\varPhi \)—it is defined by \(\varPhi \) and \(\overline{\eta }\) and by how the values of \(\varPsi \) are selected when (32) does not specify these values uniquely.
“covers” instead of “is equivalent” stems from the fact that the scope of decomposition is not restricted to the setups of the form of (41).
Note that by applying the Carathéodory theorem, we could further "compress" the representations of approximate solutions, making them convex combinations of at most \(K+1\) of the \(\delta ^D_{d^i}\)s and \(\delta ^A_{a^i}\)s.
References
Juditsky, A., Nemirovski, A.: Solving variational inequalities with monotone operators on domains given by linear minimization oracles. Math. Program. 152(1), 1–36 (2013)
Harchaoui, Z., Juditsky, A., Nemirovski, A.: Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program. 152(1), 75–112 (2014)
Ziegler, G.M.: Lectures on Polytopes, vol. 152. Springer, Berlin (1995)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3(1–2), 95–110 (1956)
Demyanov, V., Rubinov, A.: Approximate Methods in Optimization Problems, vol. 32. Elsevier, Amsterdam (1970)
Dunn, J.C., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)
Freund, R.M., Grigas, P.: New analysis and results for the Frank–Wolfe method. Math. Program. 155(1–2), 199–230 (2016)
Garber, D., Hazan, E.: Faster Rates for the Frank–Wolfe Method Over Strongly-convex Sets. arXiv preprint arXiv:1406.1305 (2014)
Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J.: Large-scale image classification with trace-norm regularization. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3386–3393. IEEE (2012)
Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 427–435 (2013)
Jaggi, M., Sulovský, M.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 471–478 (2010)
Pshenichny, B.N., Danilin, Y.M.: Numerical Methods in Extremal Problems. Mir, Moscow (1978)
Argyriou, A., Signoretto, M., Suykens, J.A.K.: Hybrid algorithms with applications to sparse and low rank regularization. In: Suykens, J.A.K., Signoretto, M., Argyriou, A. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, chap. 3, pp. 53–82. Chapman & Hall/CRC (2014)
Pierucci, F., Harchaoui, Z., Malick, J.: A Smoothing Approach for Composite Conditional Gradient with Nonsmooth Loss. Tech. rep., Inria (2014). https://hal.inria.fr/hal-01096630/
Tewari, A., Ravikumar, P.K., Dhillon, I.S.: Greedy algorithms for structurally constrained high dimensional problems. In: Advances in Neural Information Processing Systems, pp. 882–890 (2011)
Ying, Y., Li, P.: Distance metric learning with eigenvalue optimization. J. Mach. Learn. Res. 13(1), 1–26 (2012)
Cox, B., Juditsky, A., Nemirovski, A.: Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 148(1–2), 143–180 (2014)
Lan, G., Zhou, Y.: Conditional Gradient Sliding for Convex Optimization (2014). http://www.ise.ufl.edu/glan/files/2015/09/CGS08-31
Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms I: Fundamentals, vol. 305. Springer, Berlin (2013)
Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)
Cox, B.: Applications of Accuracy Certificates for Problems with Convex Structure. Ph.D. thesis, Georgia Institute of Technology (2011). https://smartech.gatech.edu/jspui/bitstream/1853/39489/1/cox_bruce_a_201105_phd
Gol’stein, E.: Direct-dual block method of linear programming. Autom. Remote Control 57(11), 1531–1536 (1996)
Gol’stein, E., Sokolov, N.: A decomposition algorithm for solving multicommodity production-and-transportation problem. Ekonomika i Matematicheskie Metody 33(1), 112–128 (1997)
Dvurechensky, P., Nesterov, Y., Spokoiny, V.: Primal-dual methods for solving infinite-dimensional games. J. Optim. Theory Appl. 166(1), 23–51 (2015)
Bellman, R.: On “Colonel Blotto” and analogous games. SIAM Rev. 11(1), 66–68 (1969)
Robertson, B.: The Colonel Blotto game. Econ. Theory 29(1), 1–24 (2006)
Grant, M., Boyd, S.: CVX: MATLAB Software for Disciplined Convex Programming, version 2.1 (2015). http://cvxr.com/cvx
Acknowledgments
A. Juditsky was supported by the CNRS-Mastodons project Titan, and the LabEx PERSYVAL-Lab (ANR-11-LABX-0025). Research of A. Nemirovski was supported by the NSF Grants CMMI-1232623, CCF-1415498, CMMI-1262063.
Author information
Authors and Affiliations
Corresponding author
Appendix
Proof of Lemma 2.1
It suffices to prove the \(\phi \)-related statements. Lipschitz continuity of \(\phi \) in the direct product case is evident. Furthermore, the function \(\theta (x_1,x_2;y_1)=\max \limits _{y_2\in Y_2[y_1]}\varPhi (x_1,x_2;y_1,y_2)\) is convex and Lipschitz continuous in \(x=[x_1;x_2]\in X\) for every \(y_1\in Y_1\), whence
is convex and lower semicontinuous in \(x_1\in X_1\) (note that X is compact). On the other hand,
so that \(\chi (x_1;y_1,y_2)\) is concave and Lipschitz continuous in \(y=[y_1;y_2]\in Y\) for every \(x_1\in X_1\), whence
is concave and upper semicontinuous in \(y_1\in Y_1\) (note that Y is compact).
Next, we have
as required in (2). Finally, let \(\bar{x}=[\bar{x}_1;\bar{x}_2]\in X\) and \(\bar{y}=[\bar{y}_1;\bar{y}_2]\in Y\). We have
and
We conclude that
as claimed in (3). \(\square \)
Proof of Lemma 2.2
For \(x_1\in X_1\), we have
as claimed in (a). “Symmetric” reasoning justifies (b). \(\square \)
Proof of Lemma 2.3
Assume that (5) holds true. Then, G clearly is certifying, implying that
and therefore (5) reads
whence, taking the minimum over \(x_2\in X_2[x_1]\) in the left-hand side,
as claimed in (ii).
Now assume that (i) and (ii) hold true. By (i), \(\chi _G(\bar{x}_1)=\langle G,[\bar{x}_1;\bar{x}_2]\rangle \), and by (ii) combined with the definition of \(\chi _G\),
implying (5). \(\square \)
1.1 Dynamic Programming-Generated Simple Matrices
Consider the following situation. A system \(\mathcal{S}\) evolves in time, with its state \(\xi _s\) at time \(s=1,2,\ldots ,m\) belonging to a given finite nonempty set \(\Xi _s\). Furthermore, every pair \((\xi ,s)\) with \(s\in \{1,\ldots ,m\}\), \(\xi \in \Xi _s\) is associated with a nonempty finite set of actions \(A^s_\xi \), and we set
Furthermore, for every s, \(1\le s< m\), a transition mapping \(\pi _{s}(\xi ,a):\mathcal{S}_s\rightarrow \Xi _{s+1}\) is given. Finally, we are given vector-valued functions ("outputs") \(\chi _s:\mathcal{S}_s\rightarrow {\mathbb {R}}^{r_s}\).
A trajectory of \(\mathcal{S}\) is a sequence \(\{(\xi _s,a_s):1\le s\le m\}\) such that \((\xi _s,a_s)\in \mathcal{S}_s\) for \(1\le s\le m\) and
The output of a trajectory \(\tau =\{(\xi _s,a_s):1\le s\le m\}\) is the block vector
We can associate with \(\mathcal{S}\) the matrix \(D=D[\mathcal{S}]\) with \(K=r_1+\cdots +r_m\) rows and with columns indexed by the trajectories of \(\mathcal{S}\); specifically, the column indexed by a trajectory \(\tau \) is \(\chi [\tau ]\).
For example, the knapsack-generated matrix D associated with the knapsack data from Sect. 2.6.2 is of the form \(D[\mathcal{S}]\), with the system \(\mathcal{S}\) defined as follows:
- \(\Xi _s\), \(s=1,\ldots ,m\), is the set of nonnegative integers which are \(\le H\);
- \(A^s_\xi \) is the set of nonnegative integers a such that \(a\le \bar{p}_s\) and \(\xi -ah_s\ge 0\);
- the transition mappings are \(\pi _{s}(\xi ,a)=\xi -ah_s\);
- the outputs are \(\chi _s(\xi ,a)=f_s(a)\), \(1\le s\le m\).
In the notation of Sect. 2.6.2, vectors \([p_1;\ldots ;p_m]\in \mathcal{P}\) are exactly the sequences of actions \(a_1,\ldots ,a_m\) stemming from the trajectories of the just defined system \(\mathcal{S}\).
Observe that matrix \(D=D[\mathcal{S}]\) is simple, provided the cardinalities of \(\Xi _s\) and \(A^s_\xi \) are reasonable. Indeed, given \(x=[x_1;\ldots ;x_m]\in {\mathbb {R}}^{n}={\mathbb {R}}^{r_1}\times \cdots \times {\mathbb {R}}^{r_m}\), we can identify \(\overline{D}[x]\) by dynamic programming, running first the backward Bellman recurrence
(where \(U_{m+1}(\cdot )\equiv 0\)), and then recovering the (trajectory indexing the) column of D corresponding to \(\overline{D}[x]\) by running the forward Bellman recurrence
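As an illustration, the two recurrences can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; all identifiers (`best_column`, `Xi`, `A`, `pi`, `chi`) are our hypothetical encoding of \(\Xi _s\), \(A^s_\xi \), \(\pi _s\), \(\chi _s\), and we read \(\overline{D}[x]\) as the column of D maximizing the inner product with x.

```python
def best_column(x, Xi, A, pi, chi, m):
    """Identify (the trajectory indexing) the column of D = D[S] that
    maximizes sum_s <x_s, chi_s(xi_s, a_s)> over all trajectories.

    x  : x[1..m], per-stage weight vectors (x[0] is unused)
    Xi : Xi[s] = iterable of states at stage s
    A  : A(s, xi) = iterable of admissible actions at (xi, s)
    pi : pi(s, xi, a) = next state (defined for s < m)
    chi: chi(s, xi, a) = stage output, a vector of length r_s
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # Backward Bellman recurrence:
    #   U_s(xi) = max_{a in A^s_xi} [ <x_s, chi_s(xi,a)> + U_{s+1}(pi_s(xi,a)) ],
    # with U_{m+1} identically 0.
    U = {m + 1: {}}
    for s in range(m, 0, -1):
        U[s] = {}
        for xi in Xi[s]:
            vals = [dot(x[s], chi(s, xi, a))
                    + (U[s + 1].get(pi(s, xi, a), 0.0) if s < m else 0.0)
                    for a in A(s, xi)]
            if vals:
                U[s][xi] = max(vals)

    # Forward Bellman recurrence: start from the best initial state and
    # greedily pick optimal actions, recovering an optimal trajectory.
    xi = max(U[1], key=U[1].get)
    val, traj = U[1][xi], []
    for s in range(1, m + 1):
        def score(a):
            nxt = U[s + 1].get(pi(s, xi, a), 0.0) if s < m else 0.0
            return dot(x[s], chi(s, xi, a)) + nxt
        a = max(A(s, xi), key=score)
        traj.append((xi, a))
        if s < m:
            xi = pi(s, xi, a)
    return val, traj
```

The cost is proportional to \(\sum _s\sum _{\xi \in \Xi _s}|A^s_\xi |\), which is exactly why \(D[\mathcal{S}]\) is simple when the cardinalities of \(\Xi _s\) and \(A^s_\xi \) are reasonable.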
1.2 Attacker Versus Defender Via Ellipsoid Algorithm
In our implementation:

1. Relation (39) is ensured by specifying U, V as Euclidean balls of radius R centered at the origin, where R is an upper bound on the Euclidean norms of the columns of D and of A (such a bound is easily obtained from the knapsack data specifying the matrices D, A).
2. We process the monotone vector field associated with the primal SP problem (30), that is, the field

$$\begin{aligned} F(u,v)=[F_u(u,v)=\overline{A}[u]-v;\;F_v(u,v)=u-\underline{D}[v]] \end{aligned}$$

by the ellipsoid algorithm with accuracy certificates from [20]. For \(\tau =1,2,\ldots \), the algorithm generates search points \([u_\tau ;v_\tau ]\in {\mathbb {R}}^K\times {\mathbb {R}}^K\), with \([u_1;v_1]=0\), along with execution protocols \(\mathcal{I}^\tau =\{[u_i;v_i],F(u_i,v_i):i\in I_\tau \}\), where \(I_\tau =\{i\le \tau :[u_i;v_i]\in U\times V\}\), augmented by accuracy certificates \(\lambda ^\tau =\{\lambda ^\tau _i\ge 0:i\in I_\tau \}\) such that \(\sum _{i\in I_\tau }\lambda ^\tau _i=1\). From the results of [20], it follows that for every \(\epsilon >0\),
$$\begin{aligned} \tau \ge N(\epsilon ):= O(1)K^2\ln \left( 2{R+\epsilon \over \epsilon }\right) \Rightarrow {\mathrm{Res}}(\mathcal{I}^\tau ,\lambda ^\tau \big |U\times V)\le \epsilon . \end{aligned}$$

(45)

3. When computing \(F(u_i,v_i)\) (this computation takes place only at productive steps, i.e., those with \([u_i;v_i]\in U\times V\)), we get, as a by-product, the columns \(A^i=\overline{A}[u_i]\) and \(D^i=\underline{D}[v_i]\) of the matrices A, D, along with the indexes \(a^i\), \(d^i\) of these columns (recall that these indexes are pure strategies of the attacker and the defender and thus, by the construction of A, D, are collections of m nonnegative integers). In our implementation, we stored these columns, along with their indexes and the corresponding search points \([u_i;v_i]\). As is immediately seen, in the case in question the approximate solution \([w^\tau ;z^\tau ]\) to the SP problem of interest (27) induced by the execution protocol \(\mathcal{I}^\tau \) and the accuracy certificate \(\lambda ^\tau \) is comprised of two sparse vectors
$$\begin{aligned} w^\tau =\sum _{i\in I_\tau }\lambda ^\tau _i\delta ^D_{d^i},\quad z^\tau =\sum _{i\in I_\tau }\lambda ^\tau _i\delta ^A_{a^i}, \end{aligned}$$

(46)

where \(\delta ^D_d\) is the "dth basic orth" in the simplex \(\varDelta _N\) of probabilistic vectors with entries indexed by pure strategies of the defender, and similarly for \(\delta ^A_a\). Thus, we have no difficulties with representing our approximate solutions, in spite of their huge ambient dimension.
According to our general theory and (45), the number of steps needed to get an \(\epsilon \)-solution [w; z] to the problem of interest (i.e., a feasible solution with \(\epsilon _{{\tiny \mathrm sad}}([w;z]\big |\psi ,W,Z)\le \epsilon )\) does not exceed \(N(\epsilon )\), with computational effort per step dominated by the necessity to identify \(\overline{A}[u_i]\), \(\underline{D}[v_i]\) by dynamic programming.
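For illustration, the sparse bookkeeping behind (46) amounts to accumulating certificate weights on the (few) distinct pure strategies encountered at productive steps. A minimal sketch under our own naming conventions (`sparse_solution` is hypothetical; pure strategies are encoded as tuples of m nonnegative integers):

```python
from collections import defaultdict

def sparse_solution(weights, d_indices, a_indices):
    """Accumulate the convex combinations (46) sparsely:
    map pure strategy (a tuple of m nonnegative integers) -> total weight,
    instead of forming dense vectors of huge ambient dimension.

    weights   : accuracy certificate {lambda^tau_i : i in I_tau}
    d_indices : defender pure strategies d^i (one per productive step)
    a_indices : attacker pure strategies a^i
    """
    w, z = defaultdict(float), defaultdict(float)
    for lam, d, a in zip(weights, d_indices, a_indices):
        w[d] += lam  # w^tau = sum_i lambda_i * (d^i-th basic orth)
        z[a] += lam  # z^tau = sum_i lambda_i * (a^i-th basic orth)
    return dict(w), dict(z)
```

The supports of \(w^\tau \), \(z^\tau \) never exceed \(|I_\tau |\) entries, whatever the ambient dimension N.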
In fact, we used the outlined scheme with two straightforward modifications.
- First, instead of building the accuracy certificates \(\lambda ^\tau \) according to the rules from [20], we used the best accuracy certificates for the given execution protocols \(\mathcal{I}^\tau \), obtained by solving the convex program

$$\begin{aligned} \min _\lambda \left\{ {\mathrm{Res}}(\mathcal{I}^\tau ,\lambda \big |U\times V):=\max _{y\in U\times V}\sum _{i\in I_\tau } \lambda _i \langle F(u_i,v_i),[u_i;v_i]-y\rangle :\lambda _i\ge 0,\sum _{i\in I_\tau }\lambda _i=1\right\} . \end{aligned}$$

(47)

In our implementation, this problem was solved once per \(4K^2\) steps. Note that with U, V being Euclidean balls, (47) is a conic quadratic problem and may be solved using, e.g., CVX [27].
- Second, given the current approximate solution (46) to the problem of interest, we can compute its saddle point inaccuracy exactly, instead of upper-bounding it by \({\mathrm{Res}}(\mathcal{I}^\tau ,\lambda ^\tau \big |U\times V)\). Indeed, it is immediately seen that

$$\begin{aligned} \epsilon _{{\tiny \mathrm sad}}([w^\tau ;z^\tau ]\big |\psi ,W,Z)={\hbox {Max}}\left( A^T\left[ \sum _{i\in I_\tau }\lambda ^\tau _iD^i\right] \right) -{\hbox {Min}}\left( D^T\left[ \sum _{i\in I_\tau }\lambda ^\tau _iA^i\right] \right) . \end{aligned}$$

In our implementation, we performed this computation each time a new accuracy certificate was computed, and terminated the solution process when the saddle point inaccuracy dropped below a given threshold (\(10^{-4}\)).
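Reading (27) as a bilinear saddle point problem with cost \(\psi (w,z)=\langle Dw, Az\rangle \) (which is what the identity above expresses, with \(D^i=D\delta ^D_{d^i}\), \(A^i=A\delta ^A_{a^i}\)), the exact inaccuracy is cheap to evaluate from the stored columns. A hedged NumPy sketch under that assumption (the function name and calling convention are ours):

```python
import numpy as np

def exact_sad_inaccuracy(D, A, lam, d_idx, a_idx):
    """Exact saddle point inaccuracy of the solution (46), assuming the
    bilinear cost psi(w, z) = <Dw, Az>:
        eps_sad = Max(A^T [sum_i lam_i D^i]) - Min(D^T [sum_i lam_i A^i]),
    where D^i = D[:, d_i] and A^i = A[:, a_i] are the stored columns.
    """
    Dw = D[:, d_idx] @ lam  # D w^tau, a combination of |I_tau| columns
    Az = A[:, a_idx] @ lam  # A z^tau
    return float(np.max(A.T @ Dw) - np.min(D.T @ Az))
```

Since \(\max _z\psi (w^\tau ,z)\ge \psi (w^\tau ,z^\tau )\ge \min _w\psi (w,z^\tau )\) over the simplices, the returned gap is nonnegative for every feasible pair and can directly drive the termination test.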
Proof of Proposition 3.2
(i): Let \(\xi _1,\xi _2\in \Xi \), and let \(\eta _1=\overline{\eta }(\xi _1)\), \(\eta _2=\overline{\eta }(\xi _2)\). By (32), we have
Summing these inequalities up, we get
so that \(\varPsi \) is monotone.
Furthermore, the first inequality in (35) is due to Proposition 3.1. To prove the second inequality in (35), let \(\mathcal{I}_t=\{\xi _i\in \Xi ,\varPsi (\xi _i):1\le i\le t\}\), \(\mathcal{J}_t=\{\theta _i:=[\xi _i;\overline{\eta }(\xi _i)],\varPhi (\theta _i):1\le i\le t\}\), and let \(\lambda \) be a t-step accuracy certificate. We have
(i) is proved.
(ii): Let \(\eta \in H\). Invoking (34), we have
and (36) follows. \(\square \)
Cite this article
Cox, B., Juditsky, A. & Nemirovski, A. Decomposition Techniques for Bilinear Saddle Point Problems and Variational Inequalities with Affine Monotone Operators. J Optim Theory Appl 172, 402–435 (2017). https://doi.org/10.1007/s10957-016-0949-3
Keywords
- Decomposition techniques
- Conditional gradients
- Variational problems with affine monotone operator
- Proximal algorithms