Abstract
Maximum a posteriori (MAP) inference is an important task for graphical models. Due to complex dependencies among variables in realistic models, finding an exact solution for MAP inference is often intractable. Thus, many approximation methods have been developed, among which the linear programming (LP) relaxation based methods show promising performance. However, one major drawback of LP relaxation is that it may produce fractional solutions. Instead of presenting a tighter relaxation, in this work we propose a continuous but equivalent reformulation of the original MAP inference problem, called LS–LP. We add the \(\ell _2\)-sphere constraint onto the original LP relaxation, leading to an intersection with the local marginal polytope that is equivalent to the space of all valid integer label configurations. Thus, LS–LP is equivalent to the original MAP inference problem. We propose a perturbed alternating direction method of multipliers (ADMM) algorithm to optimize the LS–LP problem, by adding a sufficiently small perturbation \(\epsilon \) onto the objective function and constraints. We prove that the perturbed ADMM algorithm globally converges to the \(\epsilon \)-Karush–Kuhn–Tucker (\(\epsilon \)-KKT) point of the LS–LP problem. We also analyze the convergence rate. Experiments on several benchmark datasets from the Probabilistic Inference Challenge (PIC 2011) and OpenGM 2 show competitive performance of our proposed method against state-of-the-art MAP inference methods.

References
Attouch, H., & Bolte, J. (2009). On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1–2), 5–16.
Attouch, H., Bolte, J., Redont, P., & Soubeyran, A. (2010). Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Lojasiewicz inequality. Mathematics of Operations Research, 35(2), 438–457.
Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological), 48(3), 259–279.
Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205–1223.
Bolte, J., Daniilidis, A., Lewis, A., & Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2), 556–572.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122.
Elidan, G., Globerson, A., & Heinemann, U. (2012). Pascal 2011 probabilistic inference challenge. Retrieved July 15, 2020, from http://www.cs.huji.ac.il/project/PASCAL/index.php.
Fu, Q., Wang, H., & Banerjee, A. (2013). Bethe-ADMM for tree decomposition based parallel MAP inference. In Uncertainty in artificial intelligence (p. 222). Citeseer.
Globerson, A., & Jaakkola, T. S. (2008). Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS (pp. 553–560).
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV (pp. 1–8). IEEE.
Jaimovich, A., Elidan, G., Margalit, H., & Friedman, N. (2006). Towards an integrated protein–protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2), 145–164.
Johnson, J. K., Malioutov, D. M., & Willsky, A. S. (2007). Lagrangian relaxation for MAP estimation in graphical models. ArXiv preprint arXiv:0710.0013.
Jojic, V., Gould, S., & Koller, D. (2010). Accelerated dual decomposition for MAP inference. In ICML (pp. 503–510).
Kappes, J. H., Andres, B., Hamprecht, F. A., Schnörr, C., Nowozin, S., Batra, D., et al. (2015). A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115, 155–184.
Kappes, J. H., Savchynskyy, B., & Schnörr, C. (2012). A bundle approach to efficient MAP-inference by Lagrangian relaxation. In CVPR (pp. 1688–1695). IEEE.
Karush, W. (1939). Minima of functions of several variables with inequalities as side constraints. M.Sc. Dissertation. Department of Mathematics, University of Chicago.
Kelley, J. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4), 703–712.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. Cambridge, MA: MIT Press.
Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1568–1583.
Komodakis, N., Paragios, N., & Tziritas, G. (2007) MRF optimization via dual decomposition: Message-passing revisited. In ICCV (pp. 1–8). IEEE.
Kschischang, F. R., Frey, B. J., & Loeliger, H. A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Kuhn, H. W., & Tucker, A. W. (2014). Nonlinear programming. In Traces and emergence of nonlinear programming (pp. 247–258). Springer.
Land, A. H., & Doig, A. G. (1960). An automatic method of solving discrete programming problems. Econometrica, 28, 497–520.
Laurent, M., & Rendl, F. (2002). Semidefinite programming and integer programming. Centrum voor Wiskunde en Informatica.
Li, G., & Pong, T. K. (2015). Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4), 2434–2460.
Łojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87–89.
Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., & Xing, E. P. (2011). An augmented Lagrangian approach to constrained MAP inference. In ICML.
Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., & Xing, E. P. (2015). AD3: Alternating directions dual decomposition for MAP inference in graphical models. Journal of Machine Learning Research, 16(1), 495–545.
Meshi, O., & Globerson, A. (2011). An alternating direction method for dual MAP LP relaxation. In Joint European conference on machine learning and knowledge discovery in databases (pp. 470–483). Springer.
Meshi, O., Mahdavi, M., & Schwing, A. (2015). Smooth and strong: MAP inference with linear convergence. In NIPS (pp. 298–306).
Otten, L., & Dechter, R. (2012). Anytime AND/OR depth-first search for combinatorial optimization. AI Communications, 25(3), 211–227.
Otten, L., Ihler, A., Kask, K., & Dechter, R. (2012). Winning the PASCAL 2011 MAP challenge with enhanced AND/OR branch-and-bound. In NIPS workshop DISCML. Citeseer.
Savchynskyy, B., Schmidt, S., Kappes, J., & Schnörr, C. (2012). Efficient MRF energy minimization via adaptive diminishing smoothing. ArXiv preprint arXiv:1210.4906.
Schwing, A. G., Hazan, T., Pollefeys, M., & Urtasun, R. (2012). Globally convergent dual MAP LP relaxation solvers using Fenchel–Young margins. In NIPS (pp. 2384–2392).
Schwing, A. G., Hazan, T., Pollefeys, M., & Urtasun, R. (2014). Globally convergent parallel MAP LP relaxation solver using the Frank–Wolfe algorithm. In ICML (pp. 487–495).
Sontag, D. A. (2010). Approximate inference in graphical models using LP relaxations. Ph.D. Thesis, Massachusetts Institute of Technology.
Sontag, D. A., Li, Y., et al. (2012). Efficiently searching for frustrated cycles in MAP inference. In UAI.
Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005). MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11), 3697–3717.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1–305.
Wang, Y., Yin, W., & Zeng, J. (2017). Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1), 29–63.
Wu, B., & Ghanem, B. (2019). \(\ell _p\)-box ADMM: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1695–1708.
Xu, Y., & Yin, W. (2013). A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3), 1758–1789.
Additional information
Communicated by Julien Mairal.
Baoyuan Wu was partially supported by Tencent AI Lab and King Abdullah University of Science and Technology (KAUST). Li Shen was supported by Tencent AI Lab. Bernard Ghanem was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR). Tong Zhang was supported by the Hong Kong University of Science and Technology (HKUST). Li Shen is the corresponding author.
Convergence Analysis
To facilitate the convergence analysis, we first restate some equations and notation defined in Sect. 5. Problem (11) can be simplified to the following general form:
Our explanation of (32) is divided into three parts:
- 1.
Variables \(\mathbf {x}= [ \varvec{\mu }_1; \ldots ; \varvec{\mu }_{|V|} ] \in \mathbb {R}^{\sum _{i}^{ V} |\mathcal {X}_i|}\) concatenates all variable nodes \(\varvec{\mu }_V\). \(\mathbf {y}= [\mathbf {y}_1; \ldots ; \mathbf {y}_{|V|}]\) with \(\mathbf {y}_i = [\varvec{\upsilon }_i; \varvec{\mu }_{\alpha _{i,1}}; \ldots ; \varvec{\mu }_{\alpha _{i,|\mathcal {N}_i|}}] \in \mathbb {R}^{|\mathcal {X}_i| + \sum _{\alpha }^{\mathcal {N}_i} |\mathcal {X}_{\alpha }|}\). \(\mathbf {y}\) concatenates all factor nodes \(\varvec{\mu }_F\) and the extra variable nodes \(\varvec{\upsilon }\); \(\mathbf {y}_i\) concatenates the factor nodes and the extra variable node connected to the i-th variable node \(\varvec{\mu }_i\). \(\mathcal {N}_i\) denotes the set of neighboring factor nodes connected to the i-th variable node; the subscript \(\alpha _{i,j}\) indicates the j-th factor connected to the i-th variable, with \(i \in V\) and \(j \in \mathcal {N}_i\).
- 2.
Objective functions \(f(\mathbf {x})= \mathbf {w}_{\mathbf {x}}^\top \mathbf {x}\) with \(\mathbf {w}_{\mathbf {x}} = - [\varvec{\theta }_1; \ldots ; \varvec{\theta }_{|V|}]\). \(h(\mathbf {y}) = g(\mathbf {y}) + \mathbf {w}_{\mathbf {y}}^\top \mathbf {y}\), with \(\mathbf {w}_{\mathbf {y}} = [\mathbf {w}_1; \ldots ; \mathbf {w}_{|V|}]\) and \(\mathbf {w}_{i} = -[\varvec{0}; \frac{1}{|\mathcal {N}_{\alpha _{i,1}}|} \varvec{\theta }_{\alpha _{i,1}}; \ldots ; \frac{1}{|\mathcal {N}_{\alpha _{i,|\mathcal {N}_i|}}|} \varvec{\theta }_{\alpha _{i,|\mathcal {N}_i|}}]\), where \(\mathcal {N}_{\alpha } = \{ i \mid (i, \alpha ) \in E\}\) is the set of neighboring variable nodes connected to the \(\alpha \)-th factor. \(g(\mathbf {y}) = \mathbb {I}(\varvec{\upsilon } \in \mathcal {S}) + \sum _{\alpha \in F} \mathbb {I}(\varvec{\mu }_{\alpha } \in \Delta ^{|\mathcal {X}_{\alpha }|})\), with \(\mathbb {I}(a)\) being the indicator function: \(\mathbb {I}(a)=0\) if a is true, and \(\mathbb {I}(a)=\infty \) otherwise.
- 3.
Constraint matrices The constraint matrix \(\mathbf {A}= \text {diag}(\mathbf {A}_1, \ldots , \mathbf {A}_i, \ldots , \mathbf {A}_{|V|})\) with \(\mathbf {A}_i = [\mathbf {I}_{|\mathcal {X}_i|}; \ldots ; \mathbf {I}_{|\mathcal {X}_i|} ] \in \{0,1\}^{(|\mathcal {N}_i| +1)|\mathcal {X}_i| \times |\mathcal {X}_i|}\). \(\mathbf {B}= \text {diag}(\mathbf {B}_1, \ldots , \mathbf {B}_i, \ldots , \mathbf {B}_{|V|})\), with \(\mathbf {B}_i = \text {diag}(\mathbf {I}_{|\mathcal {X}_i|}, \mathbf {M}_{i, \alpha _{i,1}}, \ldots , \mathbf {M}_{i, \alpha _{i, |\mathcal {N}_i|}} )\). \(\mathbf {A}\) summarizes all constraints on \(\varvec{\mu }_V\), while \(\mathbf {B}\) collects all constraints on \(\varvec{\mu }_F\) and \(\varvec{\upsilon }\).
Note that Problem (32) has a clear structure with two groups of variables, corresponding to the augmented factor graph (see Fig. 1c).
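To make this block structure concrete, here is a small numpy sketch (our own toy illustration, not the authors' code): one variable node with \(|\mathcal {X}_i| = 3\) labels and \(|\mathcal {N}_i| = 2\) hypothetical pairwise factors, whose marginalization matrices \(\mathbf {M}_{i,\alpha }\) are assumed to sum a vectorized \(3 \times 3\) factor table over the other variable.

```python
import numpy as np
from scipy.linalg import block_diag

# Toy illustration: a single variable node i with |X_i| = 3 labels and
# |N_i| = 2 hypothetical pairwise factors.
k_i, n_nb = 3, 2
I = np.eye(k_i)
A_i = np.vstack([I] * (n_nb + 1))       # shape (9, 3): |N_i|+1 stacked identities

# M_{i,alpha}: assumed marginalization of a vectorized 3x3 factor table over the
# other variable, so M @ mu_alpha recovers the marginal of mu_alpha over X_i.
M = np.kron(I, np.ones((1, k_i)))       # shape (3, 9)
B_i = block_diag(I, M, M)               # shape (9, 21): diag(I, M_{i,a1}, M_{i,a2})

# With |V| = 1 the global matrices coincide with the single blocks.
A, B = A_i, B_i
print(A.shape, B.shape)                 # (9, 3) and (9, 21): A is tall (full column
                                        # rank), B is full row rank
```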
According to the analysis presented in Wang et al. (2017), a sufficient condition to ensure the global convergence of the ADMM algorithm for the problem \(\hbox {LS}{-}\hbox {LP}(\varvec{\theta })\) is that \(\text {Im}(\mathbf {B}) \subseteq \text {Im}(\mathbf {A})\), with \(\text {Im}(\mathbf {A})\) being the image of \(\mathbf {A}\), i.e., the column space of \(\mathbf {A}\). However, \(\mathbf {A}\) in (32) is full column rank, rather than full row rank, while \(\mathbf {B}\) is full row rank. To satisfy this sufficient condition, we introduce a sufficiently small perturbation to both the objective function and the constraint in (32), as follows
where \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\) with a sufficiently small constant \(\epsilon > 0\), then \(\hat{\mathbf {A}}\) is full row rank. \(\hat{\mathbf {x}} = [\mathbf {x}; \bar{\mathbf {x}}]\), with \(\bar{\mathbf {x}} = [\bar{\mathbf {x}}_1; \ldots ; \bar{\mathbf {x}}_{|V|}] \in \mathbb {R}^{\sum _i^{V} (|\mathcal {N}_i|+1) |\mathcal {X}_i|}\) and \(\bar{\mathbf {x}}_i = [\varvec{\mu }_i; \ldots ; \varvec{\mu }_i] \in \mathbb {R}^{(|\mathcal {N}_i|+1) |\mathcal {X}_i|}\). \(\hat{f}(\hat{\mathbf {x}}) = f(\mathbf {x}) + \frac{1}{2}\epsilon \hat{\mathbf {x}}^\top \hat{\mathbf {x}}\). Consequently, \(\text {Im}(\hat{\mathbf {A}}) \equiv \text {Im}(\mathbf {B}) \subseteq \mathbb {R}^{\text {rank of } ~ \hat{\mathbf {A}}}\), as both \(\hat{\mathbf {A}}\) and \(\mathbf {B}\) are full row rank. Then, the sufficient condition \(\text {Im}(\mathbf {B}) \subseteq \text {Im}(\hat{\mathbf {A}})\) holds.
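Continuing the toy sketch above (again our own illustration), one can check numerically that \(\mathbf {A}\) is full column rank while \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\) is full row rank:

```python
import numpy as np

# A toy tall A (a stack of identities, as in the previous sketch): full column
# rank but not full row rank.
A = np.vstack([np.eye(3)] * 3)                      # shape (9, 3)
eps = 1e-3                                          # sufficiently small perturbation
A_hat = np.hstack([A, eps * np.eye(A.shape[0])])    # A_hat = [A, eps*I], shape (9, 12)
print(np.linalg.matrix_rank(A),                     # 3  -> rank-deficient rows
      np.linalg.matrix_rank(A_hat),                 # 9  -> full row rank
      A.shape[0])
```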
The augmented Lagrangian function of (33) is formulated as
The updates of the ADMM algorithm to optimize (33) are as follows
The optimality conditions of the variable sequence \((\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) generated above are
The convergence of this perturbed ADMM algorithm for the LS–LP problem is summarized in Theorem 2. The detailed proof is presented in the following sub-sections sequentially. Note that hereafter \(\Vert \cdot \Vert \) indicates the \(\ell _2\) norm for a vector, or the Frobenius norm for a matrix; \(\mathcal {A}_1 \succeq \mathcal {A}_2\) represents that \(\mathcal {A}_1 - \mathcal {A}_2\) is positive semi-definite, with \(\mathcal {A}_1, \mathcal {A}_2\) being square matrices; \(\nabla \) denotes the gradient operator, \(\nabla ^2\) means the Hessian operator, and \(\partial \) is the sub-gradient operator; \(\mathbf {I}\) represents the identity matrix with compatible shape.
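As a concrete illustration of these updates, the following numpy sketch shows one possible implementation of the perturbed ADMM iteration for (33). It is our own sketch under the definitions above, not the authors' code: the \(\mathbf {y}\)-subproblem solver prox_y (which in LS–LP reduces to simplex and \(\ell _2\)-sphere projections) is assumed to be supplied by the caller, and the default \(\rho = 2/\epsilon \) follows the instantiation discussed in the Properties subsection below.

```python
import numpy as np

def perturbed_admm(w_x, A, B, prox_y, eps=1e-3, rho=None, iters=1000, tol=1e-8):
    """Sketch of the perturbed ADMM for
           min_{x_hat, y}  f_hat(x_hat) + h(y)   s.t.  A_hat x_hat = B y,
    with f_hat(x_hat) = [w_x; 0]' x_hat + (eps/2)||x_hat||^2 and A_hat = [A, eps*I].
    `prox_y(v, rho)` must return argmin_y h(y) + (rho/2)||B y - v||^2; its exact form
    (simplex/sphere projections in LS-LP) is left to the caller here.
    """
    m, n = A.shape
    rho = 2.0 / eps if rho is None else rho          # any rho > 1/eps works in the analysis
    A_hat = np.hstack([A, eps * np.eye(m)])          # full row rank by construction
    w_hat = np.concatenate([w_x, np.zeros(m)])       # linear term of f_hat
    x_hat, lam = np.zeros(n + m), np.zeros(m)
    H = eps * np.eye(n + m) + rho * A_hat.T @ A_hat  # Hessian of the x_hat-subproblem
    for _ in range(iters):
        # y-update (uses the previous x_hat):
        y = prox_y(A_hat @ x_hat + lam / rho, rho)
        # x_hat-update: exact minimizer of a strongly convex quadratic
        x_hat = np.linalg.solve(H, -w_hat - A_hat.T @ lam + rho * A_hat.T @ (B @ y))
        # dual update: lam^{k+1} = lam^k + rho (A_hat x_hat^{k+1} - B y^{k+1})
        lam = lam + rho * (A_hat @ x_hat - B @ y)
        if np.linalg.norm(A_hat @ x_hat - B @ y) < tol:   # primal residual small
            break
    return x_hat[:n], y, lam
```

The dual update sign in this sketch matches the rearrangement \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} - \varvec{\lambda }^k)\) of (38) used later in the analysis.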
1.1 Properties
In this section, we present some important properties of the objective function and constraints in (33), which will be used in the subsequent convergence analysis.
Properties on objective functions (P1)
-
(P1.1) f, h and \(\mathcal {L}_{\rho , \epsilon }\) are semi-algebraic, lower semi-continuous functions and satisfy the Kurdyka–Łojasiewicz (KL) property, and h is closed and proper
-
(P1.2) There exist \(\mathcal {Q}_1, \mathcal {Q}_2\) such that \(\mathcal {Q}_1 \succeq \nabla ^2 \hat{f}(\hat{\mathbf {x}}) \succeq \mathcal {Q}_2\), \(\forall \hat{\mathbf {x}}\)
-
(P1.3) \(\lim \inf _{\Vert \hat{\mathbf {x}} \Vert \rightarrow \infty } \Vert \nabla \hat{f}(\hat{\mathbf {x}}) \Vert = \infty \)
Properties on constraint matrices (P2)
-
(P2.1) There exists \(\sigma > 0\) such that \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top \succeq \sigma \mathbf {I}\)
-
(P2.2) \(\mathcal {Q}_2 + \rho \hat{\mathbf {A}}^\top \hat{\mathbf {A}} \succeq \delta \mathbf {I}\) for some \(\rho , \delta > 0\), and \(\rho > \frac{1}{\epsilon } \)
-
(P2.3) There exists \(\mathcal {Q}_3 \succeq [\nabla ^2 \hat{f}(\hat{\mathbf {x}})]^2, \forall \hat{\mathbf {x}}\), and \(\delta \mathbf {I}\succ \frac{2}{\sigma \rho } \mathcal {Q}_3\)
-
(P2.4) Both \(\hat{\mathbf {A}}\) and \(\mathbf {B}\) are full row rank, and \(\text {Im}(\hat{\mathbf {A}}) \equiv \text {Im}(\mathbf {B}) \subseteq \mathbb {R}^{\text {rank of }~ \hat{\mathbf {A}}}\)
Remark
(1) Although the definition of the KL property (see Definition 1) is somewhat complex, it holds for many widely used functions, according to Xu and Yin (2013). Typical functions satisfying the KL property include: (a) real analytic functions, and any polynomial function such as \(\Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert \) belongs to this type; (b) locally strongly convex functions, such as the logistic loss function \(\log (1+\exp (-\mathbf {x}))\); (c) semi-algebraic functions, such as \(\Vert \mathbf {x}\Vert _1, \Vert \mathbf {x}\Vert _2,\Vert \mathbf {x}\Vert _{\infty }, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _1, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _2, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _{\infty }\) and the indicator function \(\mathbb {I}(\cdot )\). It is easy to verify that P1.1 holds in our problem. (2) Here we provide an instantiation of the above hyper-parameters satisfying the above properties. First, it is easy to obtain that \(\nabla ^2 \hat{f}(\hat{\mathbf {x}}) = \epsilon \mathbf {I}\), and \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top = [\mathbf {A}, \epsilon \mathbf {I}] [\mathbf {A}, \epsilon \mathbf {I}]^\top = \mathbf {A}\mathbf {A}^\top + \epsilon ^2 \mathbf {I}\succ \epsilon ^2 \mathbf {I}\), as well as \(\rho \hat{\mathbf {A}}^\top \hat{\mathbf {A}} \succeq \epsilon \mathbf {I}\), when \(\epsilon \) is small enough and \(\rho > \frac{1}{\epsilon }\) (e.g., \(\rho = \frac{2}{\epsilon }\)). Then, the values \(\mathcal {Q}_1 = \mathcal {Q}_2 = \epsilon \mathbf {I}, \mathcal {Q}_3 = \epsilon ^2 \mathbf {I}, \delta = 2 \epsilon , \sigma = \epsilon ^2\) satisfy P1.2, P2.1, P2.2 and P2.3. Without loss of generality, we will adopt these specific values for these hyper-parameters to simplify the following analysis, while only keeping \(\rho \) and \(\epsilon \).
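As a quick sanity check (our own expansion, not part of the original remark), P2.1 and P2.3 can be verified directly for this instantiation:
$$\begin{aligned} \hat{\mathbf {A}} \hat{\mathbf {A}}^\top = \mathbf {A}\mathbf {A}^\top + \epsilon ^2 \mathbf {I}\succeq \epsilon ^2 \mathbf {I}= \sigma \mathbf {I}, \qquad \mathcal {Q}_3 = \epsilon ^2 \mathbf {I}\succeq [\nabla ^2 \hat{f}(\hat{\mathbf {x}})]^2, \qquad \frac{2}{\sigma \rho } \mathcal {Q}_3 = \frac{2}{\rho } \mathbf {I}\prec 2 \epsilon \mathbf {I}= \delta \mathbf {I}, \end{aligned}$$
where the last relation holds exactly when \(\rho > \frac{1}{\epsilon }\); with the choice \(\rho = \frac{2}{\epsilon }\) it reads \(\epsilon \mathbf {I}\prec 2\epsilon \mathbf {I}\).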
1.2 Decrease of \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^{k})\)
In this section, we first prove the decreasing property of the augmented Lagrangian function, i.e.,
Firstly, utilizing P2.1, P2.3 and (37), we obtain that
Then, we have
According to P1.2 and P2.2, \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}, \varvec{\lambda }^{k})\) is strongly convex with respect to \(\hat{\mathbf {x}}\), with the parameter of at least \(2 \epsilon \). Then, we have
As \(\mathbf {y}^{k+1}\) is the minimal solution of \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k})\), it is easy to see that
Combining (41), (42) and (43), we have
where the last inequality utilizes P2.3 and \(\rho > \frac{1}{\epsilon }\).
1.3 Boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\)
Next, we prove the boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\). We suppose that \(\rho \) is large enough such that there is \(0<\gamma <\rho \) with
According to (44), for any \(k \ge 1\), we have
Besides, according to P2.1, we have
Plugging (47) into (46), we obtain that
According to the coerciveness of \(\nabla \hat{f}(\hat{\mathbf {x}}^k)\) (i.e., P1.3), we obtain that \(\Vert \hat{\mathbf {x}}^k \Vert < \infty , \forall k\), i.e., the boundedness of \(\{\hat{\mathbf {x}}^k\}\). From (47), we know the boundedness of \(\{\varvec{\lambda }^k\}\). Besides, according to P2.4, \(\{\hat{\mathbf {A}}\hat{\mathbf {x}}^k\}\) is also bounded. From (38), we obtain the boundedness of \(\{\mathbf {B}\mathbf {y}^k\}\). Considering the full row rank of \(\mathbf {B}\) (i.e., P2.4), the boundedness of \(\{\mathbf {y}^k\}\) is proved.
1.4 Convergence of Residual
According to the boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\), there is a sub-sequence \(\{\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}\}\) that converges to a cluster point \(\{\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*\}\). Considering the lower semi-continuity of \(\mathcal {L}_{\rho , \epsilon }\) (i.e., P1.1), we have
Summing (44) from \(k = M, \ldots , N-1\) with \(M \ge 1\), we have
Then, by setting \(N = k_i\) and \(M=1\), we have
Taking limit on both sides of the above inequality, we obtain
It implies that
Besides, according to (40), it is easy to obtain that
Moreover, utilizing \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} - \varvec{\lambda }^k)\) from (38), we have
Besides, as shown in Lemma 1 in Wang et al. (2017), the full row rank of \(\mathbf {B}\) (i.e., P2.4) implies that
where \(\bar{M} > 0\) is a constant. Taking limit on both sides of (55) and utilizing (56), we obtain
Combining (53), (54) and (57), we obtain that
By setting \(k+1 = k_i\), plugging (53) into (36) and (54) into (37), and taking the limit \(k_i \rightarrow \infty \), we obtain the KKT conditions. This shows that the cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is a KKT point of \(\hbox {LS}{-}\hbox {LP}(\varvec{\theta };\varvec{\epsilon })\) (i.e., (33)).
1.5 Global Convergence
Inspired by the analysis presented in Li and Pong (2015), in this section we will prove the following conclusions:
\(\sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert < \infty \);
\(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\) converges to \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\);
\((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is the KKT point of (33).
Firstly, utilizing the optimality conditions (36, 37, 38), we have that
Further, combining with (40), there exists a constant \(C>0\) such that
where \(\text {dist}(\cdot , \cdot )\) denotes the distance between a vector and a set of vectors. Hereafter we denote \(\partial _{(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda })} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) as \(\partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) for clarity. Besides, the relation (44) implies that there is a constant \(D \in (0, \epsilon - \frac{1}{\rho })\) such that
Moreover, the relation (49) implies that \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \}\) is lower bounded along the convergent sub-sequence \(\{ (\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \}\). Combining this with its decreasing property, the limit of \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \}\) exists. Next, we will show that
To prove it, we utilize the fact that \(\mathbf {y}^{k+1}\) is the minimizer of \( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \), such that
Combining the above relation, (58) and the continuity of \(\mathcal {L}_{\rho , \epsilon }\) w.r.t. \(\hat{\mathbf {x}}\) and \(\varvec{\lambda }\), the following relation holds along the sub-sequence \(\{ (\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \}\) that converges to \( (\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}) \),
According to (58), the sub-sequence \(\{ (\mathbf {y}^{k_i+1}, \hat{\mathbf {x}}^{k_i+1}, \varvec{\lambda }^{k_i+1}) \}\) also converges to \( (\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}) \). Then, utilizing the lower semi-continuity of \(\mathcal {L}_{\rho , \epsilon }\), we have
Combining (66) with (67), we know the existence of the limit of the sequence \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \}\), which proves the relation (64).
As \(\mathcal {L}_{\rho , \epsilon }\) is a KL function, according to Definition 1, it has the following properties:
There exist a constant \(\eta \in (0, \infty ]\), a continuous concave function \(\varphi : [0, \eta ) \rightarrow \mathbb {R}_{+}\) that is differentiable on \((0, \eta )\) with positive derivatives, and a neighbourhood \(\mathcal {V}\) of \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\).
For all \((\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) \in \mathcal {V}\) satisfying \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) < l^* + \eta \), we have
$$\begin{aligned} \varphi '( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) - l^* ) \text {dist}(0, \partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) ) \ge 1. \end{aligned}$$(68)
Then, we define the following neighborhood sets:
where \(\zeta > 0\) is a small constant.
Utilizing the relations (37) and (38), as well as P2.1, we obtain that for any \(k\ge 1\), the following relation holds:
Also, the relations (37) and (38) imply that for any \(k\ge 1\), we have
Moreover, the relation (58) implies that \(\exists N_0 \ge 1\) such that \(\forall k \ge N_0\), we have
Similar to (56), the full row rank of \(\mathbf {B}\) implies \(\Vert \mathbf {y}^k - \mathbf {y}^* \Vert \le \bar{M} \Vert \mathbf {B}(\mathbf {y}^k - \mathbf {y}^*) \Vert \). Then, plugging (73) into (72), we obtain that
for any \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(k \ge N_0\). Combining (71) and (74), we know that if \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(k \ge N_0\), then \((\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \in \mathcal {V}_{\zeta } \subseteq \mathcal {V}\).
Moreover, (44) and (64) imply that \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \ge l^*, \forall k \ge 1\). Besides, as \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) is a cluster point, there exists \(N \ge N_0\) such that the following relations hold:
Next, we will show that if \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) < l^* + \eta \) hold for some fixed \(k \ge N_0\), then the following relation holds
To prove (76), we utilize the fact that \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}, k \ge N_0\) implies that \((\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \in \mathcal {V}_{\zeta } \subseteq \mathcal {V}\). Combining this with \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) < l^* + \eta \), we obtain that
Combining the relations (62), (63) and (77), as well as the concavity of \(\varphi \), we obtain that
for all such k. Taking the square root on both sides of (78) and utilizing the fact that \(a+b \ge 2 \sqrt{a b}\), we prove (76).
We then prove \(\forall k \ge N, \hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds. This claim can be proved through induction. Obviously it is true for \(k=N\) by construction, as shown in (75). For \(k=N+1\), we have
where the first inequality utilizes (63), and the last inequality follows from the last relation in (75). Thus, \(\hat{\mathbf {x}}^{N+1} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds.
Next, we suppose that \(\hat{\mathbf {x}}^N, \ldots , \hat{\mathbf {x}}^{N+t-1} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) for some \(t>1\), and we need to prove that \(\hat{\mathbf {x}}^{N+t} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) also holds, i.e.,
where \(\varphi ^{N+i} = \varphi (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N+i}, \hat{\mathbf {x}}^{N+i}, \varvec{\lambda }^{N+i})-l^*) \) and \(\mathcal {L}_{\rho , \epsilon }^N = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N}, \hat{\mathbf {x}}^{N}, \varvec{\lambda }^{N})\). The second inequality follows from (76). The fourth inequality follows from (63). The fifth inequality utilizes the fact that \(\mathcal {L}_{\rho , \epsilon }^{N+1} > l^*\), and the last inequality follows from the last relation in (75). Thus, \(\hat{\mathbf {x}}^{N+t} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds. We have proved that \(\forall k \ge N, \hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds by induction.
Then, according to \(\forall k \ge N, \hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\), we can sum both sides of (76) from \(k=N\) to \(\infty \), to obtain that
which implies that \(\sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert < \infty \) holds. Thus \(\{ \hat{\mathbf {x}}^k \}\) converges. The convergence of \(\{ \mathbf {y}^k \}\) follows from \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} -\varvec{\lambda }^k)\) in (38) and (58), as well as the surjectivity of \(\mathbf {B}\) (i.e., full row rank). The convergence of \(\{ \varvec{\lambda }^k \}\) follows from \(\nabla \hat{f}(\hat{\mathbf {x}}^{k+1}) = - \hat{\mathbf {A}}^\top \varvec{\lambda }^{k+1}\) in (37) and the surjectivity of \(\hat{\mathbf {A}}\) (i.e., full row rank). Consequently, \(\{ \mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k \}\) converges to the cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\). The conclusion that \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is the KKT point of Problem (33) has been proved in Sect. A.4.
1.6 \(\epsilon \)-KKT Point of the Original LS–LP Problem
Proposition 1
The globally converged solution \((\mathbf {y}^*, \mathbf {x}^*, \varvec{\lambda }^*)\) produced by the ADMM algorithm for the perturbed LS–LP problem (33) is the \(\epsilon \)-KKT solution to the original LS–LP problem (32).
Proof
The globally converged solution \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) to the perturbed LS–LP problem (33) satisfies the following relations:
Recalling the definitions \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\), \(\hat{\mathbf {x}} = [\mathbf {x}; \bar{\mathbf {x}}]\) and \(\hat{f}(\hat{\mathbf {x}}) = f(\mathbf {x}) + \frac{\epsilon }{2} \hat{\mathbf {x}}^\top \hat{\mathbf {x}}\), the above relations imply that
where we utilize the boundedness of \(\hat{\mathbf {x}}^*\). Thus, according to Definition 2, the globally converged point \((\mathbf {y}^*, \mathbf {x}^*)\) is the \(\epsilon \)-KKT solution to the original LS–LP problem (32). \(\square \)
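To make the blockwise expansion behind this proof explicit (our own sketch from the definitions above, since the displayed relations are omitted here), the stationarity and feasibility conditions of the perturbed problem split as
$$\begin{aligned} \nabla \hat{f}(\hat{\mathbf {x}}^*) = - \hat{\mathbf {A}}^\top \varvec{\lambda }^* \;\Longleftrightarrow \; \nabla f(\mathbf {x}^*) + \epsilon \mathbf {x}^* = - \mathbf {A}^\top \varvec{\lambda }^* \;\; \text {and} \;\; \epsilon \bar{\mathbf {x}}^* = - \epsilon \varvec{\lambda }^*, \qquad \hat{\mathbf {A}} \hat{\mathbf {x}}^* = \mathbf {B}\mathbf {y}^* \;\Longleftrightarrow \; \mathbf {A}\mathbf {x}^* - \mathbf {B}\mathbf {y}^* = - \epsilon \bar{\mathbf {x}}^*, \end{aligned}$$
so that \(\Vert \nabla f(\mathbf {x}^*) + \mathbf {A}^\top \varvec{\lambda }^* \Vert = \epsilon \Vert \mathbf {x}^* \Vert = O(\epsilon )\) and \(\Vert \mathbf {A}\mathbf {x}^* - \mathbf {B}\mathbf {y}^* \Vert = \epsilon \Vert \bar{\mathbf {x}}^* \Vert = O(\epsilon )\), again using the boundedness of \(\hat{\mathbf {x}}^*\).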
1.7 Convergence Rate
Lemma 3
First, without loss of generality, we can assume that \(l^* = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*) = 0\) (e.g., one can replace \(l_k = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k)\) by \(l_k - l^*\)). We further assume that \(\mathcal {L}_{\rho , \epsilon }\) has the KL property at \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) with the concave function \(\varphi (s) = c s^{1-p}\), where \(p \in [0, 1), c>0\). Consequently, we obtain the following conclusions:
- (i)
If \(p=0\), then \(\{(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k)\}_{k = 1, \ldots , \infty }\) converges in finitely many steps;
- (ii)
If \(p \in (0,\frac{1}{2}]\), then there exist \(c>0\) and \(\tau \in (0,1)\) such that \(\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert \le c \tau ^k\);
- (iii)
If \(p \in (\frac{1}{2},1)\), then there exists \(c>0\) such that \(\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert \le c k^{-\frac{1-p}{2p -1}} \).
Proof
(i) If \(p=0\), we define a subset \(H = \{k \in \mathbb {N}: \hat{\mathbf {x}}^{k} \ne \hat{\mathbf {x}}^{k+1} \}\). If \(k \in H\) is sufficiently large, then there exists \(C_3>0\) such that
Combining with (63), we have
If the subset H were infinite, this would contradict the fact that \(l_k - l_{k+1} \rightarrow 0\) as \(k\rightarrow \infty \). Thus, H is a finite subset, leading to the conclusion that \(\{\hat{\mathbf {x}}^k\}_{k \in \mathbb {N}}\) converges in finitely many steps. Recalling the relationships between \(\hat{\mathbf {x}}^k\) and \(\mathbf {y}^k, \varvec{\lambda }^k\) (see the descriptions under (81)), we also obtain that \(\{ \mathbf {y}^k, \varvec{\lambda }^k \}_{k \in \mathbb {N}}\) converges in finitely many steps.
By defining \(\bigtriangleup _k = \sum _{j=k}^{\infty } \Vert \hat{\mathbf {x}}^{j+1} - \hat{\mathbf {x}}^j \Vert \), the inequality (81) can be rewritten as follows
Besides, the KL property and \(l^* = 0\) give that
Combining with (62), we obtain
Then, inserting (89) into (87), we obtain
(ii) If \(p \in (0, \frac{1}{2}]\), then \(\frac{1-p}{p}\ge 1\). Besides, since \((\bigtriangleup _{k-1} - \bigtriangleup _k) \rightarrow 0\) when \(k \rightarrow \infty \), there exists an integer \(K_0\) such that \((\bigtriangleup _{k-1} - \bigtriangleup _k) < 1\) for all \(k \ge K_0\), which implies \((\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} \le (\bigtriangleup _{k-1} - \bigtriangleup _k)\). Inserting it into (90), we obtain that
It is easy to deduce that \(\bigtriangleup _k \le (\bigtriangleup _{K_0} \tau ^{-K_0}) \tau ^k = \frac{c}{2} \tau ^k\), with c being a positive constant. Note that the superscript k in \(\tau ^k\) denotes the k-th power of \(\tau \), rather than an iteration index. Combining with \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \bigtriangleup _k\), it is easy to obtain that \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \frac{c}{2} \tau ^k\) with \(\tau \in (0,1)\) and c being a positive constant. Then, we have
(iii) If \(p \in (\frac{1}{2}, 1)\), then \(\frac{1-p}{p}<1\), and hence \((\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} > (\bigtriangleup _{k-1} - \bigtriangleup _k)\) for all sufficiently large k. Inserting it into (90), we obtain that
It has been shown in Theorem 2 of Attouch and Bolte (2009) that the above inequality implies \(\bigtriangleup _k \le \frac{c}{2} k^{- \frac{1-p}{2p-1}}\), with c being a positive constant. Since \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \bigtriangleup _k\), we have that \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \frac{c}{2} k^{- \frac{1-p}{2p-1}}\). Then, we have
\(\square \)
Proposition 2
We adopt the same assumptions as in Lemma 3. Then,
- (i)
If \(p =0\), then we will obtain the \(\epsilon \)-KKT solution to the LS–LP problem in finitely many steps.
- (ii)
If \(p \in (0, \frac{1}{2}]\), then we will obtain the \(\epsilon \)-KKT solution to the LS–LP problem after at most \(O\big (\log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2\big )\) steps.
- (iii)
If \(p \in ( \frac{1}{2}, 1)\), then we will obtain the \(\epsilon \)-KKT solution to the LS–LP problem after at most \( O\big ( (\frac{1}{\epsilon })^{\frac{4p-2}{1-p}}\big )\) steps.
Proof
The conclusion (i) directly holds from Lemma 3(i).
According to the optimality condition (36), we have
According to the optimality condition (38) and the relation (40), we obtain that
According to Lemma 3, we have
- (ii)
If \(p \in (0, \frac{1}{2}]\), then
$$\begin{aligned} O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2 \le O\left( \frac{1}{\epsilon }\right) \tau ^k \le O(\epsilon ) \;\Rightarrow \; k \ge O\left( \log _{\frac{1}{\tau }}\left( \frac{1}{\epsilon }\right) ^2\right) , \end{aligned}$$ which means that when \(k \ge O\big (\log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2\big )\), we will obtain the \(\epsilon \)-KKT solution to the perturbed LS–LP problem, i.e., the \(\epsilon \)-KKT solution to the original LS–LP problem.
- (iii)
If \(p \in ( \frac{1}{2}, 1)\), then
$$\begin{aligned} O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2 \le O\left( \frac{1}{\epsilon }\right) k^{- \frac{1-p}{2p-1}} \le O(\epsilon ) \;\Rightarrow \; k \ge O\left( \left( \frac{1}{\epsilon }\right) ^{\frac{4p-2}{1-p}}\right) , \end{aligned}$$ which means that when \(k \ge O\big ( (\frac{1}{\epsilon })^{\frac{4p-2}{1-p}}\big )\), we will obtain the \(\epsilon \)-KKT solution to the perturbed LS–LP problem, i.e., the \(\epsilon \)-KKT solution to the original LS–LP problem. \(\square \)
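For a rough sense of scale (our own arithmetic, not a result from the paper): with \(\epsilon = 10^{-3}\) and \(\tau = 0.9\), the bound in case (ii) of Proposition 2 requires roughly \(k \ge \log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2 = 2\ln (10^3)/\ln (10/9) \approx 131\) iterations.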