Abstract
Maximum a posteriori (MAP) inference is an important task for graphical models. Due to complex dependencies among variables in realistic models, finding an exact solution for MAP inference is often intractable. Thus, many approximation methods have been developed, among which the linear programming (LP) relaxation based methods show promising performance. However, one major drawback of LP relaxation is that it may produce fractional solutions. Instead of presenting a tighter relaxation, in this work we propose a continuous but equivalent reformulation of the original MAP inference problem, called LS–LP. We add the \(\ell _2\)-sphere constraint onto the original LP relaxation, leading to an intersection with the local marginal polytope that is equivalent to the space of all valid integer label configurations. Thus, LS–LP is equivalent to the original MAP inference problem. We propose a perturbed alternating direction method of multipliers (ADMM) algorithm to optimize the LS–LP problem, by adding a sufficiently small perturbation \(\epsilon \) onto the objective function and constraints. We prove that the perturbed ADMM algorithm globally converges to the \(\epsilon \)-Karush–Kuhn–Tucker (\(\epsilon \)-KKT) point of the LS–LP problem. We also analyze the convergence rate. Experiments on several benchmark datasets from the Probabilistic Inference Challenge (PIC 2011) and OpenGM 2 show competitive performance of our proposed method against state-of-the-art MAP inference methods.

References
Attouch, H., & Bolte, J. (2009). On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1–2), 5–16.
Attouch, H., Bolte, J., Redont, P., & Soubeyran, A. (2010). Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Lojasiewicz inequality. Mathematics of Operations Research, 35(2), 438–457.
Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological), 48(3), 259–279.
Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205–1223.
Bolte, J., Daniilidis, A., Lewis, A., & Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2), 556–572.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122.
Elidan, G., Globerson, A., & Heinemann, U. (2012). Pascal 2011 probabilistic inference challenge. Retrieved July 15, 2020, from http://www.cs.huji.ac.il/project/PASCAL/index.php.
Fu, Q., Wang, H., & Banerjee, A. (2013). Bethe-ADMM for tree decomposition based parallel MAP inference. In Uncertainty in artificial intelligence (p. 222). Citeseer.
Globerson, A., & Jaakkola, T. S. (2008). Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS (pp. 553–560).
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV (pp. 1–8). IEEE.
Jaimovich, A., Elidan, G., Margalit, H., & Friedman, N. (2006). Towards an integrated protein–protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2), 145–164.
Johnson, J. K., Malioutov, D. M., & Willsky, A. S. (2007). Lagrangian relaxation for MAP estimation in graphical models. ArXiv preprint arXiv:0710.0013.
Jojic, V., Gould, S., & Koller, D. (2010). Accelerated dual decomposition for MAP inference. In ICML (pp. 503–510).
Kappes, J. H., Andres, B., Hamprecht, F. A., Schnörr, C., Nowozin, S., Batra, D., et al. (2015). A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115, 155–184.
Kappes, J. H., Savchynskyy, B., & Schnörr, C. (2012). A bundle approach to efficient MAP-inference by Lagrangian relaxation. In CVPR (pp. 1688–1695). IEEE.
Karush, W. (1939). Minima of functions of several variables with inequalities as side constraints. M.Sc. Dissertation. Department of Mathematics, University of Chicago.
Kelley, J. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4), 703–712.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. Cambridge, MA: MIT Press.
Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1568–1583.
Komodakis, N., Paragios, N., & Tziritas, G. (2007) MRF optimization via dual decomposition: Message-passing revisited. In ICCV (pp. 1–8). IEEE.
Kschischang, F. R., Frey, B. J., & Loeliger, H. A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Kuhn, H. W., & Tucker, A. W. (2014). Nonlinear programming. In Traces and emergence of nonlinear programming (pp. 247–258). Springer.
Land, A. H., & Doig, A. G. (1960). An automatic method of solving discrete programming problems. Econometrica, 28, 497–520.
Laurent, M., & Rendl, F. (2002). Semidefinite programming and integer programming. Centrum voor Wiskunde en Informatica.
Li, G., & Pong, T. K. (2015). Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4), 2434–2460.
Łojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87–89.
Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., & Xing, E. P. (2011). An augmented Lagrangian approach to constrained MAP inference. In ICML.
Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., & Xing, E. P. (2015). AD3: Alternating directions dual decomposition for MAP inference in graphical models. Journal of Machine Learning Research, 16(1), 495–545.
Meshi, O., & Globerson, A. (2011). An alternating direction method for dual MAP LP relaxation. In Joint European conference on machine learning and knowledge discovery in databases (pp. 470–483). Springer.
Meshi, O., Mahdavi, M., & Schwing, A. (2015). Smooth and strong: MAP inference with linear convergence. In NIPS (pp. 298–306).
Otten, L., & Dechter, R. (2012). Anytime AND/OR depth-first search for combinatorial optimization. AI Communications, 25(3), 211–227.
Otten, L., Ihler, A., Kask, K., & Dechter, R. (2012). Winning the PASCAL 2011 MAP challenge with enhanced AND/OR branch-and-bound. In NIPS workshop DISCML. Citeseer.
Savchynskyy, B., Schmidt, S., Kappes, J., & Schnörr, C. (2012). Efficient MRF energy minimization via adaptive diminishing smoothing. ArXiv preprint arXiv:1210.4906.
Schwing, A. G., Hazan, T., Pollefeys, M., & Urtasun, R. (2012). Globally convergent dual MAP LP relaxation solvers using Fenchel–Young margins. In NIPS (pp. 2384–2392).
Schwing, A. G., Hazan, T., Pollefeys, M., & Urtasun, R. (2014). Globally convergent parallel MAP LP relaxation solver using the Frank–Wolfe algorithm. In ICML (pp. 487–495).
Sontag, D. A. (2010). Approximate inference in graphical models using LP relaxations. Ph.D. Thesis, Massachusetts Institute of Technology.
Sontag, D. A., Li, Y., et al. (2012). Efficiently searching for frustrated cycles in MAP inference. In UAI.
Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005). MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11), 3697–3717.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1–305.
Wang, Y., Yin, W., & Zeng, J. (2017). Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1), 29–63.
Wu, B., & Ghanem, B. (2019). \(\ell _p\)-box ADMM: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1695–1708.
Xu, Y., & Yin, W. (2013). A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3), 1758–1789.
Additional information
Communicated by Julien Mairal.
Baoyuan Wu was partially supported by Tencent AI Lab and King Abdullah University of Science and Technology (KAUST). Li Shen was supported by Tencent AI Lab. Bernard Ghanem was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR). Tong Zhang was supported by the Hong Kong University of Science and Technology (HKUST). Li Shen is the corresponding author.
Convergence Analysis
To facilitate the convergence analysis, we first restate some equations and notation defined in Sect. 5. Problem (11) can be simplified to the following general form:
Our explanation of (32) is divided into three parts:
- 1.
Variables \(\mathbf {x}= [ \varvec{\mu }_1; \ldots ; \varvec{\mu }_{|V|} ] \in \mathbb {R}^{\sum _{i}^{ V} |\mathcal {X}_i|}\) concatenates all variable nodes \(\varvec{\mu }_V\). \(\mathbf {y}= [\mathbf {y}_1; \ldots ; \mathbf {y}_{|V|}]\) with \(\mathbf {y}_i = [\varvec{\upsilon }_i; \varvec{\mu }_{\alpha _{i,1}}; \ldots ; \varvec{\mu }_{\alpha _{i,|\mathcal {N}_i|}}] \in \mathbb {R}^{|\mathcal {X}_i| + \sum _{\alpha }^{\mathcal {N}_i} |\mathcal {X}_{\alpha }|}\). \(\mathbf {y}\) concatenates all factor nodes \(\varvec{\mu }_F\) and the extra variable nodes \(\varvec{\upsilon }\); \(\mathbf {y}_i\) concatenates the factor nodes and the extra variable node connected to the i-th variable node \(\varvec{\mu }_i\). \(\mathcal {N}_i\) denotes the set of neighboring factor nodes connected to the i-th variable node; the subscript \(\alpha _{i,j}\) indicates the j-th factor connected to the i-th variable, with \(i \in V\) and \(j \in \mathcal {N}_i\).
- 2.
Objective functions \(f(\mathbf {x})= \mathbf {w}_{\mathbf {x}}^\top \mathbf {x}\) with \(\mathbf {w}_{\mathbf {x}} = - [\varvec{\theta }_1; \ldots ; \varvec{\theta }_{|V|}]\). \(h(\mathbf {y}) = g(\mathbf {y}) + \mathbf {w}_{\mathbf {y}}^\top \mathbf {y}\), with \(\mathbf {w}_{\mathbf {y}} = [\mathbf {w}_1; \ldots ; \mathbf {w}_{|V|}]\) and \(\mathbf {w}_{i} = -[\varvec{0}; \frac{1}{|\mathcal {N}_{\alpha _{i,1}}|} \varvec{\theta }_{\alpha _{i,1}}; \ldots ; \frac{1}{|\mathcal {N}_{\alpha _{i,|\mathcal {N}_i|}}|} \varvec{\theta }_{\alpha _{i,|\mathcal {N}_i|}}]\), where \(\mathcal {N}_{\alpha } = \{ i \mid (i, \alpha ) \in E\}\) is the set of neighboring variable nodes connected to the \(\alpha \)-th factor. \(g(\mathbf {y}) = \mathbb {I}(\varvec{\upsilon } \in \mathcal {S}) + \sum _{\alpha \in F} \mathbb {I}(\varvec{\mu }_{\alpha } \in \Delta ^{|\mathcal {X}_{\alpha }|})\), with \(\mathbb {I}(a)\) being the indicator function: \(\mathbb {I}(a)=0\) if a is true, and \(\mathbb {I}(a)=\infty \) otherwise.
- 3.
Constraint matrices The constraint matrix \(\mathbf {A}= \text {diag}(\mathbf {A}_1, \ldots , \mathbf {A}_i, \ldots , \mathbf {A}_{|V|})\) with \(\mathbf {A}_i = [\mathbf {I}_{|\mathcal {X}_i|}; \ldots ; \mathbf {I}_{|\mathcal {X}_i|} ] \in \{0,1\}^{(|\mathcal {N}_i| +1)|\mathcal {X}_i| \times |\mathcal {X}_i|}\). \(\mathbf {B}= \text {diag}(\mathbf {B}_1, \ldots , \mathbf {B}_i, \ldots , \mathbf {B}_{|V|})\), with \(\mathbf {B}_i = \text {diag}(\mathbf {I}_{|\mathcal {X}_i|}, \mathbf {M}_{i, \alpha _{i,1}}, \ldots , \mathbf {M}_{i, \alpha _{i, |\mathcal {N}_i|}} )\). \(\mathbf {A}\) summarizes all constraints on \(\varvec{\mu }_V\), while \(\mathbf {B}\) collects all constraints on \(\varvec{\mu }_F\) and \(\varvec{\upsilon }\).
Note that Problem (32) has a clear structure with two groups of variables, corresponding to the augmented factor graph (see Fig. 1c).
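To make this block structure concrete, here is a small numpy sketch (our own toy illustration, not the authors' code): one variable node with \(|\mathcal {X}_i| = 3\) labels and \(|\mathcal {N}_i| = 2\) hypothetical pairwise factors, whose marginalization matrices \(\mathbf {M}_{i,\alpha }\) are assumed to sum a vectorized \(3 \times 3\) factor table over the other variable.

```python
import numpy as np
from scipy.linalg import block_diag

# Toy illustration: a single variable node i with |X_i| = 3 labels and
# |N_i| = 2 hypothetical pairwise factors.
k_i, n_nb = 3, 2
I = np.eye(k_i)
A_i = np.vstack([I] * (n_nb + 1))       # shape (9, 3): |N_i|+1 stacked identities

# M_{i,alpha}: assumed marginalization of a vectorized 3x3 factor table over the
# other variable, so M @ mu_alpha recovers the marginal of mu_alpha over X_i.
M = np.kron(I, np.ones((1, k_i)))       # shape (3, 9)
B_i = block_diag(I, M, M)               # shape (9, 21): diag(I, M_{i,a1}, M_{i,a2})

# With |V| = 1 the global matrices coincide with the single blocks.
A, B = A_i, B_i
print(A.shape, B.shape)                 # (9, 3) and (9, 21): A is tall (full column
                                        # rank), B is full row rank
```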
According to the analysis presented in Wang et al. (2017), a sufficient condition to ensure the global convergence of the ADMM algorithm for the problem \(\hbox {LS}{-}\hbox {LP}(\varvec{\theta })\) is that \(\text {Im}(\mathbf {B}) \subseteq \text {Im}(\mathbf {A})\), with \(\text {Im}(\mathbf {A})\) being the image of \(\mathbf {A}\), i.e., the column space of \(\mathbf {A}\). However, \(\mathbf {A}\) in (32) is full column rank, rather than full row rank, while \(\mathbf {B}\) is full row rank. To satisfy this sufficient condition, we introduce a sufficiently small perturbation to both the objective function and the constraint in (32), as follows
where \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\) with a sufficiently small constant \(\epsilon > 0\), then \(\hat{\mathbf {A}}\) is full row rank. \(\hat{\mathbf {x}} = [\mathbf {x}; \bar{\mathbf {x}}]\), with \(\bar{\mathbf {x}} = [\bar{\mathbf {x}}_1; \ldots ; \bar{\mathbf {x}}_{|V|}] \in \mathbb {R}^{\sum _i^{V} (|\mathcal {N}_i|+1) |\mathcal {X}_i|}\) and \(\bar{\mathbf {x}}_i = [\varvec{\mu }_i; \ldots ; \varvec{\mu }_i] \in \mathbb {R}^{(|\mathcal {N}_i|+1) |\mathcal {X}_i|}\). \(\hat{f}(\hat{\mathbf {x}}) = f(\mathbf {x}) + \frac{1}{2}\epsilon \hat{\mathbf {x}}^\top \hat{\mathbf {x}}\). Consequently, \(\text {Im}(\hat{\mathbf {A}}) \equiv \text {Im}(\mathbf {B}) \subseteq \mathbb {R}^{\text {rank of } ~ \hat{\mathbf {A}}}\), as both \(\hat{\mathbf {A}}\) and \(\mathbf {B}\) are full row rank. Then, the sufficient condition \(\text {Im}(\mathbf {B}) \subseteq \text {Im}(\hat{\mathbf {A}})\) holds.
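Continuing the toy sketch above (again our own illustration), one can check numerically that \(\mathbf {A}\) is full column rank while \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\) is full row rank:

```python
import numpy as np

# A toy tall A (a stack of identities, as in the previous sketch): full column
# rank but not full row rank.
A = np.vstack([np.eye(3)] * 3)                      # shape (9, 3)
eps = 1e-3                                          # sufficiently small perturbation
A_hat = np.hstack([A, eps * np.eye(A.shape[0])])    # A_hat = [A, eps*I], shape (9, 12)
print(np.linalg.matrix_rank(A),                     # 3  -> rank-deficient rows
      np.linalg.matrix_rank(A_hat),                 # 9  -> full row rank
      A.shape[0])
```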
The augmented Lagrangian function of (33) is formulated as
The updates of the ADMM algorithm to optimize (33) are as follows
The optimality conditions of the variable sequence \((\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) generated above are
The convergence of this perturbed ADMM algorithm for the LS–LP problem is summarized in Theorem 2. The detailed proof is presented in the following sub-sections sequentially. Note that hereafter \(\Vert \cdot \Vert \) indicates the \(\ell _2\) norm for a vector, or the Frobenius norm for a matrix; \(\mathcal {A}_1 \succeq \mathcal {A}_2\) represents that \(\mathcal {A}_1 - \mathcal {A}_2\) is positive semi-definite, with \(\mathcal {A}_1, \mathcal {A}_2\) being square matrices; \(\nabla \) denotes the gradient operator, \(\nabla ^2\) means the Hessian operator, and \(\partial \) is the sub-gradient operator; \(\mathbf {I}\) represents the identity matrix with compatible shape.
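As a concrete illustration of these updates, the following numpy sketch shows one possible implementation of the perturbed ADMM iteration for (33). It is our own sketch under the definitions above, not the authors' code: the \(\mathbf {y}\)-subproblem solver prox_y (which in LS–LP reduces to simplex and \(\ell _2\)-sphere projections) is assumed to be supplied by the caller, and the default \(\rho = 2/\epsilon \) follows the instantiation discussed in the Properties subsection below.

```python
import numpy as np

def perturbed_admm(w_x, A, B, prox_y, eps=1e-3, rho=None, iters=1000, tol=1e-8):
    """Sketch of the perturbed ADMM for
           min_{x_hat, y}  f_hat(x_hat) + h(y)   s.t.  A_hat x_hat = B y,
    with f_hat(x_hat) = [w_x; 0]' x_hat + (eps/2)||x_hat||^2 and A_hat = [A, eps*I].
    `prox_y(v, rho)` must return argmin_y h(y) + (rho/2)||B y - v||^2; its exact form
    (simplex/sphere projections in LS-LP) is left to the caller here.
    """
    m, n = A.shape
    rho = 2.0 / eps if rho is None else rho          # any rho > 1/eps works in the analysis
    A_hat = np.hstack([A, eps * np.eye(m)])          # full row rank by construction
    w_hat = np.concatenate([w_x, np.zeros(m)])       # linear term of f_hat
    x_hat, lam = np.zeros(n + m), np.zeros(m)
    H = eps * np.eye(n + m) + rho * A_hat.T @ A_hat  # Hessian of the x_hat-subproblem
    for _ in range(iters):
        # y-update (uses the previous x_hat):
        y = prox_y(A_hat @ x_hat + lam / rho, rho)
        # x_hat-update: exact minimizer of a strongly convex quadratic
        x_hat = np.linalg.solve(H, -w_hat - A_hat.T @ lam + rho * A_hat.T @ (B @ y))
        # dual update: lam^{k+1} = lam^k + rho (A_hat x_hat^{k+1} - B y^{k+1})
        lam = lam + rho * (A_hat @ x_hat - B @ y)
        if np.linalg.norm(A_hat @ x_hat - B @ y) < tol:   # primal residual small
            break
    return x_hat[:n], y, lam
```

The dual update sign in this sketch matches the rearrangement \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} - \varvec{\lambda }^k)\) of (38) used later in the analysis.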
1.1 Properties
In this section, we present some important properties of the objective function and constraints in (33), which will be used in the subsequent convergence analysis.
Properties on objective functions (P1)
-
(P1.1) f, h and \(\mathcal {L}_{\rho , \epsilon }\) are semi-algebraic, lower semi-continuous functions and satisfy the Kurdyka–Łojasiewicz (KL) property, and h is closed and proper
-
(P1.2) There exist \(\mathcal {Q}_1, \mathcal {Q}_2\) such that \(\mathcal {Q}_1 \succeq \nabla ^2 \hat{f}(\hat{\mathbf {x}}) \succeq \mathcal {Q}_2\), \(\forall \hat{\mathbf {x}}\)
-
(P1.3) \(\lim \inf _{\Vert \hat{\mathbf {x}} \Vert \rightarrow \infty } \Vert \nabla \hat{f}(\hat{\mathbf {x}}) \Vert = \infty \)
Properties on constraint matrices (P2)
-
(P2.1) There exists \(\sigma > 0\) such that \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top \succeq \sigma \mathbf {I}\)
-
(P2.2) \(\mathcal {Q}_2 + \rho \hat{\mathbf {A}}^\top \hat{\mathbf {A}} \succeq \delta \mathbf {I}\) for some \(\rho , \delta > 0\), and \(\rho > \frac{1}{\epsilon } \)
-
(P2.3) There exists \(\mathcal {Q}_3 \succeq [\nabla ^2 \hat{f}(\hat{\mathbf {x}})]^2, \forall \hat{\mathbf {x}}\), and \(\delta \mathbf {I}\succ \frac{2}{\sigma \rho } \mathcal {Q}_3\)
-
(P2.4) Both \(\hat{\mathbf {A}}\) and \(\mathbf {B}\) are full row rank, and \(\text {Im}(\hat{\mathbf {A}}) \equiv \text {Im}(\mathbf {B}) \subseteq \mathbb {R}^{\text {rank of }~ \hat{\mathbf {A}}}\)
Remark
(1) Although the definition of the KL property (see Definition 1) is somewhat complex, it holds for many widely used functions, according to Xu and Yin (2013). Typical functions satisfying the KL property include: (a) real analytic functions, and any polynomial function such as \(\Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert \) belongs to this type; (b) locally strongly convex functions, such as the logistic loss function \(\log (1+\exp (-\mathbf {x}))\); (c) semi-algebraic functions, such as \(\Vert \mathbf {x}\Vert _1, \Vert \mathbf {x}\Vert _2,\Vert \mathbf {x}\Vert _{\infty }, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _1, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _2, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _{\infty }\) and the indicator function \(\mathbb {I}(\cdot )\). It is easy to verify that P1.1 holds in our problem. (2) Here we provide an instantiation of the above hyper-parameters satisfying the above properties. First, it is easy to obtain that \(\nabla ^2 \hat{f}(\hat{\mathbf {x}}) = \epsilon \mathbf {I}\), and \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top = [\mathbf {A}, \epsilon \mathbf {I}] [\mathbf {A}, \epsilon \mathbf {I}]^\top = \mathbf {A}\mathbf {A}^\top + \epsilon ^2 \mathbf {I}\succ \epsilon ^2 \mathbf {I}\), as well as \(\rho \hat{\mathbf {A}}^\top \hat{\mathbf {A}} \succeq \epsilon \mathbf {I}\), when \(\epsilon \) is small enough and \(\rho > \frac{1}{\epsilon }\) (e.g., \(\rho = \frac{2}{\epsilon }\)). Then, the values \(\mathcal {Q}_1 = \mathcal {Q}_2 = \epsilon \mathbf {I}, \mathcal {Q}_3 = \epsilon ^2 \mathbf {I}, \delta = 2 \epsilon , \sigma = \epsilon ^2\) satisfy P1.2, P2.1, P2.2 and P2.3. Without loss of generality, we will adopt these specific values for these hyper-parameters to simplify the following analysis, while only keeping \(\rho \) and \(\epsilon \).
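As a quick sanity check (our own expansion, not part of the original remark), P2.1 and P2.3 can be verified directly for this instantiation:
$$\begin{aligned} \hat{\mathbf {A}} \hat{\mathbf {A}}^\top = \mathbf {A}\mathbf {A}^\top + \epsilon ^2 \mathbf {I}\succeq \epsilon ^2 \mathbf {I}= \sigma \mathbf {I}, \qquad \mathcal {Q}_3 = \epsilon ^2 \mathbf {I}\succeq [\nabla ^2 \hat{f}(\hat{\mathbf {x}})]^2, \qquad \frac{2}{\sigma \rho } \mathcal {Q}_3 = \frac{2}{\rho } \mathbf {I}\prec 2 \epsilon \mathbf {I}= \delta \mathbf {I}, \end{aligned}$$
where the last relation holds exactly when \(\rho > \frac{1}{\epsilon }\); with the choice \(\rho = \frac{2}{\epsilon }\) it reads \(\epsilon \mathbf {I}\prec 2\epsilon \mathbf {I}\).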
1.2 Decrease of \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^{k})\)
In this section, we first prove the decreasing property of the augmented Lagrangian function, i.e.,
Firstly, utilizing P2.1, P2.3 and (37), we obtain that
Then, we have
According to P1.2 and P2.2, \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}, \varvec{\lambda }^{k})\) is strongly convex with respect to \(\hat{\mathbf {x}}\), with the parameter of at least \(2 \epsilon \). Then, we have
As \(\mathbf {y}^{k+1}\) is the minimal solution of \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k})\), it is easy to see that
Combining (41), (42) and (43), we have
where the last inequality utilizes P2.3 and \(\rho > \frac{1}{\epsilon }\).
1.3 Boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\)
Next, we prove the boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\). We suppose that \(\rho \) is large enough such that there is \(0<\gamma <\rho \) with
According to (44), for any \(k \ge 1\), we have
Besides, according to P2.1, we have
Plugging (47) into (46), we obtain that
According to the coerciveness of \(\nabla \hat{f}(\hat{\mathbf {x}}^k)\) (i.e., P1.3), we obtain that \(\Vert \hat{\mathbf {x}}^k \Vert < \infty , \forall k\), i.e., the boundedness of \(\{\hat{\mathbf {x}}^k\}\). From (47), we know the boundedness of \(\{\varvec{\lambda }^k\}\). Besides, according to P2.4, \(\{\hat{\mathbf {A}}\hat{\mathbf {x}}^k\}\) is also bounded. From (38), we obtain the boundedness of \(\{\mathbf {B}\mathbf {y}^k\}\). Considering the full row rank of \(\mathbf {B}\) (i.e., P2.4), the boundedness of \(\{\mathbf {y}^k\}\) is proved.
1.4 Convergence of Residual
According to the boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\), there is a sub-sequence \(\{\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}\}\) that converges to a cluster point \(\{\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*\}\). Considering the lower semi-continuity of \(\mathcal {L}_{\rho , \epsilon }\) (i.e., P1.1), we have
Summing (44) from \(k = M, \ldots , N-1\) with \(M \ge 1\), we have
Then, by setting \(N = k_i\) and \(M=1\), we have
Taking limit on both sides of the above inequality, we obtain
It implies that
Besides, according to (40), it is easy to obtain that
Moreover, utilizing \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} - \varvec{\lambda }^k)\) from (38), we have
Besides, as shown in Lemma 1 in Wang et al. (2017), the full row rank of \(\mathbf {B}\) (i.e., P2.4) implies that
where \(\bar{M} > 0\) is a constant. Taking limit on both sides of (55) and utilizing (56), we obtain
Combining (53), (54) and (57), we obtain that
By setting \(k+1 = k_i\), plugging (53) into (36) and (54) into (37), and taking the limit \(k_i \rightarrow \infty \), we obtain the KKT conditions. This shows that the cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is a KKT point of \(\hbox {LS}{-}\hbox {LP}(\varvec{\theta };\varvec{\epsilon })\) (i.e., (33)).
1.5 Global Convergence
Inspired by the analysis presented in Li and Pong (2015), in this section we will prove the following conclusions:
\(\sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert < \infty \);
\(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\) converges to \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\);
\((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is the KKT point of (33).
Firstly, utilizing the optimality conditions (36, 37, 38), we have that
Further, combining with (40), there exists a constant \(C>0\) such that
where \(\text {dist}(\cdot , \cdot )\) denotes the distance between a vector and a set of vectors. Hereafter we denote \(\partial _{(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda })} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) as \(\partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) for clarity. Besides, the relation (44) implies that there is a constant \(D \in (0, \epsilon - \frac{1}{\rho })\) such that
Moreover, the relation (49) implies that \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \}\) is lower bounded along the convergent sub-sequence \(\{ (\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \}\). Combining this with its decreasing property, the limit of \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \}\) exists. Next, we will show that
To prove it, we utilize the fact that \(\mathbf {y}^{k+1}\) is the minimizer of \( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \), such that
Combining the above relation, (58) and the continuity of \(\mathcal {L}_{\rho , \epsilon }\) w.r.t. \(\hat{\mathbf {x}}\) and \(\varvec{\lambda }\), the following relation holds along the sub-sequence \(\{ (\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \}\) that converges to \( (\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}) \),
According to (58), the sub-sequence \(\{ (\mathbf {y}^{k_i+1}, \hat{\mathbf {x}}^{k_i+1}, \varvec{\lambda }^{k_i+1}) \}\) also converges to \( (\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}) \). Then, utilizing the lower semi-continuity of \(\mathcal {L}_{\rho , \epsilon }\), we have
Combining (66) with (67), we know the existence of the limit of the sequence \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \}\), which proves the relation (64).
As \(\mathcal {L}_{\rho , \epsilon }\) is a KL function, according to Definition 1, it has the following properties:
There exist a constant \(\eta \in (0, \infty ]\), a continuous concave function \(\varphi : [0, \eta ) \rightarrow \mathbb {R}_{+}\) that is differentiable on \((0, \eta )\) with positive derivatives, and a neighbourhood \(\mathcal {V}\) of \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\).
For all \((\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) \in \mathcal {V}\) satisfying \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) < l^* + \eta \), we have
$$\begin{aligned} \varphi '( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) - l^* ) \text {dist}(0, \partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) ) \ge 1. \end{aligned}$$(68)
Then, we define the following neighborhood sets:
where \(\zeta > 0\) is a small constant.
Utilizing the relations (37) and (38), as well as P2.1, we obtain that for any \(k\ge 1\), the following relation holds:
Also, the relations (37) and (38) imply that for any \(k\ge 1\), we have
Moreover, the relation (58) implies that \(\exists N_0 \ge 1\) such that \(\forall k \ge N_0\), we have
Similar to (56), the full row rank of \(\mathbf {B}\) implies \(\Vert \mathbf {y}^k - \mathbf {y}^* \Vert \le \bar{M} \Vert \mathbf {B}(\mathbf {y}^k - \mathbf {y}^*) \Vert \). Then, plugging (73) into (72), we obtain that
for any \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(k \ge N_0\). Combining (71) and (74), we know that if \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(k \ge N_0\), then \((\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \in \mathcal {V}_{\zeta } \subseteq \mathcal {V}\).
Moreover, (44) and (64) imply that \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \ge l^*, \forall k \ge 1\). Besides, as \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) is a cluster point, there exists \(N \ge N_0\) such that the following relations hold:
Next, we will show that if \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) < l^* + \eta \) hold for some fixed \(k \ge N_0\), then the following relation holds
To prove (76), we utilize the fact that \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}, k \ge N_0\) implies that \((\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \in \mathcal {V}_{\zeta } \subseteq \mathcal {V}\). Combining this with \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) < l^* + \eta \), we obtain that
Combining the relations (62), (63) and (77), as well as the concavity of \(\varphi \), we obtain that
for all such k. Taking the square root on both sides of (78) and utilizing the fact that \(a+b \ge 2 \sqrt{a b}\), we prove (76).
We then prove \(\forall k \ge N, \hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds. This claim can be proved through induction. Obviously it is true for \(k=N\) by construction, as shown in (75). For \(k=N+1\), we have
where the first inequality utilizes (63), and the last inequality follows from the last relation in (75). Thus, \(\hat{\mathbf {x}}^{N+1} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds.
Next, we suppose that \(\hat{\mathbf {x}}^N, \ldots , \hat{\mathbf {x}}^{N+t-1} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) for some \(t>1\), and we need to prove that \(\hat{\mathbf {x}}^{N+t} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) also holds, i.e.,
where \(\varphi ^{N+i} = \varphi (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N+i}, \hat{\mathbf {x}}^{N+i}, \varvec{\lambda }^{N+i})-l^*) \) and \(\mathcal {L}_{\rho , \epsilon }^N = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N}, \hat{\mathbf {x}}^{N}, \varvec{\lambda }^{N})\). The second inequality follows from (76). The fourth inequality follows from (63). The fifth inequality utilizes the fact that \(\mathcal {L}_{\rho , \epsilon }^{N+1} > l^*\), and the last inequality follows from the last relation in (75). Thus, \(\hat{\mathbf {x}}^{N+t} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds. We have proved that \(\forall k \ge N, \hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds by induction.
Then, according to \(\forall k \ge N, \hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\), we can sum both sides of (76) from \(k=N\) to \(\infty \), to obtain that
which implies that \(\sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert < \infty \) holds. Thus \(\{ \hat{\mathbf {x}}^k \}\) converges. The convergence of \(\{ \mathbf {y}^k \}\) follows from \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} -\varvec{\lambda }^k)\) in (38) and (58), as well as the surjectivity of \(\mathbf {B}\) (i.e., full row rank). The convergence of \(\{ \varvec{\lambda }^k \}\) follows from \(\nabla \hat{f}(\hat{\mathbf {x}}^{k+1}) = - \hat{\mathbf {A}}^\top \varvec{\lambda }^{k+1}\) in (37) and the surjectivity of \(\hat{\mathbf {A}}\) (i.e., full row rank). Consequently, \(\{ \mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k \}\) converges to the cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\). The conclusion that \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is the KKT point of Problem (33) has been proved in Sect. A.4.
1.6 \(\epsilon \)-KKT Point of the Original LS–LP Problem
Proposition 1
The globally converged solution \((\mathbf {y}^*, \mathbf {x}^*, \varvec{\lambda }^*)\) produced by the ADMM algorithm for the perturbed LS–LP problem (33) is the \(\epsilon \)-KKT solution to the original LS–LP problem (32).
Proof
The globally converged solution \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) to the perturbed LS–LP problem (33) satisfies the following relations:
Recalling the definitions \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\), \(\hat{\mathbf {x}} = [\mathbf {x}; \bar{\mathbf {x}}]\) and \(\hat{f}(\hat{\mathbf {x}}) = f(\mathbf {x}) + \frac{\epsilon }{2} \hat{\mathbf {x}}^\top \hat{\mathbf {x}}\), the above relations imply that
where we utilize the boundedness of \(\hat{\mathbf {x}}^*\). Thus, according to Definition 2, the globally converged point \((\mathbf {y}^*, \mathbf {x}^*)\) is the \(\epsilon \)-KKT solution to the original LS–LP problem (32). \(\square \)
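To make the blockwise expansion behind this proof explicit (our own sketch from the definitions above, since the displayed relations are omitted here), the stationarity and feasibility conditions of the perturbed problem split as
$$\begin{aligned} \nabla \hat{f}(\hat{\mathbf {x}}^*) = - \hat{\mathbf {A}}^\top \varvec{\lambda }^* \;\Longleftrightarrow \; \nabla f(\mathbf {x}^*) + \epsilon \mathbf {x}^* = - \mathbf {A}^\top \varvec{\lambda }^* \;\; \text {and} \;\; \epsilon \bar{\mathbf {x}}^* = - \epsilon \varvec{\lambda }^*, \qquad \hat{\mathbf {A}} \hat{\mathbf {x}}^* = \mathbf {B}\mathbf {y}^* \;\Longleftrightarrow \; \mathbf {A}\mathbf {x}^* - \mathbf {B}\mathbf {y}^* = - \epsilon \bar{\mathbf {x}}^*, \end{aligned}$$
so that \(\Vert \nabla f(\mathbf {x}^*) + \mathbf {A}^\top \varvec{\lambda }^* \Vert = \epsilon \Vert \mathbf {x}^* \Vert = O(\epsilon )\) and \(\Vert \mathbf {A}\mathbf {x}^* - \mathbf {B}\mathbf {y}^* \Vert = \epsilon \Vert \bar{\mathbf {x}}^* \Vert = O(\epsilon )\), again using the boundedness of \(\hat{\mathbf {x}}^*\).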
1.7 Convergence Rate
Lemma 3
First, without loss of generality, we can assume that \(l^* = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*) = 0\) (e.g., one can replace \(l_k = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k)\) by \(l_k - l^*\)). We further assume that \(\mathcal {L}_{\rho , \epsilon }\) has the KL property at \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) with the concave function \(\varphi (s) = c s^{1-p}\), where \(p \in [0, 1), c>0\). Consequently, we obtain the following conclusions:
- (i)
If \(p=0\), then \(\{(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k)\}_{k = 1, \ldots , \infty }\) converges in finitely many steps;
- (ii)
If \(p \in (0,\frac{1}{2}]\), then there exist \(c>0\) and \(\tau \in (0,1)\) such that \(\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert \le c \tau ^k\);
- (iii)
If \(p \in (\frac{1}{2},1)\), then there exists \(c>0\) such that \(\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert \le c k^{-\frac{1-p}{2p -1}} \).
Proof
(i) If \(p=0\), we define a subset \(H = \{k \in \mathbb {N}: \hat{\mathbf {x}}^{k} \ne \hat{\mathbf {x}}^{k+1} \}\). If \(k \in H\) is sufficiently large, then there exists \(C_3>0\) such that
Combining with (63), we have
If the subset H were infinite, this would contradict the fact that \(l_k - l_{k+1} \rightarrow 0\) as \(k\rightarrow \infty \). Thus, H is a finite subset, leading to the conclusion that \(\{\hat{\mathbf {x}}^k\}_{k \in \mathbb {N}}\) converges in finitely many steps. Recalling the relationships between \(\hat{\mathbf {x}}^k\) and \(\mathbf {y}^k, \varvec{\lambda }^k\) (see the descriptions under (81)), we also obtain that \(\{ \mathbf {y}^k, \varvec{\lambda }^k \}_{k \in \mathbb {N}}\) converges in finitely many steps.
By defining \(\bigtriangleup _k = \sum _{j=k}^{\infty } \Vert \hat{\mathbf {x}}^{j+1} - \hat{\mathbf {x}}^j \Vert \), the inequality (81) can be rewritten as follows
Besides, the KL property and \(l^* = 0\) give that
Combining with (62), we obtain
Then, inserting (89) into (87), we obtain
(ii) If \(p \in (0, \frac{1}{2}]\), then \(\frac{1-p}{p}\ge 1\). Besides, since \((\bigtriangleup _{k-1} - \bigtriangleup _k) \rightarrow 0\) when \(k \rightarrow \infty \), there exists an integer \(K_0\) such that \((\bigtriangleup _{k-1} - \bigtriangleup _k) < 1\) for all \(k \ge K_0\), which implies \((\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} \le (\bigtriangleup _{k-1} - \bigtriangleup _k)\). Inserting it into (90), we obtain that
It is easy to deduce that \(\bigtriangleup _k \le (\bigtriangleup _{K_0} \tau ^{-K_0}) \tau ^k = \frac{c}{2} \tau ^k\), with c being a positive constant. Note that the superscript k in \(\tau ^k\) denotes the k-th power of \(\tau \), rather than an iteration index. Combining with \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \bigtriangleup _k\), it is easy to obtain that \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \frac{c}{2} \tau ^k\) with \(\tau \in (0,1)\) and c being a positive constant. Then, we have
(iii) If \(p \in (\frac{1}{2}, 1)\), then \(\frac{1-p}{p}<1\), and hence \((\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} > (\bigtriangleup _{k-1} - \bigtriangleup _k)\) for all sufficiently large k. Inserting it into (90), we obtain that
It has been shown in Theorem 2 of Attouch and Bolte (2009) that the above inequality implies \(\bigtriangleup _k \le \frac{c}{2} k^{- \frac{1-p}{2p-1}}\), with c being a positive constant. Since \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \bigtriangleup _k\), we have that \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \frac{c}{2} k^{- \frac{1-p}{2p-1}}\). Then, we have
\(\square \)
Proposition 2
We adopt the same assumptions as in Lemma 3. Then,
- (i)
If \(p =0\), then we will obtain the \(\epsilon \)-KKT solution to the LS–LP problem in finitely many steps.
- (ii)
If \(p \in (0, \frac{1}{2}]\), then we will obtain the \(\epsilon \)-KKT solution to the LS–LP problem after at most \(O\big (\log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2\big )\) steps.
- (iii)
If \(p \in ( \frac{1}{2}, 1)\), then we will obtain the \(\epsilon \)-KKT solution to the LS–LP problem after at most \( O\big ( (\frac{1}{\epsilon })^{\frac{4p-2}{1-p}}\big )\) steps.
Proof
The conclusion (i) directly holds from Lemma 3(i).
According to the optimality condition (36), we have
According to the optimality condition (38) and the relation (40), we obtain that
According to Lemma 3, we have
- (ii)
If \(p \in (0, \frac{1}{2}]\), then
$$\begin{aligned} O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2 \le O\left( \frac{1}{\epsilon }\right) \tau ^k \le O(\epsilon ) \;\Rightarrow \; k \ge O\left( \log _{\frac{1}{\tau }}\left( \frac{1}{\epsilon }\right) ^2\right) , \end{aligned}$$ which means that when \(k \ge O\big (\log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2\big )\), we will obtain the \(\epsilon \)-KKT solution to the perturbed LS–LP problem, i.e., the \(\epsilon \)-KKT solution to the original LS–LP problem.
- (iii)
If \(p \in ( \frac{1}{2}, 1)\), then
$$\begin{aligned} O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2 \le O\left( \frac{1}{\epsilon }\right) k^{- \frac{1-p}{2p-1}} \le O(\epsilon ) \;\Rightarrow \; k \ge O\left( \left( \frac{1}{\epsilon }\right) ^{\frac{4p-2}{1-p}}\right) , \end{aligned}$$ which means that when \(k \ge O\big ( (\frac{1}{\epsilon })^{\frac{4p-2}{1-p}}\big )\), we will obtain the \(\epsilon \)-KKT solution to the perturbed LS–LP problem, i.e., the \(\epsilon \)-KKT solution to the original LS–LP problem. \(\square \)
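For a rough sense of scale (our own arithmetic, not a result from the paper): with \(\epsilon = 10^{-3}\) and \(\tau = 0.9\), the bound in case (ii) of Proposition 2 requires roughly \(k \ge \log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2 = 2\ln (10^3)/\ln (10/9) \approx 131\) iterations.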