
MAP Inference Via \(\ell _2\)-Sphere Linear Program Reformulation

International Journal of Computer Vision

Abstract

Maximum a posteriori (MAP) inference is an important task for graphical models. Due to complex dependencies among variables in realistic models, finding an exact solution for MAP inference is often intractable. Thus, many approximation methods have been developed, among which the linear programming (LP) relaxation based methods show promising performance. However, one major drawback of LP relaxation is that it may produce fractional solutions. Instead of presenting a tighter relaxation, in this work we propose a continuous but equivalent reformulation of the original MAP inference problem, called LS–LP. We add an \(\ell _2\)-sphere constraint to the original LP relaxation; its intersection with the local marginal polytope is equivalent to the set of all valid integer label configurations. Thus, LS–LP is equivalent to the original MAP inference problem. We propose a perturbed alternating direction method of multipliers (ADMM) algorithm to optimize the LS–LP problem, by adding a sufficiently small perturbation \(\epsilon \) to the objective function and constraints. We prove that the perturbed ADMM algorithm globally converges to an \(\epsilon \)-Karush–Kuhn–Tucker (\(\epsilon \)-KKT) point of the LS–LP problem, and we also analyze the convergence rate. Experiments on several benchmark datasets from the Probabilistic Inference Challenge (PIC 2011) and OpenGM 2 show competitive performance of our proposed method against state-of-the-art MAP inference methods.



References

  • Attouch, H., & Bolte, J. (2009). On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1–2), 5–16.

  • Attouch, H., Bolte, J., Redont, P., & Soubeyran, A. (2010). Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Mathematics of Operations Research, 35(2), 438–457.

  • Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological), 48(3), 259–279.

  • Bolte, J., Daniilidis, A., & Lewis, A. (2007). The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4), 1205–1223.

  • Bolte, J., Daniilidis, A., Lewis, A., & Shiota, M. (2007). Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2), 556–572.

  • Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.

  • Elidan, G., Globerson, A., & Heinemann, U. (2012). PASCAL 2011 probabilistic inference challenge. Retrieved July 15, 2020, from http://www.cs.huji.ac.il/project/PASCAL/index.php.

  • Fu, Q., Wang, H., & Banerjee, A. (2013). Bethe-ADMM for tree decomposition based parallel MAP inference. In Uncertainty in artificial intelligence (p. 222). Citeseer.

  • Globerson, A., & Jaakkola, T. S. (2008). Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS (pp. 553–560).

  • Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV (pp. 1–8). IEEE.

  • Jaimovich, A., Elidan, G., Margalit, H., & Friedman, N. (2006). Towards an integrated protein–protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2), 145–164.

  • Johnson, J. K., Malioutov, D. M., & Willsky, A. S. (2007). Lagrangian relaxation for MAP estimation in graphical models. ArXiv preprint arXiv:0710.0013.

  • Jojic, V., Gould, S., & Koller, D. (2010). Accelerated dual decomposition for MAP inference. In ICML (pp. 503–510).

  • Kappes, J. H., Andres, B., Hamprecht, F. A., Schnörr, C., Nowozin, S., Batra, D., et al. (2015). A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115, 155–184.

  • Kappes, J. H., Savchynskyy, B., & Schnörr, C. (2012). A bundle approach to efficient MAP-inference by Lagrangian relaxation. In CVPR (pp. 1688–1695). IEEE.

  • Karush, W. (1939). Minima of functions of several variables with inequalities as side constraints. M.Sc. dissertation, Department of Mathematics, University of Chicago.

  • Kelley, J. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4), 703–712.

  • Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. Cambridge, MA: MIT Press.

  • Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1568–1583.

  • Komodakis, N., Paragios, N., & Tziritas, G. (2007). MRF optimization via dual decomposition: Message-passing revisited. In ICCV (pp. 1–8). IEEE.

  • Kschischang, F. R., Frey, B. J., & Loeliger, H. A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

  • Kuhn, H. W., & Tucker, A. W. (2014). Nonlinear programming. In Traces and emergence of nonlinear programming (pp. 247–258). Springer.

  • Land, A. H., & Doig, A. G. (1960). An automatic method of solving discrete programming problems. Econometrica, 28, 497–520.

  • Laurent, M., & Rendl, F. (2002). Semidefinite programming and integer programming. Centrum voor Wiskunde en Informatica.

  • Li, G., & Pong, T. K. (2015). Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4), 2434–2460.

  • Łojasiewicz, S. (1963). Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117, 87–89.

  • Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., & Xing, E. P. (2011). An augmented Lagrangian approach to constrained MAP inference. In ICML.

  • Martins, A. F., Figueiredo, M. A., Aguiar, P. M., Smith, N. A., & Xing, E. P. (2015). AD3: Alternating directions dual decomposition for MAP inference in graphical models. Journal of Machine Learning Research, 16(1), 495–545.

  • Meshi, O., & Globerson, A. (2011). An alternating direction method for dual MAP LP relaxation. In Joint European conference on machine learning and knowledge discovery in databases (pp. 470–483). Springer.

  • Meshi, O., Mahdavi, M., & Schwing, A. (2015). Smooth and strong: MAP inference with linear convergence. In NIPS (pp. 298–306).

  • Otten, L., & Dechter, R. (2012). Anytime AND/OR depth-first search for combinatorial optimization. AI Communications, 25(3), 211–227.

  • Otten, L., Ihler, A., Kask, K., & Dechter, R. (2012). Winning the PASCAL 2011 MAP challenge with enhanced AND/OR branch-and-bound. In NIPS workshop on DISCML. Citeseer.

  • Savchynskyy, B., Schmidt, S., Kappes, J., & Schnörr, C. (2012). Efficient MRF energy minimization via adaptive diminishing smoothing. ArXiv preprint arXiv:1210.4906.

  • Schwing, A. G., Hazan, T., Pollefeys, M., & Urtasun, R. (2012). Globally convergent dual MAP LP relaxation solvers using Fenchel–Young margins. In NIPS (pp. 2384–2392).

  • Schwing, A. G., Hazan, T., Pollefeys, M., & Urtasun, R. (2014). Globally convergent parallel MAP LP relaxation solver using the Frank–Wolfe algorithm. In ICML (pp. 487–495).

  • Sontag, D. A. (2010). Approximate inference in graphical models using LP relaxations. Ph.D. thesis, Massachusetts Institute of Technology.

  • Sontag, D. A., Li, Y., et al. (2012). Efficiently searching for frustrated cycles in MAP inference. In UAI.

  • Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2005). MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11), 3697–3717.

  • Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1–305.

  • Wang, Y., Yin, W., & Zeng, J. (2017). Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1), 29–63.

  • Wu, B., & Ghanem, B. (2019). \(\ell _p\)-box ADMM: A versatile framework for integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1695–1708.

  • Xu, Y., & Yin, W. (2013). A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3), 1758–1789.


Author information

Correspondence to Li Shen.

Additional information

Communicated by Julien Mairal.


Baoyuan Wu was partially supported by Tencent AI Lab and King Abdullah University of Science and Technology (KAUST). Li Shen was supported by Tencent AI Lab. Bernard Ghanem was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR). Tong Zhang was supported by the Hong Kong University of Science and Technology (HKUST). Li Shen is the corresponding author.

Convergence Analysis

To facilitate the convergence analysis, we first restate some equations and notation defined in Sect. 5. Problem (11) can be simplified to the following general form:

$$\begin{aligned} \hbox {LS}{-}\hbox {LP}(\varvec{\theta }) = \min _{\mathbf {x}, \mathbf {y}} f(\mathbf {x}) + h(\mathbf {y}),\quad \text {s.t.} ~ \mathbf {A}\mathbf {x}= \mathbf {B}\mathbf {y}. \end{aligned}$$
(32)

We explain the components of (32) in three parts:

  1. Variables: \(\mathbf {x}= [ \varvec{\mu }_1; \ldots ; \varvec{\mu }_{|V|} ] \in \mathbb {R}^{\sum _{i}^{ V} |\mathcal {X}_i|}\) concatenates all variable nodes \(\varvec{\mu }_V\). \(\mathbf {y}= [\mathbf {y}_1; \ldots ; \mathbf {y}_{|V|}]\) with \(\mathbf {y}_i = [\varvec{\upsilon }_i; \varvec{\mu }_{\alpha _{i,1}}; \ldots ; \varvec{\mu }_{\alpha _{i,|\mathcal {N}_i|}}] \in \mathbb {R}^{|\mathcal {X}_i| + \sum _{\alpha }^{\mathcal {N}_i} |\mathcal {X}_{\alpha }|}\). \(\mathbf {y}\) concatenates all factor nodes \(\varvec{\mu }_F\) and the extra variable nodes \(\varvec{\upsilon }\); \(\mathbf {y}_i\) concatenates the factor nodes and the extra variable node connected to the i-th variable node \(\varvec{\mu }_i\). \(\mathcal {N}_i\) denotes the set of neighboring factor nodes connected to the i-th variable node; the subscript \(\alpha _{i,j}\) indicates the j-th factor connected to the i-th variable, with \(i \in V\) and \(j \in \mathcal {N}_i\).

  2. Objective functions: \(f(\mathbf {x})= \mathbf {w}_{\mathbf {x}}^\top \mathbf {x}\) with \(\mathbf {w}_{\mathbf {x}} = - [\varvec{\theta }_1; \ldots ; \varvec{\theta }_{|V|}]\). \(h(\mathbf {y}) = g(\mathbf {y}) + \mathbf {w}_{\mathbf {y}}^\top \mathbf {y}\), with \(\mathbf {w}_{\mathbf {y}} = [\mathbf {w}_1; \ldots ; \mathbf {w}_{|V|}]\) and \(\mathbf {w}_{i} = -[\varvec{0}; \frac{1}{|\mathcal {N}_{\alpha _{i,1}}|} \varvec{\theta }_{\alpha _{i,1}}; \ldots ; \frac{1}{|\mathcal {N}_{\alpha _{i,|\mathcal {N}_i|}}|} \varvec{\theta }_{\alpha _{i,|\mathcal {N}_i|}}]\), where \(\mathcal {N}_{\alpha } = \{ i \mid (i, \alpha ) \in E\}\) is the set of neighboring variable nodes connected to the \(\alpha \)-th factor. \(g(\mathbf {y}) = \mathbb {I}(\varvec{\upsilon } \in \mathcal {S}) + \sum _{\alpha \in F} \mathbb {I}(\varvec{\mu }_{\alpha } \in \Delta ^{|\mathcal {X}_{\alpha }|})\), with \(\mathbb {I}(a)\) being the indicator function: \(\mathbb {I}(a)=0\) if a is true, otherwise \(\mathbb {I}(a)=\infty \).

  3. Constraint matrices: \(\mathbf {A}= \text {diag}(\mathbf {A}_1, \ldots , \mathbf {A}_i, \ldots , \mathbf {A}_{|V|})\) with \(\mathbf {A}_i = [\mathbf {I}_{|\mathcal {X}_i|}; \ldots ; \mathbf {I}_{|\mathcal {X}_i|} ] \in \{0,1\}^{(|\mathcal {N}_i| +1)|\mathcal {X}_i| \times |\mathcal {X}_i|}\). \(\mathbf {B}= \text {diag}(\mathbf {B}_1, \ldots , \mathbf {B}_i, \ldots , \mathbf {B}_{|V|})\), with \(\mathbf {B}_i = \text {diag}(\mathbf {I}_{|\mathcal {X}_i|}, \mathbf {M}_{i, \alpha _{i,1}}, \ldots , \mathbf {M}_{i, \alpha _{i, |\mathcal {N}_i|}} )\). \(\mathbf {A}\) summarizes all constraints on \(\varvec{\mu }_V\), while \(\mathbf {B}\) collects all constraints on \(\varvec{\mu }_F\) and \(\varvec{\upsilon }\).

Note that Problem (32) has a clear structure with two groups of variables, corresponding to the augmented factor graph (see Fig. 1c).

According to the analysis presented in Wang et al. (2017), a sufficient condition to ensure the global convergence of the ADMM algorithm for the problem \(\hbox {LS}{-}\hbox {LP}(\varvec{\theta })\) is that \(\text {Im}(\mathbf {B}) \subseteq \text {Im}(\mathbf {A})\), with \(\text {Im}(\mathbf {A})\) being the image (i.e., the column space) of \(\mathbf {A}\). However, \(\mathbf {A}\) in (32) has full column rank rather than full row rank, while \(\mathbf {B}\) has full row rank. To satisfy this sufficient condition, we introduce a sufficiently small perturbation into both the objective function and the constraint in (32), as follows

$$\begin{aligned} \hbox {LS}{-}\hbox {LP}(\varvec{\theta }; \epsilon ) = \min _{\hat{\mathbf {x}}, \mathbf {y}} \hat{f}(\hat{\mathbf {x}}) + h(\mathbf {y}),\quad \text {s.t.} ~ \hat{\mathbf {A}} \hat{\mathbf {x}} = \mathbf {B}\mathbf {y}, \end{aligned}$$
(33)

where \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\) with a sufficiently small constant \(\epsilon > 0\); hence \(\hat{\mathbf {A}}\) has full row rank. \(\hat{\mathbf {x}} = [\mathbf {x}; \bar{\mathbf {x}}]\), with \(\bar{\mathbf {x}} = [\bar{\mathbf {x}}_1; \ldots ; \bar{\mathbf {x}}_{|V|}] \in \mathbb {R}^{\sum _i^{V} (|\mathcal {N}_i|+1) |\mathcal {X}_i|}\) and \(\bar{\mathbf {x}}_i = [\varvec{\mu }_i; \ldots ; \varvec{\mu }_i] \in \mathbb {R}^{(|\mathcal {N}_i|+1) |\mathcal {X}_i|}\). \(\hat{f}(\hat{\mathbf {x}}) = f(\mathbf {x}) + \frac{1}{2}\epsilon \hat{\mathbf {x}}^\top \hat{\mathbf {x}}\). Consequently, \(\text {Im}(\hat{\mathbf {A}}) \equiv \text {Im}(\mathbf {B}) \subseteq \mathbb {R}^{\text {rank of } ~ \hat{\mathbf {A}}}\), as both \(\hat{\mathbf {A}}\) and \(\mathbf {B}\) have full row rank. Hence the sufficient condition \(\text {Im}(\mathbf {B}) \subseteq \text {Im}(\hat{\mathbf {A}})\) holds.
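To make the role of the perturbation concrete, the following minimal sketch builds a small stand-in for \(\mathbf {A}\) (a single stack of identity blocks, mimicking one \(\mathbf {A}_i = [\mathbf {I}; \ldots ; \mathbf {I}]\)) and checks that appending \(\epsilon \mathbf {I}\) turns a full-column-rank matrix into a full-row-rank one. The sizes and single-block structure are illustrative assumptions, not the factor-graph construction itself.

```python
import numpy as np

# Illustrative check of the perturbation A_hat = [A, eps*I] in (33).
# A below is one stack of identity blocks (a stand-in for a single A_i):
# it has full column rank but not full row rank.
eps = 1e-3
A = np.vstack([np.eye(3)] * 4)                    # 12 x 3 stack of identities
A_hat = np.hstack([A, eps * np.eye(A.shape[0])])  # 12 x 15

print(np.linalg.matrix_rank(A))      # 3  -> full column rank only
print(np.linalg.matrix_rank(A_hat))  # 12 -> full row rank, so Im(A_hat) = R^12
```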

The augmented Lagrangian function of (33) is formulated as

$$\begin{aligned} \mathcal {L}_{\rho , \epsilon }(\hat{\mathbf {x}}, \mathbf {y}, \varvec{\lambda })&= \hat{f}(\hat{\mathbf {x}}) + h(\mathbf {y}) + \varvec{\lambda }^\top (\hat{\mathbf {A}} \hat{\mathbf {x}} - \mathbf {B}\mathbf {y})\nonumber \\&\quad + \frac{\rho }{2} \Vert \hat{\mathbf {A}} \hat{\mathbf {x}} - \mathbf {B}\mathbf {y}\Vert _2^2 \end{aligned}$$
(34)

The updates of the ADMM algorithm to optimize (33) are as follows

$$\begin{aligned} \left\{ \begin{array}{l} \mathbf {y}^{k+1} = \mathop {\arg \!\min }_{\mathbf {y}} \mathcal {L}_{\rho , \epsilon }(\hat{\mathbf {x}}^k, \mathbf {y}, \varvec{\lambda }^k),\\ \hat{\mathbf {x}}^{k+1} = \mathop {\arg \!\min }_{\hat{\mathbf {x}}} \mathcal {L}_{\rho , \epsilon }(\hat{\mathbf {x}}, \mathbf {y}^{k+1}, \varvec{\lambda }^k),\\ \varvec{\lambda }^{k+1} = \varvec{\lambda }^k + \rho (\hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1}). \end{array}\right. \end{aligned}$$
(35)
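For concreteness, here is a minimal runnable sketch of the updates (35) on a toy instance, with \(h\) replaced by the indicator of the box \([0,1]^m\) for illustration (the paper's \(h\) involves the \(\ell _2\)-sphere and simplex indicators), \(\mathbf {B} = \mathbf {I}\), and a random full-row-rank \(\hat{\mathbf {A}}\); all problem data are assumptions, not the factor-graph construction. The loop also monitors the monotone decrease of \(\mathcal {L}_{\rho , \epsilon }\) established in Sect. 1.2 below.

```python
import numpy as np

# Minimal sketch of the perturbed ADMM updates (35) on a toy instance:
# f_hat(x) = w^T x + (eps/2)||x||^2, h(y) = indicator of the box [0,1]^m,
# B = I, and A_hat a random full-row-rank matrix (illustrative assumptions).
rng = np.random.default_rng(0)
m, n = 4, 6
eps = 1e-2
rho = 2.0 / eps                              # rho > 1/eps, as in the Remark below
A_hat = rng.standard_normal((m, n))          # full row rank with probability 1
w = rng.standard_normal(n)
x, y, lam = np.zeros(n), np.zeros(m), np.zeros(m)

def aug_lagrangian(x, y, lam):               # the function in (34), with B = I
    r = A_hat @ x - y
    return w @ x + 0.5 * eps * x @ x + lam @ r + 0.5 * rho * r @ r

H = eps * np.eye(n) + rho * A_hat.T @ A_hat  # Hessian of the x-subproblem
prev = aug_lagrangian(x, y, lam)
for k in range(200):
    # y-update: the prox of the box indicator is a projection (clipping)
    y = np.clip(A_hat @ x + lam / rho, 0.0, 1.0)
    # x-update: strongly convex quadratic subproblem -> one linear solve
    x = np.linalg.solve(H, rho * A_hat.T @ y - A_hat.T @ lam - w)
    # dual update, as in the last line of (35)
    lam = lam + rho * (A_hat @ x - y)
    cur = aug_lagrangian(x, y, lam)
    if k > 0:
        assert cur <= prev + 1e-8            # monotone decrease, cf. (39)/(44)
    prev = cur
print("primal residual:", np.linalg.norm(A_hat @ x - y))
```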

The optimality conditions of the variable sequence \((\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) generated above are

$$\begin{aligned}&\mathbf {B}^\top \varvec{\lambda }^k + \rho \mathbf {B}^\top (\hat{\mathbf {A}} \hat{\mathbf {x}}^k - \mathbf {B}\mathbf {y}^{k+1})= \mathbf {B}^\top \varvec{\lambda }^{k+1} \nonumber \\&\quad - \rho \mathbf {B}^\top \hat{\mathbf {A}} (\hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k) \in \partial h(\mathbf {y}^{k+1}),\end{aligned}$$
(36)
$$\begin{aligned}&\nabla \hat{f}(\hat{\mathbf {x}}^{k+1}) + \hat{\mathbf {A}}^\top \varvec{\lambda }^k + \rho \hat{\mathbf {A}}^\top (\hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1})\nonumber \\&\quad = \nabla \hat{f}(\hat{\mathbf {x}}^{k+1}) + \hat{\mathbf {A}}^\top \varvec{\lambda }^{k+1} = \varvec{0},\end{aligned}$$
(37)
$$\begin{aligned}&\frac{1}{\rho } (\varvec{\lambda }^{k+1} - \varvec{\lambda }^{k}) = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1}. \end{aligned}$$
(38)

The convergence of this perturbed ADMM algorithm for the LS–LP problem is summarized in Theorem 2. The detailed proof is presented in the following sub-sections sequentially. Note that hereafter \(\Vert \cdot \Vert \) indicates the \(\ell _2\) norm for a vector, or the Frobenius norm for a matrix; \(\mathcal {A}_1 \succeq \mathcal {A}_2\) represents that \(\mathcal {A}_1 - \mathcal {A}_2\) is positive semi-definite, with \(\mathcal {A}_1, \mathcal {A}_2\) being square matrices; \(\nabla \) denotes the gradient operator, \(\nabla ^2\) means the Hessian operator, and \(\partial \) is the sub-gradient operator; \(\mathbf {I}\) represents the identity matrix with compatible shape.

1.1 Properties

In this section, we present some important properties of the objective function and constraints in (33), which will be used in the subsequent convergence analysis.

Properties on objective functions (P1)

  • (P1.1) \(f\), \(h\) and \(\mathcal {L}_{\rho , \epsilon }\) are semi-algebraic, lower semi-continuous functions satisfying the Kurdyka–Łojasiewicz (KL) property, and h is closed and proper

  • (P1.2) There exist \(\mathcal {Q}_1, \mathcal {Q}_2\) such that \(\mathcal {Q}_1 \succeq \nabla ^2 \hat{f}(\hat{\mathbf {x}}) \succeq \mathcal {Q}_2\), \(\forall \hat{\mathbf {x}}\)

  • (P1.3) \(\lim \inf _{\Vert \hat{\mathbf {x}} \Vert \rightarrow \infty } \Vert \nabla \hat{f}(\hat{\mathbf {x}}) \Vert = \infty \)

Properties on constraint matrices (P2)

  • (P2.1) There exists \(\sigma > 0\) such that \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top \succeq \sigma \mathbf {I}\)

  • (P2.2) \(\mathcal {Q}_2 + \rho \hat{\mathbf {A}}^\top \hat{\mathbf {A}} \succeq \delta \mathbf {I}\) for some \(\rho , \delta > 0\), and \(\rho > \frac{1}{\epsilon } \)

  • (P2.3) There exists \(\mathcal {Q}_3 \succeq [\nabla ^2 \hat{f}(\hat{\mathbf {x}})]^2, \forall \hat{\mathbf {x}}\), and \(\delta \mathbf {I}\succ \frac{2}{\sigma \rho } \mathcal {Q}_3\)

  • (P2.4) Both \(\hat{\mathbf {A}}\) and \(\mathbf {B}\) are full row rank, and \(\text {Im}(\hat{\mathbf {A}}) \equiv \text {Im}(\mathbf {B}) \subseteq \mathbb {R}^{\text {rank of }~ \hat{\mathbf {A}}}\)

Remark

(1) Although the definition of the KL property (see Definition 1) is somewhat complex, it holds for many widely used functions, according to Xu and Yin (2013). Typical functions satisfying the KL property include: (a) real analytic functions, and any polynomial function such as \(\Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert \) belongs to this type; (b) locally strongly convex functions, such as the logistic loss function \(\log (1+\exp (-\mathbf {x}))\); (c) semi-algebraic functions, such as \(\Vert \mathbf {x}\Vert _1, \Vert \mathbf {x}\Vert _2,\Vert \mathbf {x}\Vert _{\infty }, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _1, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _2, \Vert \mathbf {H} \mathbf {x}- \mathbf {b} \Vert _{\infty }\) and the indicator function \(\mathbb {I}(\cdot )\). It is easy to verify that P1.1 holds in our problem. (2) Here we provide an instantiation of the above hyper-parameters satisfying the above properties. Firstly, it is easy to obtain that \(\nabla ^2 \hat{f}(\hat{\mathbf {x}}) = \epsilon \mathbf {I}\), and \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top = [\mathbf {A}, \epsilon \mathbf {I}] [\mathbf {A}, \epsilon \mathbf {I}]^\top = \mathbf {A}\mathbf {A}^\top + \epsilon ^2 \mathbf {I}\succ \epsilon ^2 \mathbf {I}\), as well as \(\rho \hat{\mathbf {A}}^\top \hat{\mathbf {A}} \succeq \epsilon \mathbf {I}\), when \(\epsilon \) is small enough and \(\rho > \frac{1}{\epsilon }\) (e.g., \(\rho = \frac{2}{\epsilon }\)). Then, the values \(\mathcal {Q}_1 = \mathcal {Q}_2 = \epsilon \mathbf {I}, \mathcal {Q}_3 = \epsilon ^2 \mathbf {I}, \delta = 2 \epsilon , \sigma = \epsilon ^2\) satisfy P1.2, P2.1, P2.2 and P2.3. Without loss of generality, we adopt these specific values to simplify the following analysis, keeping only \(\rho \) and \(\epsilon \).
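As a numerical sanity check of this instantiation (a sketch under the same illustrative stand-in for \(\mathbf {A}\) as before, not the factor-graph matrices), one can confirm that \(\hat{\mathbf {A}} \hat{\mathbf {A}}^\top = \mathbf {A}\mathbf {A}^\top + \epsilon ^2 \mathbf {I}\) and that its smallest eigenvalue equals \(\epsilon ^2\), matching \(\sigma = \epsilon ^2\) in P2.1; \(\nabla ^2 \hat{f} = \epsilon \mathbf {I}\) holds exactly because \(f\) is linear.

```python
import numpy as np

# Sanity check of P2.1 with sigma = eps^2 for A_hat = [A, eps*I].
eps = 1e-2
A = np.vstack([np.eye(2)] * 3)                 # illustrative 6 x 2 stand-in
A_hat = np.hstack([A, eps * np.eye(6)])

gram = A_hat @ A_hat.T
print(np.allclose(gram, A @ A.T + eps**2 * np.eye(6)))  # True: A A^T + eps^2 I
print(np.linalg.eigvalsh(gram).min(), eps**2)           # both ~ 1e-4
```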

1.2 Decrease of \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^{k})\)

In this section, we first prove the decreasing property of the augmented Lagrangian function, i.e.,

$$\begin{aligned} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) > \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}), \forall k. \end{aligned}$$
(39)

First, utilizing P2.1, P2.3 and (37), we obtain

$$\begin{aligned}&\epsilon ^2 \Vert \varvec{\lambda }^{k+1} - \varvec{\lambda }^{k} \Vert _2^2 \le \Vert \hat{\mathbf {A}}^\top (\varvec{\lambda }^{k+1} - \varvec{\lambda }^{k}) \Vert _2^2 \nonumber \\&\quad = \Vert \nabla \hat{f}(\hat{\mathbf {x}}^{k+1}) - \nabla \hat{f}(\hat{\mathbf {x}}^k) \Vert _2^2 = \epsilon ^2 \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2. \end{aligned}$$
(40)

Then, we have

$$\begin{aligned}&\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k})\nonumber \\&\quad = (\varvec{\lambda }^{k+1} - \varvec{\lambda }^{k})^\top (\hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1})\nonumber \\&\quad =\frac{1}{\rho } \Vert \varvec{\lambda }^{k+1} - \varvec{\lambda }^{k} \Vert _2^2 \le \frac{1}{\rho } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2 \end{aligned}$$
(41)

According to P1.2 and P2.2, \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}, \varvec{\lambda }^{k})\) is strongly convex with respect to \(\hat{\mathbf {x}}\), with modulus at least \(2 \epsilon \). Then, we have

$$\begin{aligned}&\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k}) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \nonumber \\&\quad \le - \epsilon \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2. \end{aligned}$$
(42)

As \(\mathbf {y}^{k+1}\) is the minimizer of \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k})\) over \(\mathbf {y}\), we directly have

$$\begin{aligned} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^k, \varvec{\lambda }^{k}) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \le 0. \end{aligned}$$
(43)

Combining (41), (42) and (43), we have

$$\begin{aligned}&\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k})\nonumber \\&\quad \le (\frac{1}{\rho } - \epsilon ) \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2 < 0, \end{aligned}$$
(44)

where the last inequality utilizes \(\rho > \frac{1}{\epsilon }\) from P2.2.

1.3 Boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\)

Next, we prove the boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\). We suppose that \(\rho \) is large enough such that there exists \(0<\gamma <\rho \) with

$$\begin{aligned} \inf _{\hat{\mathbf {x}}} \big ( \hat{f}(\hat{\mathbf {x}}) - \frac{1}{2 \epsilon ^2 \gamma } \Vert \nabla \hat{f}(\hat{\mathbf {x}}) \Vert _2^2 \big ) = f^* > - \infty . \end{aligned}$$
(45)

According to (44), for any \(k \ge 1\), we have

$$\begin{aligned}&\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) = \hat{f}(\hat{\mathbf {x}}^k) + h(\mathbf {y}^k) + \frac{\rho }{2} \Vert \hat{\mathbf {A}} \hat{\mathbf {x}}^k - \mathbf {B}\mathbf {y}^k \nonumber \\&\quad +\frac{\varvec{\lambda }^k}{\rho } \Vert _2^2 - \frac{1}{2\rho } \Vert \varvec{\lambda }^k \Vert _2^2 \le \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^1, \hat{\mathbf {x}}^1, \varvec{\lambda }^1) < \infty . \end{aligned}$$
(46)

Besides, according to P2.1, we have

$$\begin{aligned} \epsilon ^2 \Vert \varvec{\lambda }^k \Vert _2^2 \le \Vert \hat{\mathbf {A}}^\top \varvec{\lambda }^k \Vert _2^2 = \Vert \nabla \hat{f}(\hat{\mathbf {x}}^k) \Vert _2^2. \end{aligned}$$
(47)

Plugging (47) into (46), we obtain that

$$\begin{aligned} \infty&> \hat{f}(\hat{\mathbf {x}}^k) + h(\mathbf {y}^k) + \frac{\rho }{2} \Vert \hat{\mathbf {A}} \hat{\mathbf {x}}^k - \mathbf {B}\mathbf {y}^k + \frac{\varvec{\lambda }^k}{\rho } \Vert _2^2 \nonumber \\&\quad - \frac{1}{2 \epsilon ^2 \rho } \Vert \nabla \hat{f}(\hat{\mathbf {x}}^k) \Vert _2^2 \ge f^* + \frac{\frac{1}{\gamma } - \frac{1}{\rho }}{2 \epsilon ^2} \Vert \nabla \hat{f}(\hat{\mathbf {x}}^k) \Vert _2^2\nonumber \\&\quad + h(\mathbf {y}^k) + \frac{\rho }{2} \Vert \hat{\mathbf {A}} \hat{\mathbf {x}}^k - \mathbf {B}\mathbf {y}^k + \frac{\varvec{\lambda }^k}{\rho } \Vert _2^2. \end{aligned}$$
(48)

According to the coerciveness of \(\nabla \hat{f}(\hat{\mathbf {x}}^k)\) (i.e., P1.3), we obtain that \(\Vert \hat{\mathbf {x}}^k \Vert < \infty , \forall k\), i.e., the boundedness of \(\{\hat{\mathbf {x}}^k\}\). From (47), we know the boundedness of \(\{\varvec{\lambda }^k\}\). Besides, according to P2.4, \(\{\hat{\mathbf {A}}\hat{\mathbf {x}}^k\}\) is also bounded. From (38), we obtain the boundedness of \(\{\mathbf {B}\mathbf {y}^k\}\). Considering the full row rank of \(\mathbf {B}\) (i.e., P2.4), the boundedness of \(\{\mathbf {y}^k\}\) is proved.

1.4 Convergence of Residual

According to the boundedness of \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\), there is a sub-sequence \(\{\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}\}\) that converges to a cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\). Considering the lower semi-continuity of \(\mathcal {L}_{\rho , \epsilon }\) (i.e., P1.1), we have

$$\begin{aligned} \underset{i \rightarrow \infty }{\lim \inf } ~ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \ge \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*) > - \infty . \end{aligned}$$
(49)

Summing (44) from \(k = M, \ldots , N-1\) with \(M \ge 1\), we have

$$\begin{aligned}&\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^N, \hat{\mathbf {x}}^N, \varvec{\lambda }^N) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^M, \hat{\mathbf {x}}^M, \varvec{\lambda }^M) \nonumber \\&\quad \le \left( \frac{1}{\rho } - \epsilon \right) \sum _{k=M}^{N-1} \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2 < 0. \end{aligned}$$
(50)

Then, by setting \(N = k_i\) and \(M=1\), we have

$$\begin{aligned}&\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^1, \hat{\mathbf {x}}^1, \varvec{\lambda }^1) \nonumber \\&\quad \le \left( \frac{1}{\rho } - \epsilon \right) \sum _{k=1}^{k_i-1} \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2. \end{aligned}$$
(51)

Taking the limit on both sides of the above inequality, we obtain

$$\begin{aligned}&-\infty< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^1, \hat{\mathbf {x}}^1, \varvec{\lambda }^1) \nonumber \\&\quad \le \left( \frac{1}{\rho } - \epsilon \right) \sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2 < 0. \end{aligned}$$
(52)

This implies that

$$\begin{aligned} \lim _{k \rightarrow \infty }\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert = 0. \end{aligned}$$
(53)

Besides, according to (40), it is easy to obtain that

$$\begin{aligned} \lim _{k \rightarrow \infty }\Vert \varvec{\lambda }^{k+1} - \varvec{\lambda }^k \Vert = 0. \end{aligned}$$
(54)

Moreover, utilizing \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} - \varvec{\lambda }^k)\) from (38), we have

$$\begin{aligned} \Vert \mathbf {B}(\mathbf {y}^{k+1} - \mathbf {y}^{k}) \Vert&\le \Vert \hat{\mathbf {A}} (\hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k) \Vert + \frac{1}{\rho } \Vert (\varvec{\lambda }^{k+1} - \varvec{\lambda }^k) \Vert \nonumber \\&\quad + \frac{1}{\rho } \Vert (\varvec{\lambda }^{k} - \varvec{\lambda }^{k-1}) \Vert . \end{aligned}$$
(55)

Besides, as shown in Lemma 1 in Wang et al. (2017), the full row rank of \(\mathbf {B}\) (i.e., P2.4) implies that

$$\begin{aligned} \Vert \mathbf {y}^{k+1} - \mathbf {y}^k \Vert \le \bar{M} \Vert \mathbf {B}(\mathbf {y}^{k+1} - \mathbf {y}^k) \Vert , \end{aligned}$$
(56)

where \(\bar{M} > 0\) is a constant. Taking the limit on both sides of (55) and utilizing (56), we obtain

$$\begin{aligned} \lim _{k \rightarrow \infty }\Vert \mathbf {y}^{k+1} - \mathbf {y}^k \Vert = 0. \end{aligned}$$
(57)

Combining (53), (54) and (57), we obtain that

$$\begin{aligned} \lim _{k \rightarrow \infty } \Vert \mathbf {y}^{k+1} - \mathbf {y}^k \Vert _2^2 +\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2^2 + \Vert \varvec{\lambda }^{k+1} - \varvec{\lambda }^k \Vert _2^2 = 0. \end{aligned}$$
(58)

By setting \(k+1 = k_i\), plugging (53) into (36) and (54) into (37), and taking the limit \(k_i \rightarrow \infty \), we obtain the KKT conditions. This shows that the cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is a KKT point of \(\hbox {LS}{-}\hbox {LP}(\varvec{\theta }; \epsilon )\) (i.e., (33)).

1.5 Global Convergence

Inspired by the analysis presented in Li and Pong (2015), in this section we will prove the following conclusions:

  • \(\sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert < \infty \);

  • \(\{\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k\}\) converges to \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\);

  • \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is the KKT point of (33).

First, utilizing the optimality conditions (36)–(38), we have

$$\begin{aligned}&\partial _{\mathbf {y}} \mathcal {L}_{\rho , \epsilon }(\hat{\mathbf {x}}^{k+1}, \mathbf {y}^{k+1}, \varvec{\lambda }^{k+1}) = \partial h(\mathbf {y}^{k+1}) - \mathbf {B}^\top \varvec{\lambda }^{k+1} \nonumber \\&\quad - \rho \mathbf {B}^\top (\hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1}) \ni - \mathbf {B}^\top (\varvec{\lambda }^{k+1} - \varvec{\lambda }^k) \nonumber \\&\quad - \rho \mathbf {B}^\top \hat{\mathbf {A}} (\hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k),\end{aligned}$$
(59)
$$\begin{aligned}&\nabla _{\hat{\mathbf {x}}} \mathcal {L}_{\rho , \epsilon }(\hat{\mathbf {x}}^{k+1}, \mathbf {y}^{k+1}, \varvec{\lambda }^{k+1}) = \nabla _{\hat{\mathbf {x}}} \hat{f}(\hat{\mathbf {x}}^{k+1}) + \hat{\mathbf {A}}^\top \varvec{\lambda }^{k+1}\nonumber \\&\quad +\rho \hat{\mathbf {A}}^\top (\hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1}) = \hat{\mathbf {A}}^\top (\varvec{\lambda }^{k+1} - \varvec{\lambda }^k),\end{aligned}$$
(60)
$$\begin{aligned}&\nabla _{\varvec{\lambda }} \mathcal {L}_{\rho , \epsilon }(\hat{\mathbf {x}}^{k+1}, \mathbf {y}^{k+1}, \varvec{\lambda }^{k+1}) = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1} \nonumber \\&\quad = \frac{1}{\rho } (\varvec{\lambda }^{k+1} - \varvec{\lambda }^k). \end{aligned}$$
(61)

Further, combining with (40), there exists a constant \(C>0\) such that

$$\begin{aligned} \text {dist}\big (0, \partial _{(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda })} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\big ) \le C \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert , \end{aligned}$$
(62)

where \(\text {dist}(\cdot , \cdot )\) denotes the distance between a vector and a set of vectors. Hereafter we abbreviate \(\partial _{(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda })} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) as \(\partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1},\hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1})\) for clarity. Besides, the relation (44) implies that there is a constant \(D \in (0, \epsilon - \frac{1}{\rho })\) such that

$$\begin{aligned} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) {-} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) \ge D \Vert \hat{\mathbf {x}}^{k+1} {-} \hat{\mathbf {x}}^k \Vert _2^2. \end{aligned}$$
(63)

Moreover, the relation (49) implies that \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \}\) is lower bounded along the convergent sub-sequence \(\{ (\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \}\). Combining this with its decreasing property, the limit of \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \}\) exists. We now show that

$$\begin{aligned} \underset{k \rightarrow \infty }{\lim } \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) = l^* := \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{*}, \hat{\mathbf {x}}^*, \varvec{\lambda }^{*}). \end{aligned}$$
(64)

To prove this, we utilize the fact that \(\mathbf {y}^{k+1}\) is the minimizer of \( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \), which gives

$$\begin{aligned} \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}) \le \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{*}, \hat{\mathbf {x}}^{k}, \varvec{\lambda }^{k}). \end{aligned}$$
(65)

Combining the above relation, (58) and the continuity of \(\mathcal {L}_{\rho , \epsilon }\) w.r.t. \(\hat{\mathbf {x}}\) and \(\varvec{\lambda }\), the following relation holds along the sub-sequence \(\{ (\mathbf {y}^{k_i}, \hat{\mathbf {x}}^{k_i}, \varvec{\lambda }^{k_i}) \}\) that converges to \( (\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}) \),

$$\begin{aligned} \underset{i \rightarrow \infty }{\lim \sup } ~ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k_i+1}, \hat{\mathbf {x}}^{k_i+1}, \varvec{\lambda }^{k_i+1}) \le \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}). \end{aligned}$$
(66)

According to (58), the sub-sequence \(\{ (\mathbf {y}^{k_i+1}, \hat{\mathbf {x}}^{k_i+1}, \varvec{\lambda }^{k_i+1}) \}\) also converges to \( (\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}) \). Then, utilizing the lower semi-continuity of \(\mathcal {L}_{\rho , \epsilon }\), we have

$$\begin{aligned} \underset{i \rightarrow \infty }{\lim \inf } ~ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k_i+1}, \hat{\mathbf {x}}^{k_i+1}, \varvec{\lambda }^{k_i+1}) \ge \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{*}, \hat{\mathbf {x}}^{*}, \varvec{\lambda }^{*}). \end{aligned}$$
(67)

Combining (66) with (67), we obtain the existence of the limit of the sequence \(\{ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \}\), which proves the relation (64).

As \(\mathcal {L}_{\rho , \epsilon }\) is a KL function, according to Definition 1, it has the following properties:

  • There exist a constant \(\eta \in (0, \infty ]\), a continuous concave function \(\varphi : [0, \eta ) \rightarrow \mathbb {R}_{+}\) that is differentiable on \((0, \eta )\) with positive derivatives, and a neighbourhood \(\mathcal {V}\) of \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\).

  • For all \((\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) \in \mathcal {V}\) satisfying \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) < l^* + \eta \), we have

    $$\begin{aligned} \varphi '( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) - l^* ) \text {dist}(0, \partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) ) \ge 1. \end{aligned}$$
    (68)

Then, we define the following neighborhood sets:

$$\begin{aligned} \mathcal {V}_{\zeta }&{:=} \bigg \{ (\mathbf {y}, \hat{\mathbf {x}}, \varvec{\lambda }) ~ \bigg \vert ~ \Vert \hat{\mathbf {x}} - \hat{\mathbf {x}}^* \Vert< \zeta , \Vert \mathbf {y}- \mathbf {y}^*\Vert \nonumber \\&\quad< \bar{M}(\Vert \hat{\mathbf {A}}\Vert +1) \zeta , \Vert \varvec{\lambda } - \varvec{\lambda }^* \Vert < \zeta \bigg \} \subseteq \mathcal {V}\end{aligned}$$
(69)
$$\begin{aligned} \mathcal {V}_{\zeta , \hat{\mathbf {x}}}&{:=} \big \{ \hat{\mathbf {x}} ~ \big \vert ~ \Vert \hat{\mathbf {x}} - \hat{\mathbf {x}}^* \Vert < \zeta \big \}, \end{aligned}$$
(70)

where \(\zeta > 0\) is a small constant.

Utilizing the relations (37) and (38), as well as P2.1, we obtain that for any \(k\ge 1\), the following relation holds:

$$\begin{aligned}&\epsilon ^2 \Vert \varvec{\lambda }^k - \varvec{\lambda }^* \Vert _2^2 \le \Vert \hat{\mathbf {A}}^\top (\varvec{\lambda }^k - \varvec{\lambda }^*) \Vert _2^2 = \Vert \nabla \hat{f}(\hat{\mathbf {x}}^k) - \nabla \hat{f}(\hat{\mathbf {x}}^*) \Vert _2^2\nonumber \\&\quad = \epsilon ^2 \Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert _2^2. \end{aligned}$$
(71)

Also, the relations (37) and (38) imply that for any \(k\ge 1\), we have

$$\begin{aligned} \Vert \mathbf {B}(\mathbf {y}^k - \mathbf {y}^*) \Vert&= \left\| \hat{\mathbf {A}} (\hat{\mathbf {x}}^k - \hat{\mathbf {x}}^*) - \frac{1}{\rho }(\varvec{\lambda }^k - \varvec{\lambda }^{k-1}) \right\| \nonumber \\&\le \Vert \hat{\mathbf {A}} \Vert \Vert (\hat{\mathbf {x}}^k - \hat{\mathbf {x}}^*) \Vert + \frac{1}{\rho } \Vert \varvec{\lambda }^k - \varvec{\lambda }^{k-1} \Vert . \end{aligned}$$
(72)

Moreover, the relation (58) implies that \(\exists N_0 \ge 1\) such that \(\forall k \ge N_0\), we have

$$\begin{aligned} \Vert \varvec{\lambda }^k - \varvec{\lambda }^{k-1} \Vert \le \rho \zeta . \end{aligned}$$
(73)

Similar to (56), the full row rank of \(\mathbf {B}\) implies \(\Vert \mathbf {y}^k - \mathbf {y}^* \Vert \le \bar{M} \Vert \mathbf {B}(\mathbf {y}^k - \mathbf {y}^*) \Vert \). Then, plugging (73) into (72), we obtain that

$$\begin{aligned} \Vert \mathbf {y}^k - \mathbf {y}^* \Vert \le \bar{M}(\Vert \hat{\mathbf {A}}\Vert +1) \zeta , \end{aligned}$$
(74)

for any \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(k \ge N_0\). Combining (71) and (74), we know that if \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(k \ge N_0\), then \((\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \in \mathcal {V}_{\zeta } \subseteq \mathcal {V}\).

Moreover, (44) and (64) imply that \(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \ge l^*, \forall k \ge 1\). Besides, as \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) is a cluster point, there exists \(N \ge N_0\) such that the following relations hold:

$$\begin{aligned} \left\{ \begin{array}{l} \hat{\mathbf {x}}^N \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\\ l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^N, \hat{\mathbf {x}}^N, \varvec{\lambda }^N)< l^* + \eta \\ \Vert \hat{\mathbf {x}}^N - \hat{\mathbf {x}}^* \Vert + 2 \sqrt{(\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^N, \hat{\mathbf {x}}^N, \varvec{\lambda }^N) - l^* ) / D}\\ \quad + \frac{C}{D} \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^N, \hat{\mathbf {x}}^N, \varvec{\lambda }^N) - l^* \big ) < \zeta \end{array}\right. \end{aligned}$$
(75)

We next show that if \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) and \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) < l^* + \eta \) hold for some fixed \(k \ge N_0\), then the following relation holds

$$\begin{aligned}&\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert + \big (\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert - \Vert \hat{\mathbf {x}}^{k} - \hat{\mathbf {x}}^{k-1} \Vert \big ) \nonumber \\&\quad \le \frac{C}{D} \bigg [ \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) - l^* \big )\nonumber \\&\quad - \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) - l^* \big ) \bigg ]. \end{aligned}$$
(76)

To prove (76), we utilize the fact that \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) with \(k \ge N_0\) implies \((\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) \in \mathcal {V}_{\zeta } \subseteq \mathcal {V}\). Combining this with \(l^*< \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) < l^* + \eta \), we obtain that

$$\begin{aligned} \varphi '( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) - l^* ) \text {dist}(0, \partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) ) \ge 1. \end{aligned}$$
(77)

Combining the relations (62), (63) and (77), as well as the concavity of \(\varphi \), we obtain that

$$\begin{aligned}&C \Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^{k-1} \Vert \cdot \big [ \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) - l^* \big ) \nonumber \\&\quad - \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) - l^* \big ) \big ]\nonumber \\&\quad \ge \text {dist}(0, \partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) ) \cdot \big [ \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) - l^* \big ) \nonumber \\&\quad - \varphi \big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) - l^* \big ) \big ]\nonumber \\&\quad \ge \text {dist}(0, \partial \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) ) \cdot \varphi '\big (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) - l^* \big ) \nonumber \\&\quad \cdot \big [ \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{k+1}, \hat{\mathbf {x}}^{k+1}, \varvec{\lambda }^{k+1}) \big ]\nonumber \\&\quad \ge D \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert ^2, \end{aligned}$$
(78)

for all such k. Taking the square root on both sides of (78) and utilizing the fact that \(a+b \ge 2 \sqrt{a b}\) for \(a, b \ge 0\), we prove (76).

We then prove that \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds for all \(k \ge N\), by induction. It is true for \(k=N\) by construction, as shown in (75). For \(k=N+1\), we have

$$\begin{aligned}&\Vert \hat{\mathbf {x}}^{N+1} - \hat{\mathbf {x}}^{*} \Vert \le \Vert \hat{\mathbf {x}}^{N+1} - \hat{\mathbf {x}}^{N} \Vert + \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert \nonumber \\&\quad \le \sqrt{\big ( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^N, \hat{\mathbf {x}}^N, \varvec{\lambda }^N) - \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N+1}, \hat{\mathbf {x}}^{N+1}, \varvec{\lambda }^{N+1}) \big ) / D}\nonumber \\&\qquad + \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert \nonumber \\&\quad \le \sqrt{\big ( \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^N, \hat{\mathbf {x}}^N, \varvec{\lambda }^N) - l^* \big ) / D} + \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert < \zeta , \end{aligned}$$
(79)

where the second inequality utilizes (63), and the last inequality follows from the last relation in (75). Thus, \(\hat{\mathbf {x}}^{N+1} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds.

Next, we suppose that \(\hat{\mathbf {x}}^N, \ldots , \hat{\mathbf {x}}^{N+t-1} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) for some \(t>1\), and prove that \(\hat{\mathbf {x}}^{N+t} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) also holds, i.e.,

$$\begin{aligned}&\Vert \hat{\mathbf {x}}^{N+t} - \hat{\mathbf {x}}^{*} \Vert \le \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert + \Vert \hat{\mathbf {x}}^{N+1} - \hat{\mathbf {x}}^{N} \Vert \nonumber \\&\qquad + \sum _{i=1}^{t-1} \Vert \hat{\mathbf {x}}^{N+i+1} - \hat{\mathbf {x}}^{N+i} \Vert = \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert + 2 \Vert \hat{\mathbf {x}}^{N+1} - \hat{\mathbf {x}}^{N} \Vert \nonumber \\&\qquad - \Vert \hat{\mathbf {x}}^{N+t} - \hat{\mathbf {x}}^{N+t-1} \Vert +\sum _{i=1}^{t-1} \bigg [ \Vert \hat{\mathbf {x}}^{N+i+1} - \hat{\mathbf {x}}^{N+i} \Vert \nonumber \\&\qquad + \big ( \Vert \hat{\mathbf {x}}^{N+i+1} - \hat{\mathbf {x}}^{N+i} \Vert - \Vert \hat{\mathbf {x}}^{N+i} - \hat{\mathbf {x}}^{N+i-1} \Vert \big ) \bigg ]\nonumber \\&\quad \le \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert + 2 \Vert \hat{\mathbf {x}}^{N+1} - \hat{\mathbf {x}}^{N} \Vert \nonumber \\&\qquad + \frac{C}{D} \sum _{i=1}^{t-1} \big [ \varphi ^{N+i} - \varphi ^{N+i+1} \big ]\nonumber \\&\quad \le \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert + 2 \Vert \hat{\mathbf {x}}^{N+1} - \hat{\mathbf {x}}^{N} \Vert + \frac{C}{D} \varphi ^{N+1}\nonumber \\&\quad \le \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert + 2 \sqrt{ \frac{\mathcal {L}_{\rho , \epsilon }^N - \mathcal {L}_{\rho , \epsilon }^{N+1}}{D}} + \frac{C}{D} \varphi ^{N+1}\nonumber \\&\quad \le \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{*} \Vert + 2 \sqrt{ \frac{\mathcal {L}_{\rho , \epsilon }^N - l^* }{D}} + \frac{C}{D} \varphi ^{N+1} < \zeta \end{aligned}$$
(80)

where \(\varphi ^{N+i} = \varphi (\mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N+i}, \hat{\mathbf {x}}^{N+i}, \varvec{\lambda }^{N+i})-l^*) \) and \(\mathcal {L}_{\rho , \epsilon }^N = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^{N}, \hat{\mathbf {x}}^{N}, \varvec{\lambda }^{N})\). The second inequality follows from (76) together with dropping the non-positive term \(- \Vert \hat{\mathbf {x}}^{N+t} - \hat{\mathbf {x}}^{N+t-1} \Vert \), the third from telescoping the sum and \(\varphi \ge 0\), and the fourth from (63). The fifth inequality utilizes the fact that \(\mathcal {L}_{\rho , \epsilon }^{N+1} > l^*\), and the last inequality follows from \(\varphi ^{N+1} \le \varphi ^{N}\) (as \(\varphi \) is increasing and \(\mathcal {L}_{\rho , \epsilon }^{N+1} \le \mathcal {L}_{\rho , \epsilon }^N\)) together with the last relation in (75). Thus, \(\hat{\mathbf {x}}^{N+t} \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds. We have proved that \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) holds for all \(k \ge N\) by induction.

Then, since \(\hat{\mathbf {x}}^k \in \mathcal {V}_{\zeta , \hat{\mathbf {x}}}\) for all \(k \ge N\), we can sum both sides of (76) from \(k=N\) to \(\infty \), obtaining

$$\begin{aligned} \sum _{k=N}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert \le \frac{C}{D} \varphi ^N + \Vert \hat{\mathbf {x}}^{N} - \hat{\mathbf {x}}^{N-1} \Vert < \infty , \end{aligned}$$
(81)

which implies that \(\sum _{k=1}^{\infty } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert < \infty \) holds. Thus \(\{ \hat{\mathbf {x}}^k \}\) converges. The convergence of \(\{ \mathbf {y}^k \}\) follows from \(\mathbf {B}\mathbf {y}^{k+1} = \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \frac{1}{\rho }(\varvec{\lambda }^{k+1} -\varvec{\lambda }^k)\) in (38) and (58), as well as the surjectivity of \(\mathbf {B}\) (i.e., full row rank). The convergence of \(\{ \varvec{\lambda }^k \}\) follows from \(\nabla \hat{f}(\hat{\mathbf {x}}^{k+1}) = - \hat{\mathbf {A}}^\top \varvec{\lambda }^{k+1}\) in (37) and the surjectivity of \(\hat{\mathbf {A}}\) (i.e., full row rank). Consequently, \(\{ \mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k \}\) converges to the cluster point \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\). The conclusion that \((\mathbf {y}^*, \hat{\mathbf {x}}^*)\) is a KKT point of Problem (33) has been proved in Sect. 1.4.

1.6 \(\epsilon \)-KKT Point of the Original LS–LP Problem

Proposition 1

The globally converged solution \((\mathbf {y}^*, \mathbf {x}^*, \varvec{\lambda }^*)\) produced by the ADMM algorithm for the perturbed LS–LP problem (33) is the \(\epsilon \)-KKT solution to the original LS–LP problem (32).

Proof

The globally converged solution \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) to the perturbed LS–LP problem (33) satisfies the following relations:

$$\begin{aligned} \mathbf {B}^\top \varvec{\lambda }^* \in \partial h(\mathbf {y}^*), ~ \nabla \hat{f}(\hat{\mathbf {x}}^{*}) = - \hat{\mathbf {A}}^\top \varvec{\lambda }^*, ~ \hat{\mathbf {A}} \hat{\mathbf {x}}^* = \mathbf {B}\mathbf {y}^*. \end{aligned}$$
(82)

Recalling the definitions \(\hat{\mathbf {A}} = [\mathbf {A}, \epsilon \mathbf {I}]\), \(\hat{\mathbf {x}} = [\mathbf {x}; \bar{\mathbf {x}}]\) and \(\hat{f}(\hat{\mathbf {x}}) = f(\mathbf {x}) + \frac{\epsilon }{2} \hat{\mathbf {x}}^\top \hat{\mathbf {x}}\), the above relations imply that

$$\begin{aligned} \nabla \hat{f}(\hat{\mathbf {x}}^{*}) + \hat{\mathbf {A}}^\top \varvec{\lambda }^*&= \nabla f(\mathbf {x}^{*}) + \mathbf {A}^\top \varvec{\lambda }^* + \epsilon \mathbf {x}^* = \varvec{0} \nonumber \\&\Rightarrow \Vert \nabla f(\mathbf {x}^{*}) + \mathbf {A}^\top \varvec{\lambda }^* \Vert = \epsilon \Vert \mathbf {x}^* \Vert = O(\epsilon ),\end{aligned}$$
(83)
$$\begin{aligned} \hat{\mathbf {A}} \hat{\mathbf {x}}^* + \mathbf {B}\mathbf {y}^*&= \mathbf {A}\mathbf {x}^* + \epsilon \bar{\mathbf {x}}^* + \mathbf {B}\mathbf {y}^* = 0 \Rightarrow \Vert \mathbf {A}\mathbf {x}^* + \mathbf {B}\mathbf {y}^* \Vert \nonumber \\&= \Vert \epsilon \bar{\mathbf {x}}^* \Vert = O(\epsilon ), \end{aligned}$$
(84)

where we utilize the boundedness of \(\hat{\mathbf {x}}^*\). Thus, according to Definition 2, the globally converged point \((\mathbf {y}^*, \mathbf {x}^*)\) is the \(\epsilon \)-KKT solution to the original LS–LP problem (32). \(\square \)
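The feasibility bound in (84) can be checked mechanically: if the perturbed constraint \(\hat{\mathbf {A}} \hat{\mathbf {x}} = \mathbf {B}\mathbf {y}\) holds exactly at a point with bounded \(\bar{\mathbf {x}}\), then the residual of the original constraint is exactly \(\epsilon \Vert \bar{\mathbf {x}} \Vert \). The sketch below uses random illustrative data with \(\mathbf {B} = \mathbf {I}\), not the factor-graph matrices.

```python
import numpy as np

# Check of (84): enforce A_hat x_hat = B y exactly; then the original
# residual ||A x - B y|| equals eps * ||xbar||, i.e., O(eps).
rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))
x, xbar = rng.random(n), rng.random(m)   # bounded, as for the ADMM limit point
B = np.eye(m)
for eps in (1e-2, 1e-4):
    y = A @ x + eps * xbar               # exact perturbed feasibility (B = I)
    print(np.linalg.norm(A @ x - B @ y), eps * np.linalg.norm(xbar))  # equal
```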

1.7 Convergence Rate

Lemma 3

Firstly, without loss of generality, we can assume that \(l^* = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*) = 0\) (e.g., one can replace \(l_k = \mathcal {L}_{\rho , \epsilon }(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k)\) by \(l_k - l^*\)). We further assume that \(\mathcal {L}_{\rho , \epsilon }\) has the KL property at \((\mathbf {y}^*, \hat{\mathbf {x}}^*, \varvec{\lambda }^*)\) with the concave function \(\varphi (s) = c s^{1-p}\), where \(p \in [0, 1)\) and \(c>0\). Consequently, we obtain the following conclusions:

  (i) If \(p=0\), then \(\{(\mathbf {y}^k, \hat{\mathbf {x}}^k, \varvec{\lambda }^k)\}_{k = 1, \ldots , \infty }\) converges in finitely many steps;

  (ii) If \(p \in (0,\frac{1}{2}]\), then there exist \(c>0\) and \(\tau \in (0,1)\) such that \(\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert \le c \tau ^k\);

  (iii) If \(p \in (\frac{1}{2},1)\), then there exists \(c>0\) such that \(\Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert \le c k^{-\frac{1-p}{2p -1}} \).

Proof

(i) If \(p=0\), we define the subset \(H = \{k \in \mathbb {N}: \hat{\mathbf {x}}^{k} \ne \hat{\mathbf {x}}^{k+1} \}\). If \(k \in H\) is sufficiently large, then there exists \(C_3>0\) such that

$$\begin{aligned} \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert ^2 \ge C_3 > 0. \end{aligned}$$
(85)

Combining with (63), we have

$$\begin{aligned} l_k - l_{k+1} \ge D \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert ^2 \ge C_3 D > 0. \end{aligned}$$
(86)

If the subset H were infinite, it would contradict the fact that \(l_k - l_{k+1} \rightarrow 0\) as \(k\rightarrow \infty \). Thus, H is a finite subset, leading to the conclusion that \(\{\hat{\mathbf {x}}^k\}_{k \in \mathbb {N}}\) converges in finitely many steps. Recalling the relationships between \(\hat{\mathbf {x}}^k\) and \(\mathbf {y}^k, \varvec{\lambda }^k\) (see the discussion after (81)), we also obtain that \(\{ \mathbf {y}^k, \varvec{\lambda }^k \}_{k \in \mathbb {N}}\) converge in finitely many steps.

By defining \(\bigtriangleup _k = \sum _{i=k}^{\infty } \Vert \hat{\mathbf {x}}^{i+1} - \hat{\mathbf {x}}^i \Vert \), the inequality (81) can be rewritten as follows

$$\begin{aligned} \bigtriangleup _k \le \frac{C}{D} \varphi (l_k) + (\bigtriangleup _{k-1} - \bigtriangleup _k) < \infty . \end{aligned}$$
(87)

Besides, the KL property and \(l^* = 0\) give that

$$\begin{aligned} \varphi '(l_k) \text {dist}(0, \partial (l_k))&= c (1-p) l_k^{-p} \text {dist}(0, \partial (l_k)) \ge 1\nonumber \\&\Rightarrow l_k^p \le c(1-p) \text {dist}(0, \partial (l_k)). \end{aligned}$$
(88)

Combining with (62), we obtain

$$\begin{aligned} l_k^p&\le c(1-p) C (\bigtriangleup _{k-1} - \bigtriangleup _k) \Rightarrow \varphi (l_k) = c l_k^{1-p} \nonumber \\&\le c (c(1-p) C)^{\frac{1-p}{p}} (\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} \nonumber \\&= C_1 (\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}}. \end{aligned}$$
(89)

Then, inserting (89) into (87), we obtain

$$\begin{aligned} \bigtriangleup _k \le C_2 (\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} + (\bigtriangleup _{k-1} - \bigtriangleup _k) < \infty . \end{aligned}$$
(90)

(ii) If \(p \in (0, \frac{1}{2}]\), then \(\frac{1-p}{p}\ge 1\). Besides, since \((\bigtriangleup _{k-1} - \bigtriangleup _k) \rightarrow 0\) as \(k \rightarrow \infty \), there exists an integer \(K_0\) such that \((\bigtriangleup _{k-1} - \bigtriangleup _k) < 1\) for all \(k > K_0\), which gives \((\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} \le (\bigtriangleup _{k-1} - \bigtriangleup _k)\). Inserting this into (90), we obtain that

$$\begin{aligned} \bigtriangleup _k&\le (C_2+1) (\bigtriangleup _{k-1} - \bigtriangleup _k) \Rightarrow \bigtriangleup _k \le C_3 (\bigtriangleup _{k-1} - \bigtriangleup _k) \nonumber \\&\Rightarrow \bigtriangleup _k {\le } \frac{C_3}{1{+}C_3} \bigtriangleup _{k-1} {=} \tau \bigtriangleup _{k-1}, ~\text {with}~ \tau \in (0,1),\nonumber \\&\qquad \forall k {>} K_0. \end{aligned}$$
(91)

It is easy to deduce that \(\bigtriangleup _k \le (\bigtriangleup _{K_0} \tau ^{-K_0}) \tau ^k = \frac{c}{2} \tau ^k\), with c being a positive constant. Note that in \(\tau ^k\) the superscript denotes the k-th power of \(\tau \), in contrast to the iterate superscripts elsewhere. Combining this with \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \bigtriangleup _k\), we obtain \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \frac{c}{2} \tau ^k\) with \(\tau \in (0,1)\). Then, we have

$$\begin{aligned} \Vert \hat{\mathbf {x}}^{k+1} {-} \hat{\mathbf {x}}^k \Vert&\le \Vert \hat{\mathbf {x}}^k {-} \hat{\mathbf {x}}^* \Vert {+} \Vert \hat{\mathbf {x}}^{k+1} {-} \hat{\mathbf {x}}^* \Vert \nonumber \\&\le \frac{c}{2} (\tau ^{k+1} {+} \tau ^k) \le c \tau ^k. \end{aligned}$$
(92)

(iii) If \(p \in (\frac{1}{2}, 1)\), then \(\frac{1-p}{p}<1\), and hence \((\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} > (\bigtriangleup _{k-1} - \bigtriangleup _k)\) for all \(k > K_0\). Inserting this into (90), we obtain that

$$\begin{aligned} \bigtriangleup _k&\le (C_2+1) (\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} \Rightarrow \bigtriangleup _k\nonumber \\&\le C_3 (\bigtriangleup _{k-1} - \bigtriangleup _k)^{\frac{1-p}{p}} \Rightarrow \bigtriangleup _k^{\frac{p}{1-p}} \nonumber \\&\le C_4 (\bigtriangleup _{k-1} - \bigtriangleup _k), ~ \forall k > K_0. \end{aligned}$$
(93)

Theorem 2 of Attouch and Bolte (2009) shows that the above inequality implies \(\bigtriangleup _k \le \frac{c}{2} k^{- \frac{1-p}{2p-1}}\), with c being a positive constant. Since \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \bigtriangleup _k\), we have \(\Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert \le \frac{c}{2} k^{- \frac{1-p}{2p-1}}\). Then, we have

$$\begin{aligned} \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert&\le \Vert \hat{\mathbf {x}}^k - \hat{\mathbf {x}}^* \Vert + \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^* \Vert \nonumber \\&\le \frac{c}{2} \big (k^{- \frac{1-p}{2p-1}} + (k+1)^{- \frac{1-p}{2p-1}} \big ) \le c k^{- \frac{1-p}{2p-1}}. \end{aligned}$$
(94)

\(\square \)

Proposition 2

We adopt the same assumptions as in Lemma 3. Then,

  (i) If \(p =0\), then we obtain the \(\epsilon \)-KKT solution to the LS–LP problem in finitely many steps.

  (ii) If \(p \in (0, \frac{1}{2}]\), then we obtain the \(\epsilon \)-KKT solution to the LS–LP problem within \(O\big (\log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2\big )\) steps.

  (iii) If \(p \in ( \frac{1}{2}, 1)\), then we obtain the \(\epsilon \)-KKT solution to the LS–LP problem within \( O\big ( (\frac{1}{\epsilon })^{\frac{4p-2}{1-p}}\big )\) steps.

Proof

Conclusion (i) follows directly from Lemma 3(i).

According to the optimality condition (36), we have

$$\begin{aligned}&\text {dist}\big ( \mathbf {B}^\top \varvec{\lambda }^{k+1}, \partial h(\mathbf {y}^{k+1}) \big ) \le \Vert \rho \mathbf {B}^\top \hat{\mathbf {A}} ( \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k) \Vert _2\nonumber \\&\quad \Rightarrow \text {dist}^2\big ( \mathbf {B}^\top \varvec{\lambda }^{k+1}, \partial h(\mathbf {y}^{k+1})\big ) \le \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _{\rho ^2 \hat{\mathbf {A}}^\top \mathbf {B}\mathbf {B}^\top \hat{\mathbf {A}}}^2 \nonumber \\&\quad \le \xi _{\text {max}}(\rho ^2 \hat{\mathbf {A}}^\top \mathbf {B}\mathbf {B}^\top \hat{\mathbf {A}}) \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2^2 = O\left( \frac{1}{\epsilon ^2}\right) \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2^2\nonumber \\&\quad \Rightarrow \text {dist}\big ( \mathbf {B}^\top \varvec{\lambda }^{k+1}, \partial h(\mathbf {y}^{k+1}) \big ) \le O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2 \end{aligned}$$
(95)

According to the optimality condition (38) and the relation (40), we obtain that

$$\begin{aligned} \Vert \hat{\mathbf {A}} \hat{\mathbf {x}}^{k+1} - \mathbf {B}\mathbf {y}^{k+1} \Vert _2&= \frac{1}{\rho }\Vert \varvec{\lambda }^{k +1} - \varvec{\lambda }^{k} \Vert _2 \le \frac{1}{\rho } \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2 \nonumber \\&\le \epsilon \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^k \Vert _2 \le O\left( \frac{1}{\epsilon }\right) \nonumber \\&\quad \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2. \end{aligned}$$
(96)

According to Lemma 3, we have

  (ii) If \(p \in (0, \frac{1}{2}]\), then

    $$\begin{aligned} O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2\le & {} O\left( \frac{1}{\epsilon }\right) \tau ^k \le O(\epsilon )\\\Rightarrow & {} k \ge O\left( \log _{\frac{1}{\tau }}\left( \frac{1}{\epsilon }\right) ^2\right) , \end{aligned}$$

    which means that when \(k \ge O\big (\log _{\frac{1}{\tau }}(\frac{1}{\epsilon })^2\big )\), we obtain the \(\epsilon \)-KKT solution to the perturbed LS–LP problem, i.e., the \(\epsilon \)-KKT solution to the original LS–LP problem.

  (iii) If \(p \in ( \frac{1}{2}, 1)\), then

    $$\begin{aligned} O\left( \frac{1}{\epsilon }\right) \cdot \Vert \hat{\mathbf {x}}^{k+1} - \hat{\mathbf {x}}^{k} \Vert _2\le & {} O\left( \frac{1}{\epsilon }\right) k^{- \frac{1-p}{2p-1}} \le O(\epsilon )\\\Rightarrow & {} k \ge O\left( \left( \frac{1}{\epsilon }\right) ^{\frac{4p-2}{1-p}}\right) , \end{aligned}$$

    which means that when \(k \ge O\big ( (\frac{1}{\epsilon })^{\frac{4p-2}{1-p}}\big )\), we obtain the \(\epsilon \)-KKT solution to the perturbed LS–LP problem, i.e., the \(\epsilon \)-KKT solution to the original LS–LP problem. A small numeric sketch of these iteration bounds follows the proof. \(\square \)
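To give a feel for the bounds in Proposition 2, the following sketch evaluates the two iteration counts for illustrative values of the KL exponent \(p\) and contraction factor \(\tau \) (both assumed here, not estimated from a real model).

```python
import math

# Iteration counts suggested by Proposition 2 (illustrative p and tau).
def iters_to_eps_kkt(eps, p, tau=0.9):
    if p == 0:
        return "finite (case i)"
    if p <= 0.5:                 # case ii: k >= log_{1/tau}((1/eps)^2)
        return math.ceil(2 * math.log(1 / eps) / math.log(1 / tau))
    return math.ceil((1 / eps) ** ((4 * p - 2) / (1 - p)))  # case iii

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, iters_to_eps_kkt(eps, p=0.5), iters_to_eps_kkt(eps, p=0.75))
```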

Cite this article

Wu, B., Shen, L., Zhang, T. et al. MAP Inference Via \(\ell _2\)-Sphere Linear Program Reformulation. Int J Comput Vis 128, 1913–1936 (2020). https://doi.org/10.1007/s11263-020-01313-2