Zero-norm regularized problems: equivalent surrogates, proximal MM method and statistical error bound

Computational Optimization and Applications

Abstract

For the zero-norm regularized problem, we verify that the penalty problem of its equivalent MPEC reformulation is a global exact penalty, which yields a family of equivalent surrogates. For a subfamily of these surrogates, the critical point set is shown to coincide with the d-directional stationary point set, and a critical point with no overly small nonzero component is a strongly local optimal solution of both the surrogate problem and the zero-norm regularized problem. We also develop a proximal majorization-minimization (MM) method for solving the DC (difference of convex functions) surrogates and provide its global and linear convergence analysis. For the limit of the generated sequence, a statistical error bound is established under a mild condition, which certifies its good quality from a statistical perspective. Numerical comparisons with ADMM for solving the DC surrogate and with APG for solving its partially smoothed form indicate that our proximal MM method, armed with an inexact dual PPA plus the semismooth Newton method (PMMSN for short), is remarkably superior to ADMM and APG in terms of solution quality and CPU time.
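To give a concrete, self-contained picture of the majorization-minimization scheme summarized above, the following minimal Python sketch runs one possible variant: it assumes a least-squares loss, instantiates the DC surrogate with the capped-\(\ell_1\) weighting, and solves each convex weighted-\(\ell_1\) subproblem by plain ISTA instead of the paper's inexact dual PPA with semismooth Newton steps (PMMSN). All function names, parameter values and stopping rules below are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def weighted_soft_threshold(z, w):
    """Proximal mapping of x -> <w, |x|>: componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - w, 0.0)

def proximal_mm(A, b, lam=0.1, rho=10.0, mu=1e-3, outer=30, inner=200):
    """Schematic proximal MM for
        min_x (1/(2n))||Ax-b||^2 + (mu/2)||x||^2 + lam*sum_i min(|x_i|, 1/rho),
    handling the capped-l1 term by linearizing its concave part (the MM step),
    which leaves a convex weighted-l1 subproblem.  Illustrative only."""
    n, p = A.shape
    x = np.zeros(p)
    L = np.linalg.norm(A, 2) ** 2 / n + mu   # Lipschitz constant of the smooth part
    for _ in range(outer):
        # MM weights: keep the full l1 weight where |x_i| < 1/rho,
        # drop it (weight 0) where the cap is already active.
        v = np.where(np.abs(x) < 1.0 / rho, 1.0, 0.0)
        y = x.copy()
        for _ in range(inner):               # ISTA on the convex subproblem
            grad = A.T @ (A @ y - b) / n + mu * y
            y = weighted_soft_threshold(y - grad / L, lam * v / L)
        x = y
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 300))
    x_true = np.zeros(300)
    x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(100)
    x_hat = proximal_mm(A, b)
    print("estimated support:", np.flatnonzero(np.abs(x_hat) > 1e-4))
```

The MM step here is the standard DC linearization: writing \(\min (\vert t\vert ,1/\rho )=\vert t\vert -\max (\vert t\vert -1/\rho ,0)\) and linearizing the concave part at the current iterate gives the 0/1 weights \(v\) above. This matches the general weighted-\(\ell _1\) subproblem structure of the method analyzed in Appendix B, where smoother choices of the generating function give fractional weights in \([0,1]^p\) rather than the 0/1 weights of this sketch.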


Data availability

The data used to form the test problems in Subsection 5.4 are freely available at https://www.csie.ntu.edu.tw.

References

  1. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116, 5–16 (2009)

  2. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)

3. Belloni, A., Chernozhukov, V., Wang, L.: Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98, 791–806 (2011)

  4. Bi, S.J., Liu, X.L., Pan, S.H.: Exact penalty decomposition method for zero-norm minimization based on MPEC formulation. SIAM J. Sci. Comput. 36, A1451–A1477 (2014)

  5. Bian, W., Chen, X.J.: A smoothing proximal gradient algorithm for nonsmooth convex regression with cardinality penalty. SIAM J. Numer. Anal. 58, 858–883 (2020)

  6. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146, 459–494 (2014)

  7. Bot, R.I., Nguyen, D.K.: The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. Math. Oper. Res. 45, 1–31 (2018)

  8. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: Proceeding of ICML (1998)

  9. Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51, 34–81 (2009)

  10. Cao, S.S., Huo, X.M., Pang, J.S.: A unifying framework of high-dimensional sparse estimation with difference-of-convex (DC) regularizations, arXiv:1812.07130 (2018)

  11. Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14, 707–710 (2007)

  12. Chen, X.J., Xu, F.M., Ye, Y.Y.: Lower bound theory of nonzero entries in solutions of \(\ell _2\)-\(\ell _p\) minimization. SIAM J. Sci. Comput. 32, 2832–2852 (2010)

  13. Clarke, F.H.: Optimization and Nonsmooth Analysis. John Wiley and Sons, New York (1983)

  14. Cui, Y., Chang, T.H., Hong, M., Pang, J.S.: A study of piecewise linear-quadratic programs. J. Optim. Theory Appl. 186, 523–553 (2020)

  15. Cui, Y., Pang, J.S.: Modern Nonconvex Nondifferentiable Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2022)

  16. Cui, Y., Sun, D.F., Toh, K.C.: On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming. Math. Program. 178, 381–415 (2019)

  17. Donoho, D.L., Stark, B.F.: Uncertainty principles and signal recovery. SIAM J. Appl. Math. 49, 906–931 (1989)

  18. Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theory 52, 1289–1306 (2006)

  19. Dontchev, A.L., Rockafellar, R.T.: Implicit Functions and Solution Mappings-a View from Variational Analysis. Springer Monographs in Mathematics, LLC, New York (2009)

  20. Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, New York (2003)

  21. Fan, J.Q., Li, R.Z.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)

  22. Fan, J.Q., Xue, L.Z., Zou, H.: Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 42, 819–849 (2014)

  23. Feng, M.B., Mitchell, J.E., Pang, J.S., Shen, X., Wächter, A.: Complementarity formulations of \(\ell _0\)-norm optimization problems. Pac. J. Optim. 14, 273–305 (2018)

  24. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2, 17–40 (1976)

  25. Gotoh, J.Y., Takeda, A., Tono, K.: DC formulations and algorithms for sparse optimization problems. Math. Program. 169, 141–176 (2018)

  26. Gu, Y.W., Fan, J., Kong, L.C., Ma, S.Q., Zou, H.: ADMM for high-dimensional sparse penalized quantile regression. Technometrics 60, 319–331 (2018)

  27. Hiriart-Urruty, J.B., Strodiot, J.J., Nguyen, V.H.: Generalized Hessian matrix and second-order optimality conditions for problems with \(C^{1,1}\) data. Appl. Math. Optim. 11, 43–56 (1984)

28. Huang, J., Jiao, Y., Liu, Y., Lu, X.: A constructive approach to \(L_0\) penalized regression. J. Mach. Learn. Res. 19, 1–37 (2018)

  29. Ioffe, A.D., Outrata, J.V.: On metric and calmness qualification conditions in subdifferential calculus. Set-Valued Anal. 16, 199–227 (2008)

  30. Le, H.Y.: Generalized subdifferentials of the rank function. Optim. Lett. 7, 731–743 (2013)

  31. Le Thi, H.A., Le, H.M., Pham Dinh, T.: Feature selection in machine learning: an exact penalty approach using a difference of convex function algorithm. Mach. Learn. 101, 163–186 (2015)

32. Le Thi, H.A., Pham Dinh, T.: DC programming and DCA: thirty years of developments. Math. Program. 169, 5–68 (2018)

  33. Li, G.Y., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)

  34. Liu, Y.L., Bi, S.J., Pan, S.H.: Equivalent Lipschitz surrogates for zero-norm and rank optimization problems. J. Global Optim. 72, 679–704 (2018)

  35. Lu, Z.: Iterative hard thresholding methods for \(\ell _0\) regularized convex cone programming. Math. Program. 147, 125–154 (2014)

  36. Loh, P.L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2015)

  37. Mangasarian, O.L.: Machine learning via polyhedral concave minimization. In: Fischer, H., Riedmueller, B., Schaeffler, S. (eds.) Applied Mathematics and Parallel Computing-Festschrift for Klaus Ritter, pp. 175–188. Physica-Verlag, Heidelberg (1996)

  38. Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control. Optim. 15, 959–972 (1977)

  39. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math. Dokl. 27, 372–376 (1983)

  40. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York (1970)

  41. Pan, S.H., Liu, Y.L.: Subregularity of subdifferential mappings relative to the critical set and KL property of exponent 1/2, arXiv:1812.00558v3

  42. Pang, J.S., Razaviyayn, M., Alvarado, A.: Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42, 95–118 (2017)

  43. Pham Dinh, T., Le Thi, H.A.: Convex analysis approach to DC programming: theory, algorithms and applications. Acta Math. Vietnamica 22, 289–355 (1997)

  44. Qi, L.Q., Sun, J.: A nonsmooth version of Newton’s method. Math. Program. 58, 353–367 (1993)

45. Qian, Y.T., Pan, S.H., Liu, Y.L.: Calmness of partial perturbation to composite rank constraint systems and its applications, arXiv:2102.10373v2 (2021)

  46. Raskutti, G., Wainwright, M.J.: Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11, 2241–2259 (2010)

  47. Rinaldi, F., Schoen, F., Sciandrone, M.: Concave programming for minimizing the zero-norm over polyhedral sets. Comput. Optim. Appl. 46, 467–486 (2010)

  48. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1, 97–116 (1976)

  49. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

  50. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, Cham (1998)

  51. Robinson, S.M.: Some continuity properties of polyhedral multifunctions. Math. Program. Study 14, 206–214 (1981)

52. Soubies, E., Blanc-Féraud, L., Aubert, G.: A unified view of exact continuous penalties for \(\ell _2\)-\(\ell _0\) minimization. SIAM J. Optim. 27, 2034–2060 (2017)

  53. Tang, P.P., Wang, C.J., Sun, D.F., Toh, K.C.: A sparse semismooth Newton based proximal majorization-minimization algorithm for nonconvex square-root-loss regression problems. J. Mach. Learn. Res. 21, 1–38 (2020)

  54. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B 58, 267–288 (1996)

  55. Wang, L., Wu, Y.C., Li, R.Z.: Quantile regression for analyzing heterogeneity in ultra high dimension. J. Am. Stat. Assoc. 107, 214–222 (2012)

  56. Wang, Y., Yin, W.T., Zeng, J.S.: Global convergence of ADMM in nonconvex nonsmooth optimization. J. Sci. Comput. 78, 29–63 (2019)

  57. Weston, J., Elisseef, A., Schölkopf, B., Tipping, M.: Use of the zero norm with linear models and kernel methods. J. Mach. Learn. Res. 3, 1439–1461 (2003)

  58. Wen, B., Chen, X.J., Pong, T.K.: Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM J. Optim. 27, 124–145 (2017)

59. Wright, J., Ma, Y.: Dense error correction via \(\ell _1\)-minimization. IEEE Trans. Inf. Theory 56, 3540–3560 (2010)

  60. Wu, F., Bian, W.: Accelerated iterative hard thresholding algorithm for \(\ell _0\) regularized regression problem. J. Global Optim. 76, 819–840 (2020)

  61. Wu, F., Bian, W., Xue, X.P.: Smoothing fast iterative hard thresholding algorithm for \(\ell _0\) regularized nonsmooth convex regression problem, arXiv:2104.13107v1

  62. Ye, J.J., Zhu, D.L.: Optimality conditions for bilevel programming problems. Optimization 33, 9–27 (1995)

  63. Ye, J.J., Zhu, D.L., Zhu, Q.J.: Exact penalization and necessary optimality conditions for generalized bilevel programming problems. SIAM J. Optim. 7, 481–507 (1997)

  64. Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010)

  65. Zhang, X., Zhang, X.Q.: A new proximal iterative hard thresholding method with extrapolation for \(\ell _0\) minimization. J. Sci. Comput. 79, 809–826 (2019)

  66. Zhao, X.Y., Sun, D.F., Toh, K.C.: A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM J. Optim. 20, 1737–1765 (2010)

Acknowledgements

The first two authors would like to express their sincere thanks to Prof. Kim-Chuan Toh of the National University of Singapore for helpful suggestions on the implementation of Algorithm A.1 during a visit to SCUT, and to thank Prof. Liping Zhu of Renmin University of China for helpful discussions on Theorem 5.

Funding

Funding was provided by the National Natural Science Foundation of China under project No. 11971177 and by the Hong Kong Research Grants Council under grant No. 15304019.

Author information

Corresponding author

Correspondence to Shujun Bi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest related to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Proposition 3

The following two technical lemmas are needed for the proof of Proposition 3.

Lemma 6

Fix any \(\nu >0\) and \(\mu >0\). Let \(h_{\nu }(x){:}{=}\nu \Vert x\Vert _0\) for \(x\in {\mathbb {R}}^p\). If \(\vartheta \) is regular and strictly continuous relative to \(\textrm{dom}\vartheta \), then for any \(x\in \textrm{dom}f\) and \(\zeta \in {\mathbb {R}}^p\),

$$\begin{aligned} {\widehat{\partial }}\Theta _{\nu ,\mu }(x)&=\partial \Theta _{\nu ,\mu }(x)=\partial \!f_{\!\mu }(x)+\partial h_{\nu }(x), \end{aligned}$$
(A1)
$$\begin{aligned} {\widehat{d}}\Theta _{\nu ,\mu }(x)(\zeta )&=d\Theta _{\nu ,\mu }(x)(\zeta )=df_{\!\mu }(x)(\zeta )+dh_{\nu }(x)(\zeta ). \end{aligned}$$
(A2)

Proof

Fix any \(x\in \textrm{dom}f\) and \(\zeta \in {\mathbb {R}}^p\). Since \(\vartheta \) is strictly continuous relative to \(\textrm{dom}\vartheta \), by the expression of \(f_{\!\mu }\) in (4), the function \(f_{\!\mu }\) can be rewritten as \({\widetilde{f}}_{\!\mu }+\delta _{\textrm{dom}f}\) where \({\widetilde{f}}_{\!\mu }\) is a finite strictly continuous function on \({\mathbb {R}}^p\). Clearly, \({\widetilde{f}}_{\!\mu }\) is regular by the regularity of \(\vartheta \), and \(\delta _{\textrm{dom}f}\) is also regular by the polyhedrality of \(\textrm{dom}\vartheta \). By invoking [50, Exercise 10.10] and the first inclusion of [50, Corollary 10.9], it is not hard to obtain that

$$\begin{aligned} \partial \!{\widetilde{f}}_{\!\mu }(x) +{\widehat{\partial }}(\delta _{\textrm{dom}f}\!+\!h_{\nu })(x) \subseteq {\widehat{\partial }}\Theta _{\nu ,\mu }(x) \subseteq \partial \Theta _{\nu ,\mu }(x)\subseteq \partial \!{\widetilde{f}}_{\!\mu }(x) +\partial (\delta _{\textrm{dom}f}\!+\!h_{\nu })(x). \end{aligned}$$

Since \(\textrm{epi}\,h_{\nu }\) is a union of finitely many polyhedral sets and \(\textrm{dom}f\) is polyhedral, from [29, Page 213] it follows that \(\partial (\delta _{\textrm{dom}f}+h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial h_{\nu }(x)\) and \(\partial ^{\infty }(\delta _{\textrm{dom}f}+h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial ^{\infty }h_{\nu }(x)\). The first inclusion, along with the first inclusion of [50, Corollary 10.9] and the regularity of \(\delta _{\textrm{dom}f}\) and \(h_{\nu }\), implies that

$$\begin{aligned} {\mathcal {N}}_{\textrm{dom}f}(x)+\partial h_{\nu }(x) \subseteq {\widehat{\partial }}(\delta _{\textrm{dom}f}+h_{\nu })(x) \subseteq \partial (\delta _{\textrm{dom}f}+h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial h_{\nu }(x). \end{aligned}$$

The regularity of \(h_{\nu }\) is implied by [30, Theorem 1]. The last two display chains together imply the equalities in (A1). By the strict continuity of \({\widetilde{f}}_{\!\mu }\) and [50, Corollary 10.9],

$$\begin{aligned} d\Theta _{\nu ,\mu }(x)(\zeta )&\ge d{\widetilde{f}}_{\!\mu }(x)(\zeta )\!+\!d\delta _{\textrm{dom}f}(x)(\zeta )\!+\!dh_{\nu }(x)(\zeta ) =df_{\!\mu }(x)(\zeta )+dh_{\nu }(x)(\zeta ),\nonumber \\ {\widehat{d}}\Theta _{\nu ,\mu }(x)(\zeta )&\le {\widehat{d}}{\widetilde{f}}_{\!\mu }(x)(\zeta )\!+\!{\widehat{d}}(\delta _{\textrm{dom}f}\!+\!h_{\nu })(x)(\zeta )\nonumber \\&\le {\widehat{d}}{\widetilde{f}}_{\!\mu }(x)(\zeta ) +{\widehat{d}}\delta _{\textrm{dom}f}(x)(\zeta )+{\widehat{d}}h_{\nu }(x)(\zeta )\nonumber \\&=d{\widetilde{f}}_{\!\mu }(x)(\zeta ) +d\delta _{\textrm{dom}f}(x)(\zeta )+dh_{\nu }(x)(\zeta )\nonumber \\&=df_{\!\mu }(x)(\zeta )+dh_{\nu }(x)(\zeta ) \end{aligned}$$
(A3)

where the second inequality in (A3) is due to \(\partial ^{\infty }(\delta _{\textrm{dom}f}\!+\!h_{\nu })(x)\subseteq {\mathcal {N}}_{\textrm{dom}f}(x)+\partial ^{\infty }h_{\nu }(x)\) and [50, Exercise 8.23], and the first equality in (A3) is due to the regularity of \({\widetilde{f}}_{\!\mu },h_{\nu }\) and \(\textrm{dom}f\). Note that \({\widehat{d}}\Theta _{\nu ,\mu }(x)(\zeta )\ge d\Theta _{\nu ,\mu }(x)(\zeta )\). From the last two inequalities, we obtain the equalities in (A2). The proof is completed. \(\square \)

Lemma 7

Pick any \(\phi \in \!{\mathscr {L}}_{\sigma ,\gamma }\). The associated function \(g_{\rho }\) for any \(\rho >0\) is continuously differentiable on \({\mathbb {R}}^p\).

Proof

Recall that \(\psi ^*\) is a finite nondecreasing convex function on \({\mathbb {R}}\). Since \(\phi \in {\mathscr {L}}_{\sigma ,\gamma }\) is strongly convex on [0, 1] with modulus \(\sigma \), by [49, Theorem 26.3] and [50, Proposition 12.60], \(\psi ^*\) is smooth on \({\mathbb {R}}\) and \((\psi ^*)'\) is Lipschitz continuous with constant \(1/\sigma \). Thus, by the expression of \(g_{\rho }\), it suffices to argue that \(h(t){:}{=}\rho ^{-1}\psi ^*(\rho \vert t\vert )\) for \(t\in {\mathbb {R}}\) is continuously differentiable at \(t=0\). Indeed, by the assumption on \(\phi \), it is easy to verify that \(\psi ^*(s)=0\) for all \(s\in [0,\gamma ]\). Then \(h(t)=0\) for all \(\vert t\vert \le \gamma /\rho \), so \(h\) is differentiable at \(t=0\) with \(h'(0)=0\). \(\square \)
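To see the mechanism of Lemma 7 in closed form, consider an illustrative instance (our own example, not taken from the paper): assume, as the identity \(\textrm{dom}\,\partial \psi =[0,1]\) used below suggests, that \(\psi =\phi +\delta _{[0,1]}\), and take \(\phi (t)=\frac{\sigma }{2}t^2+(1-\frac{\sigma }{2})t\) with \(\sigma \in (0,2)\), so that \(\phi (0)=0\), \(\phi (1)=1\), \(\phi \) is strongly convex on [0, 1] with modulus \(\sigma \), and \(\gamma =\phi _{+}'(0)=1-\frac{\sigma }{2}>0\). A direct computation of \(\psi ^*(s)=\sup _{t\in [0,1]}\{st-\phi (t)\}\) gives

$$\begin{aligned} \psi ^*(s)=\left\{ \begin{array}{cl} 0 &{} \textrm{if}\ s\le \gamma ,\\ \frac{(s-\gamma )^2}{2\sigma } &{} \textrm{if}\ \gamma<s<\gamma +\sigma ,\\ s-1 &{} \textrm{if}\ s\ge \gamma +\sigma , \end{array}\right. \qquad h(t)=\rho ^{-1}\psi ^*(\rho \vert t\vert ), \end{aligned}$$

so \(h\) vanishes on \([-\gamma /\rho ,\gamma /\rho ]\) and is continuously differentiable with \(h'(0)=0\), exactly as the proof asserts. Moreover, since \(\gamma +\frac{\sigma }{2}=1\) and \(\phi _{+}'(1)=\gamma +\sigma \), the induced penalty \(\lambda (\vert t\vert -h(t))\) increases from 0 and equals the constant \(\lambda /\rho \) for all \(\vert t\vert \ge \phi _{+}'(1)/\rho \), an MCP-type shape.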

Proof of Proposition 3

(i) Since the range of \(\partial \psi ^*\) is contained in \(\textrm{dom}\partial \psi =[0,1]\), for any \(x\in {\mathbb {R}}^p\) it holds that \(\Vert x\Vert _1-g_{\rho }(x)\ge 0\). Together with the nonnegativity and coerciveness of \(f_{\!\mu }\), it follows that \(\Theta _{\!\rho ,\nu ,\mu }\) is nonnegative and coercive. Fix any \(x\in \textrm{dom}f\). From Lemma 7 and [50, Exercise 8.8], \({\widehat{\partial }}\Theta _{\!\rho ,\nu ,\mu }(x)=\partial \Theta _{\!\rho ,\nu ,\mu }(x) =\partial (f_{\!\mu }\!+\!\rho \nu \Vert \cdot \Vert _1)(x)\!-\!\rho \nu \nabla g_{\rho }(x)\). Recall that \([\textrm{Im}(A)-b]\cap \textrm{dom}\vartheta \ne \emptyset \). By the convexity of \(\vartheta \) and [49, Theorem 23.9],

$$\begin{aligned} \partial (f_{\!\mu }\!+\!\rho \nu \Vert \cdot \Vert _1)(x) =A^{{\mathbb {T}}}\partial \vartheta (Ax-b)+\mu x+\rho \nu \partial \Vert x\Vert _1. \end{aligned}$$

The characterization on the regular and limiting subdifferentials of \(\Theta _{\!\rho ,\nu ,\mu }\) then holds.

(ii) By the definition of d-stationary point for DC program (see [42, Section 3]), a point \(x\in \textrm{dom}f\) is a d-stationary point of (14) iff \(\rho \nu \nabla \!g_{\rho }(x)\in \partial (f_{\!\mu }\!+\!\rho \nu \Vert \cdot \Vert _1)(x)\), which by part (i) is equivalent to saying that \(x\in \textrm{dom}f\) is a limiting critical point of \(\Theta _{\!\rho ,\nu ,\mu }\).

(iii) By [50, Theorem 13.24 (c)], it suffices to argue that \(d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0\) for all \(\zeta \ne 0\). Fix any \(\zeta \in {\mathbb {R}}^p\backslash \{0\}\). Let \(\varphi _{\!\rho ,\lambda }(x)\!{:}{=}\lambda [\Vert x\Vert _1\!-\!g_{\rho }(x)]\) with \(\lambda =\rho \nu \) for \(x\in {\mathbb {R}}^p\). Clearly, \(\varphi _{\!\rho ,\lambda }\) is Lipschitz continuous and regular by the smoothness of \(g_{\rho }\). Note that \(\Theta _{\!\rho ,\nu ,\mu }\!=\!f_{\!\mu }+\varphi _{\!\rho ,\lambda }\). By invoking [50, Proposition 13.19], it follows that

$$\begin{aligned} d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta ) \!\ge \!\sup _{u\in \partial \!f_{\!\mu }({\overline{x}}),v\in \partial \varphi _{\!\rho ,\lambda }({\overline{x}})} \!\Big \{d^2\!f_{\!\mu }({\overline{x}}\,\vert \,u)(\zeta )+ d^2\!\varphi _{\!\rho ,\lambda }({\overline{x}}\vert v)(\zeta ) \ \ \mathrm{s.t.}\ \ u+v=0\Big \}. \end{aligned}$$
(A4)

Recall that \(f_{\!\mu }\) is strongly convex with modulus \(\mu \). By Definition 3, we have that

$$\begin{aligned} d^2\!f_{\!\mu }({\overline{x}}\,\vert \,u)(\zeta )\ge \mu \Vert \zeta \Vert ^2>0 \quad \forall u\in \partial \!f_{\!\mu }({\overline{x}}). \end{aligned}$$
(A5)

Fix any \(v\in \partial \varphi _{\!\rho ,\lambda }({\overline{x}})\). Since \(\varphi _{\!\rho ,\lambda }\) is Lipschitz and directionally differentiable,

$$\begin{aligned} \langle v,\zeta \rangle \le d\varphi _{\!\rho ,\lambda }({\overline{x}})(\zeta ) =\varphi _{\!\rho ,\lambda }'({\overline{x}},\zeta ) =\lambda (\Vert \cdot \Vert _1)'({\overline{x}},\zeta )-\lambda \langle \nabla g_{\rho }({\overline{x}}),\zeta \rangle . \end{aligned}$$

By [50, Proposition 13.5], \(d^2\varphi _{\!\rho ,\lambda }({\overline{x}}\vert v)(\zeta )=+\infty \) when \(d\varphi _{\!\rho ,\lambda }({\overline{x}})(\zeta )>\langle v,\zeta \rangle \). This, together with (A4)-(A5), implies that \(d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0\), so it suffices to consider the case \(\varphi _{\!\rho ,\lambda }'({\overline{x}},\zeta )=\langle v,\zeta \rangle \). In this case, from Definition 3 it follows that

$$\begin{aligned} d^2\varphi _{\!\rho ,\lambda }({\overline{x}}\vert v)(\zeta )&=\liminf _{\begin{array}{c} {\tau \downarrow 0}\\ {\zeta '\rightarrow \zeta } \end{array}} \frac{\varphi _{\!\rho ,\lambda }({\overline{x}}+\tau \zeta ')\!-\!\varphi _{\!\rho ,\lambda }({\overline{x}}) \!-\!\tau \varphi _{\!\rho ,\lambda }'({\overline{x}},\zeta ')}{\frac{1}{2}\tau ^2}\nonumber \\&=\lambda \liminf _{\begin{array}{c} {\tau \downarrow 0}\\ {\zeta '\rightarrow \zeta } \end{array}} \frac{-g_{\rho }({\overline{x}}\!+\!\tau \zeta ')+g_{\rho }({\overline{x}}) +\tau \langle \nabla g_{\rho }({\overline{x}}),\zeta '\rangle }{\frac{1}{2}\tau ^2}, \end{aligned}$$
(A6)

where the second equality is because \(\Vert {\overline{x}}+\tau \zeta '\Vert _1-\Vert {\overline{x}}\Vert _1-\tau (\Vert \cdot \Vert _1)'({\overline{x}},\zeta ')=0\) for any \(\tau >0\) small enough. Let \(h(t){:}{=}\rho ^{-1}\psi ^*(\rho \vert t\vert )\) for \(t\in {\mathbb {R}}\). Clearly, \(g_{\rho }(z)=\sum _{i=1}^ph(z_i)\) for \(z\in {\mathbb {R}}^p\). When \(i\notin \textrm{supp}({\overline{x}})\), from the proof of Lemma 7, for all \(\tau >0\) small enough, we have \(h({\overline{x}}_i\!+\!\tau \zeta _i')-h({\overline{x}}_i)-\tau h'({\overline{x}}_i)\zeta _i'=0\). When \(i\in \textrm{supp}({\overline{x}})\), by noting that \(\psi ^*(s)=s-\phi (1)\) for all \(s\ge \phi _{+}'(1)\) and using the assumption \(\vert {\overline{x}}\vert _\textrm{nz}\ge {\phi _{+}'(1)}/{\rho }\),

$$\begin{aligned} h({\overline{x}}_i\!+\!\tau \zeta _i')-h({\overline{x}}_i)-\tau h'({\overline{x}}_i)\zeta _i' =\vert {\overline{x}}_i\!+\!\tau \zeta _i'\vert -\vert {\overline{x}}_i\vert -\tau \textrm{sign}({\overline{x}}_i)\zeta _i'=0 \end{aligned}$$

for all sufficiently small \(\tau >0\). This means that, for all \(\tau >0\) small enough,

$$\begin{aligned} -g_{\rho }({\overline{x}}+\tau \zeta ')+g_{\rho }({\overline{x}}) +\tau \langle \nabla g_{\rho }({\overline{x}}),\zeta '\rangle =0. \end{aligned}$$

By combining this with (A6), we obtain from (A4)-(A5) that \(d^2\Theta _{\!\rho ,\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0\).

(iv) By Lemma 6, \(\widehat{\textrm{crit}}\,\Theta _{\nu ,\mu }=\textrm{crit}\,\Theta _{\nu ,\mu }\). We next argue that \({\overline{x}}\in \widehat{\textrm{crit}}\,\Theta _{\nu ,\mu }\). Since \({\overline{x}}\in \textrm{crit}\,\Theta _{\!\rho ,\nu ,\mu }\), from part (i) it follows that

$$\begin{aligned} 0\in A^{{\mathbb {T}}}\partial \vartheta (A{\overline{x}}\!-b)+\mu {\overline{x}} +\rho \nu \big [(1\!-\!(w_\rho ({\overline{x}}))_1)\partial \vert {\overline{x}}_1\vert \times \cdots \times (1\!-\!(w_\rho ({\overline{x}}))_p)\partial \vert {\overline{x}}_p\vert \big ] \end{aligned}$$

where \([w_{\rho }({\overline{x}})]_i=(\psi ^*)'(\rho \vert {\overline{x}}_i\vert )\) for \(i=1,2,\ldots ,p\). By the definition of \(\psi ^*\), it is easy to deduce that \(\psi ^*(s)=s-\phi (1)\) for all \(s\ge \phi _{+}'(1)\). Together with \(\vert {\overline{x}}\vert _{\textrm{nz}}\ge \phi _{+}'(1)/\rho \), it holds that \([w_{\rho }({\overline{x}})]_i=(\psi ^*)'(\rho \vert {\overline{x}}_i\vert )=1\) for all \(i\in \textrm{supp}({\overline{x}})\). From [30, Theorem 1], we know that \({\widehat{\partial }}\Vert {\overline{x}}\Vert _0=\{v\in {\mathbb {R}}^p\,\vert \,v_i=0\ \textrm{for}\ i\in \textrm{supp}({\overline{x}})\}\). This means that

$$\begin{aligned} \rho \nu \big [(1\!-\!(w_\rho ({\overline{x}}))_1)\partial \vert {\overline{x}}_1\vert \times \cdots \times (1\!-\!(w_\rho ({\overline{x}}))_p)\partial \vert {\overline{x}}_p\vert \big ] \subseteq \nu {\widehat{\partial }}\Vert {\overline{x}}\Vert _0. \end{aligned}$$

From the last two equations, \(0\in A^{{\mathbb {T}}}\partial \vartheta (A{\overline{x}}\!-b)+\mu {\overline{x}} +\nu {\widehat{\partial }}\Vert {\overline{x}}\Vert _0={\widehat{\partial }}\Theta _{\nu ,\mu }({\overline{x}})\), where the equality is due to Lemma 6. This means that \({\overline{x}}\in \widehat{\textrm{crit}}\,\Theta _{\nu ,\mu }\). For the rest, it suffices to argue that every point in \(\textrm{crit}\,\Theta _{\nu ,\mu }\) is a strongly local optimal solution of (1). Pick any \({\overline{x}}\in \textrm{crit}\,\Theta _{\nu ,\mu }\). We only need to argue that \(d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0\) for all \(\zeta \ne 0\). Fix any \(0\ne \zeta \in {\mathbb {R}}^p\). By combining Lemma 6 with [50, Proposition 13.19], it holds that

$$\begin{aligned} d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta ) \!\ge \!\sup _{u\in \partial \!f_{\mu }({\overline{x}}),v\in \partial h_{\nu }({\overline{x}})} \!\Big \{ d^2\!f_{\mu }({\overline{x}}\vert u)(\zeta )\!+\! d^2h_{\nu }({\overline{x}}\vert v)(\zeta ) \ \ \mathrm{s.t.}\ u+v=0\Big \} \end{aligned}$$
(A7)

where \(h_{\nu }\) is the same as in Lemma 6. Fix any \(v\in \partial h_{\nu }({\overline{x}})\). Let \({\overline{J}}\!{:}{=}\{1,\ldots ,p\}\backslash \textrm{supp}({\overline{x}})\). Then, \(\langle v,\zeta \rangle =\langle v_{{\overline{J}}},\zeta _{{\overline{J}}}\rangle \). A simple calculation yields \( dh_{\nu }({\overline{x}})(\zeta )={\textstyle \sum _{i\in {\overline{J}}}}\,\delta _{\{0\}}(\zeta _i). \) This means that \(dh_{\nu }({\overline{x}})(\zeta )\ge \langle v,\zeta \rangle \). When \(dh_{\nu }({\overline{x}})(\zeta )>\langle v,\zeta \rangle \), by [50, Proposition 13.5] we have \(d^2h_{\nu }({\overline{x}}\vert v)(\zeta )=+\infty \). This along with (A5) and (A7) means that \(d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0\), so it suffices to consider the case \(dh_{\nu }({\overline{x}})(\zeta )=\langle v,\zeta \rangle \). For this case, from \(dh_{\nu }({\overline{x}})(\zeta )={\textstyle \sum _{i\in {\overline{J}}}}\,\delta _{\{0\}}(\zeta _i)\), we have \(\zeta _{{\overline{J}}}=0\). Consequently,

$$\begin{aligned} d^2h_{\nu }({\overline{x}}\vert v)(\zeta )&=\liminf _{\tau \downarrow 0,\zeta '\rightarrow \zeta } \frac{h_{\nu }({\overline{x}}+\!\tau \zeta ')\!-\!h_{\nu }({\overline{x}}) -\tau \langle v_{{\overline{J}}},\zeta _{{\overline{J}}}'\rangle }{\frac{1}{2}\tau ^2}\\&=\liminf _{\tau \downarrow 0,\zeta _{{\overline{J}}}'\rightarrow \zeta _{{\overline{J}}}} \frac{\sum _{i\in {\overline{J}}}[\nu \,\text {sign}(\tau \vert \zeta _i'\vert )\!-\!\tau v_i\zeta _i']}{\frac{1}{2}\tau ^2}\ge 0. \end{aligned}$$

This along with (A5) and (A7) implies that \(d^2\Theta _{\nu ,\mu }({\overline{x}}\vert 0)(\zeta )>0\). \(\square \)
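To make the formula \({\widehat{\partial }}\Vert {\overline{x}}\Vert _0=\{v\in {\mathbb {R}}^p\,\vert \,v_i=0\ \textrm{for}\ i\in \textrm{supp}({\overline{x}})\}\) from [30, Theorem 1] concrete with a small worked example of our own: for \({\overline{x}}=(2,0)\in {\mathbb {R}}^2\) one has \(\textrm{supp}({\overline{x}})=\{1\}\), and hence

$$\begin{aligned} {\widehat{\partial }}\Vert {\overline{x}}\Vert _0=\{v\in {\mathbb {R}}^2\,\vert \,v_1=0\}=\{0\}\times {\mathbb {R}}. \end{aligned}$$

That is, regular subgradients of the zero-norm must vanish on the support, while the off-support components are unconstrained. This is precisely why the inclusion in part (iv) holds once \([w_{\rho }({\overline{x}})]_i=1\) for every \(i\in \textrm{supp}({\overline{x}})\): the factors \(1-[w_{\rho }({\overline{x}})]_i\) annihilate the supported components, while \((1-[w_{\rho }({\overline{x}})]_i)\partial \vert {\overline{x}}_i\vert \subseteq {\mathbb {R}}\) trivially for the remaining ones.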

Appendix B: Proofs of the results in Sect. 4.2

In this section, let \(x^*\) be the true vector in model (21), and for each \(k\in {\mathbb {N}}\) write

$$\begin{aligned} y^k\!{:}{=}Ax^k\!-b,\ \Delta x^k\!{:}{=}x^k-x^*\ \textrm{and}\ \xi ^k\!{:}{=}B_{k-1}(x^{k-1}\!- x^{k})+\delta ^{k-1}-\mu x^*. \end{aligned}$$
(B8)

By Assumption 2 and [50, Theorem 10.49], for any \({\overline{t}}\in {\mathbb {R}}\) we have \(\partial (\theta ^2)({\overline{t}})=2D^*\theta ({\overline{t}})(\theta ({\overline{t}}))\) where \(D^*\theta ({\overline{t}})\!:{\mathbb {R}}\rightrightarrows {\mathbb {R}}\) is the coderivative of \(\theta \) at \({\overline{t}}\). Together with [50, Proposition 9.24(b)], \(D^*\theta ({\overline{t}})(\theta ({\overline{t}}))=\partial (\theta ({\overline{t}})\theta )({\overline{t}})\). Thus,

$$\begin{aligned} \partial (\theta ^2)({\overline{t}}) =\left\{ \begin{array}{cl} \{0\} &{} \textrm{if}\ \theta ({\overline{t}})=0;\\ 2\theta ({\overline{t}})\partial \theta ({\overline{t}})&{}\textrm{otherwise} \end{array}\right. \quad \mathrm{for\ any}\ {\overline{t}}\in {\mathbb {R}}. \end{aligned}$$
(B9)
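As a sanity check of (B9), consider the illustrative assumption (ours, not the paper's) that \(\theta (t)=\vert t\vert \) were admissible in Assumption 2, so that \(\theta ^2(t)=t^2\) is strongly convex with modulus \(\tau =2\). Then (B9) returns

$$\begin{aligned} \partial (\theta ^2)({\overline{t}}) =\left\{ \begin{array}{cl} \{0\} &{} \textrm{if}\ {\overline{t}}=0;\\ 2\vert {\overline{t}}\vert \,\partial \vert {\overline{t}}\vert =\{2{\overline{t}}\}&{}\textrm{otherwise}, \end{array}\right. \end{aligned}$$

which coincides with the classical derivative \(2{\overline{t}}\) of \(t\mapsto t^2\) at every point; in other words, (B9) is the expected nonsmooth extension of the chain rule, departing from it only where \(\theta \) fails to be differentiable.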

By using (B9) and the above notation, we can establish the following lemma.

Lemma 8

Suppose that for a certain \(k\ge 1\) there exists an index set \(S^{k-1}\supseteq S^*\) satisfying \(\min _{i\in (S^{k-1})^c}v_i^{k-1}\ge {1}/{2}\). Let \({\mathcal {I}}{:}{=}\{i\in \{1,\ldots ,n\}\ \vert \ \varpi _i\ne 0\}\). Then, when \(\lambda \ge 16{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+8\Vert \xi ^k\Vert _\infty \), it holds that \(\Vert \Delta x^{k}_{\!(S^{k-1})^c}\Vert _1\le 3\Vert \Delta x^{k}_{\!S^{k-1}}\Vert _1\).

Proof

From \(x^*\in \textrm{dom}f\) and the definition of \(x^k\) in Step 2, it is not difficult to obtain

$$\begin{aligned}&f(x^*)+\frac{\mu }{2}\Vert x^*\Vert ^2+\lambda \langle v^{k-1},\vert x^*\vert \rangle +\frac{1}{2}\Vert x^*\!-\!x^{k-1}\Vert _{B_{k-1}}^2\\&\ge f(x^k)+\frac{\mu }{2}\Vert x^{k}\Vert ^2+\lambda \langle v^{k-1},\vert x^k\vert \rangle +\frac{1}{2}\Vert x^k\!-\!x^{k-1}\Vert _{B_{k-1}}^2\\&\quad +\frac{1}{2}\langle x^*\!-\!x^k,(\mu I\!+\!B_{k-1})(x^*\!-\!x^k)\rangle +\langle \delta ^{k-1},x^*\!-\!x^{k}\rangle , \end{aligned}$$

where the strong convexity of the objective function of (19) is used. After a suitable rearrangement of the last inequality, we obtain

$$\begin{aligned} f(x^{k})-f(x^*)+\mu \Vert \Delta x^k\Vert ^2 \le \lambda \langle v^{k-1},\vert x^*\vert -\vert x^{k}\vert \rangle +\langle \xi ^k, x^k\!-x^*\rangle . \end{aligned}$$
(B10)

For each \(k\in {\mathbb {N}}\), let \({\mathcal {J}}_k\!{:}{=}\big \{i\notin {\mathcal {I}}\,\vert \, y_i^{k}\ne 0\big \}\). By the expression of \(\vartheta \) and \(\varpi =b-Ax^*\),

$$\begin{aligned}&\vartheta (A x^{k}\!-b)-\vartheta (A x^*\!-b) \nonumber \\&=\frac{1}{n}\bigg [\sum _{i\in {\mathcal {J}}_k}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)} +\sum _{i\in {\mathcal {I}}}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)}\bigg ]\nonumber \\&\ge \frac{1}{n}\bigg [\sum _{i\in {\mathcal {J}}_k}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{{\widetilde{\tau }}\Vert y^{k}\Vert _{\infty }} +\sum _{i\in {\mathcal {I}}}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)}\bigg ]. \end{aligned}$$
(B11)

where the inequality holds because \(\theta (y_i)\le {\widetilde{\tau }}\Vert y\Vert _\infty \) for \(i=1,\ldots ,n\), which is implied by \(\theta (0)=0\) and (22). Fix any \(\eta _i\in \partial (\theta ^2)(\varpi _i)\). Since \(\theta ^2\) is strongly convex with modulus \(\tau \), we have

$$\begin{aligned} \theta ^2(y^{k}_i)-\theta ^2(\varpi _i) \ge \eta _i(y_i^k-\varpi _i) +0.5\tau (y^{k}_i-\varpi _i)^2 \ \ \textrm{for}\ \ i=1,\ldots ,n. \end{aligned}$$
(B12)

Along with (B9), for each \(i\in {\mathcal {J}}_k\) we have \(\eta _i=0\), so \(\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)\ge \frac{\tau }{2}(y^{k}_i-\varpi _i)^2\) and consequently,

$$\begin{aligned} \sum _{i\in {\mathcal {J}}_k}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{{\widetilde{\tau }}\Vert y^{k}\Vert _{\infty }} \ge \frac{\tau }{2{\widetilde{\tau }}} \sum _{i\in {\mathcal {J}}_k}\frac{(y^{k}_i-\varpi _i)^2}{\Vert y^{k}\Vert _{\infty }}. \end{aligned}$$

For each \(i\in {\mathcal {I}}\), write \({\widetilde{y}}_i^{k}\!{:}{=}\frac{\eta _i}{\theta (y_i^{k})+\theta (\varpi _i)}\). From (B9) and (22), it is not hard to obtain \(\vert {\widetilde{y}}_i^{k}\vert \le 2{\widetilde{\tau }}\) for all \(i\in {\mathcal {I}}\). Together with (B12), \(\varpi =b-Ax^*\) and \(\theta (y^{k}_i)\le {\widetilde{\tau }}\Vert y^k\Vert _\infty \),

$$\begin{aligned} \sum _{i\in {\mathcal {I}}}\frac{\theta ^2(y^{k}_i)-\theta ^2(\varpi _i)}{\theta (y^{k}_i)+\theta (\varpi _i)}&\ge \sum _{i\in {\mathcal {I}}}{\widetilde{y}}_i^k(y_i^k-\varpi _i)+\frac{\tau }{2} \sum _{i\in {\mathcal {I}}}\frac{(y^{k}_i-\varpi _i)^2}{\theta (y^{k}_i)+\theta (\varpi _i)}\\&\ge -2{\widetilde{\tau }}\Vert [A(x^k\!- x^*)]_{{\mathcal {I}}}\Vert _1+\frac{\tau }{2} \sum _{i\in {\mathcal {I}}}\frac{(y^{k}_i-\varpi _i)^2}{{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }\!+\!\Vert \varpi \Vert _\infty )}. \end{aligned}$$

Substituting the last two inequalities into (B11) and using the definition of f yields

$$\begin{aligned}{} & {} f(x^k)-f(x^*)=\vartheta (A x^{k}\!-b)-\vartheta (A x^*\!-b)\\{} & {} \ge -\frac{2{\widetilde{\tau }}}{n}\Vert [A(x^k\!- x^*)]_{{\mathcal {I}}}\Vert _1 +\frac{\tau \Vert A( x^k\!- x^*)\Vert ^2}{2n{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }\!+\!\Vert \varpi \Vert _\infty )}. \end{aligned}$$

Write \(\Upsilon ^k{:}{=}\frac{\tau \Vert A(x^k\!- x^*)\Vert ^2}{2n{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }+\Vert \varpi \Vert _\infty )}\). By combining this inequality and (B10), we get

$$\begin{aligned} \mu \Vert \Delta x^k\Vert ^2+\Upsilon ^k&\le \lambda \langle v^{k-1},\vert x^*\vert -\vert x^{k}\vert \rangle +2{\widetilde{\tau }}n^{-1}\big \Vert [A(x^k\!- x^*)]_{{\mathcal {I}}}\big \Vert _{1}+\langle \xi ^k, x^k\!- x^*\rangle \nonumber \\&\le \lambda \Big (\textstyle {\sum _{i\in S^*}}v_i^{k-1}\vert \Delta x_i^k\vert -\textstyle {\sum _{i\in (S^{k-1})^c}}v_i^{k-1}\vert \Delta x_i^k\vert \Big )\nonumber \\&+\big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big (\Vert \Delta x_{S^{k-1}}^k\Vert _{1}+\Vert \Delta x_{(S^{k-1})^{c}}^k\Vert _{1}\big ). \end{aligned}$$
(B13)

Since \(S^{k-1}\supseteq S^*\) and \(v_i^{k-1}\in [0.5,1]\) for \(i\in (S^{k-1})^{c}\), from the last inequality we have

$$\begin{aligned} \mu \Vert \Delta x^k\Vert ^2+\Upsilon ^k&\le \textstyle {\sum _{i\in S^{k-1}}}\big (\lambda v_i^{k-1} +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty \big ) \vert \Delta x_i^k\vert \\&\quad +\textstyle {\sum _{i\in (S^{k-1})^c}} \big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty -\lambda /2\big )\vert \Delta x_i^k\vert \\&\le \big (\lambda +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty \big ) \big \Vert \Delta x_{\!S^{k-1}}^k\big \Vert _1\\&\quad +\big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty \!-0.5\lambda \big ) \big \Vert \Delta x_{\!(S^{k-1})^{c}}^k\big \Vert _1. \end{aligned}$$

From the nonnegativity of the left hand side and the given assumption on \(\lambda \), we have

$$\begin{aligned} \big \Vert \Delta x_{(S^{k-1})^{c}}^k\big \Vert _1 \le \frac{\lambda +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _\infty }{0.5\lambda -2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1-\Vert \xi ^k\Vert _\infty } \big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1 \le 3\big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1. \end{aligned}$$

This implies that the desired result holds. The proof is completed. \(\square \)

By invoking (B13) and Lemma 8, we can obtain the following conclusion.

Lemma 9

Suppose that \(A^{{\mathbb {T}}}A/n\) satisfies the RE condition with parameter \(\kappa >0\) on \({\mathcal {C}}(S^*)\), and that for some \(k\ge 1\) there is an index set \(S^{k-1}\supseteq S^*\) with \(\vert S^{k-1}\vert \le 1.5s^*\) such that \(\min _{i\in (S^{k-1})^c}v_i^{k-1}\ge \frac{1}{2}\). Let \({\mathcal {I}}{:}{=}\{i\ \vert \ \varpi _i\ne 0\}\). If \(\lambda \) is chosen such that \(16{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+8\Vert \xi ^k\Vert _\infty \le \lambda <\frac{2\mu {\widetilde{\tau }}\Vert \varpi \Vert _\infty +\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty } (2{\widetilde{\tau }}n^{-1}|\!\Vert A_{{\mathcal {I}}\cdot }|\!\Vert _1+\Vert \xi ^k\Vert _\infty )\vert S^{k-1}\vert }{4{\widetilde{\tau }}\Vert A\Vert _{\infty }\Vert v_{\!S^*}^{k-1}\Vert _{\infty }\vert S^{k-1}\vert }\), then

$$\begin{aligned} \big \Vert \Delta x^{k}\big \Vert \le \frac{2{\widetilde{\tau }}\Vert \varpi \Vert _{\infty }\big (\lambda \Vert v_{\!S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \sqrt{\vert S^{k-1}\vert }}{2\mu {\widetilde{\tau }}\Vert \varpi \Vert _{\infty }+\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty }\big (\lambda \Vert v_{\!S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big )\vert S^{k-1}\vert }. \end{aligned}$$

Proof

Note that \(\Vert y^k\Vert _\infty +\Vert \varpi \Vert _\infty =\Vert \varpi -\!A\Delta x^k\Vert _{\infty }+\Vert \varpi \Vert _\infty \le \Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty \). Then

$$\begin{aligned} \frac{\tau \Vert A(x^k-x^*)\Vert ^2}{2n{\widetilde{\tau }}(\Vert y^{k}\Vert _{\infty }+\Vert \varpi \Vert _\infty )} \ge \frac{\tau \Vert A\Delta x^k\Vert ^2}{2n{\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty )} {:}{=}{\widetilde{\Upsilon }}^k. \end{aligned}$$

Together with inequality (B13) and \(v_i^{k-1}\in [0.5,1]\) for \(i\in (S^{k-1})^{c}\), it follows that

$$\begin{aligned} \mu \Vert \Delta x^k\Vert ^2+{\widetilde{\Upsilon }}^k&\le \lambda {\textstyle \sum _{i\in S^*}}v_i^{k-1}\vert \Delta x_i^k\vert -({\lambda }/{2}){\textstyle \sum _{i\in (S^{k-1})^c}}\vert \Delta x_i^k\vert \\&\quad +\big (2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big (\Vert \Delta x_{\!S^{k-1}}^k\Vert _{1}+\Vert \Delta x_{\!(S^{k-1})^{c}}^k\Vert _{1}\big ) \\&\le \big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty }+2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1 +\Vert \xi ^k\Vert _{\infty }\big )\Vert \Delta x_{\!S^{k-1}}^k\Vert _{1} \end{aligned}$$

where the second inequality is due to \(\lambda \ge 16{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+8\Vert \xi ^k\Vert _\infty \). By Lemma 8, \(\Vert \Delta x^{k}_{(S^{k-1})^c}\Vert _1\le 3\Vert \Delta x^{k}_{\!S^{k-1}}\Vert _1\), which means that \(\Delta x^{k}\in {\mathcal {C}}(S^*)\). From the assumption on \(\frac{1}{n}A^{{\mathbb {T}}}A\), we have \(\Vert A\Delta x^k\Vert ^2\ge 2n\kappa \Vert \Delta x^k\Vert ^2\). Then, it holds that

$$\begin{aligned}{} & {} \mu \Vert \Delta x^k\Vert ^2+\frac{\tau \kappa \Vert \Delta x^k\Vert ^2}{{\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }\!+2\Vert \varpi \Vert _\infty )}\\{} & {} \le \Big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty }+\frac{2{\widetilde{\tau }}}{n}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1 +\Vert \xi ^k\Vert _{\infty }\Big )\big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1. \end{aligned}$$

Multiplying both sides of this inequality by \({\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty )\) yields

$$\begin{aligned}&\big [\mu {\widetilde{\tau }}(\Vert A\Delta x^k\Vert _{\infty }+2\Vert \varpi \Vert _\infty )+\tau \kappa \big ]\Vert \Delta x^k\Vert ^2\\&\le {\widetilde{\tau }}\Vert A\Delta x^k\Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1\\&\quad + 2{\widetilde{\tau }}\Vert \varpi \Vert _\infty \big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \big \Vert \Delta x_{S^{k-1}}^k\big \Vert _1. \end{aligned}$$

Note that \(\Vert A\Delta x^k\Vert _{\infty }\le \Vert A\Vert _{\infty }\Vert \Delta x^k\Vert _1\). Together with \(\Vert \Delta x^{k}_{(S^{k-1})^c}\Vert _1\le 3\Vert \Delta x^{k}_{S^{k-1}}\Vert _1\), this gives \(\Vert A\Delta x^k\Vert _{\infty }\le 4\Vert A\Vert _{\infty }\Vert \Delta x_{S^{k-1}}^k\Vert _1\), so the right hand side of the last inequality satisfies

$$\begin{aligned} \textrm{RHS}&\le 4{\widetilde{\tau }}\Vert A\Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \vert S^{k-1}\vert \big \Vert \Delta x_{S^{k-1}}^k\big \Vert ^2\\&\quad +2{\widetilde{\tau }}\Vert \varpi \Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \sqrt{\vert S^{k-1}\vert }\big \Vert \Delta x_{S^{k-1}}^k\big \Vert . \end{aligned}$$

From the last two inequalities, a suitable rearrangement yields that

$$\begin{aligned}&\Big [2\mu {\widetilde{\tau }}\Vert \varpi \Vert _{\infty }+\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big )\vert S^{k-1}\vert \Big ] \Vert \Delta x^k\Vert ^2\\&\le 2{\widetilde{\tau }}\Vert \varpi \Vert _{\infty }\big (\lambda \Vert v_{S^*}^{k-1}\Vert _{\infty } +2{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A_{{\mathcal {I}}\cdot }\!|\!\Vert _1+\Vert \xi ^k\Vert _{\infty }\big ) \sqrt{\vert S^{k-1}\vert }\big \Vert \Delta x_{S^{k-1}}^k\big \Vert , \end{aligned}$$

which along with \(\lambda <\frac{2\mu {\widetilde{\tau }}\Vert \varpi \Vert _\infty +\tau \kappa -4{\widetilde{\tau }}\Vert A\Vert _{\infty } (2{\widetilde{\tau }}n^{-1}|\!\Vert A_{{\mathcal {I}}\cdot }|\!\Vert _1+\Vert \xi ^k\Vert _\infty )\vert S^{k-1}\vert }{4{\widetilde{\tau }}\Vert A\Vert _{\infty }\Vert v_{S^*}^{k-1}\Vert _{\infty }\vert S^{k-1}\vert }\) implies the desired result. The proof is then completed. \(\square \)

B.1 Proof of Proposition 6

Let \(\Delta x^{0}{:}{=}x^0-x^*\). From \(x^*\in \textrm{dom}f\) and the strong convexity of the objective function of (20),

$$\begin{aligned}&f(x^*)+{\widetilde{\lambda }}\Vert x^*\Vert _1+\frac{{\widetilde{\gamma }}_{1,0}}{2}\Vert x^*\Vert ^2 +\frac{{\widetilde{\gamma }}_{2,0}}{2}\Vert Ax^*\Vert ^2\\&\ge f(x^0)+{\widetilde{\lambda }}\Vert x^0\Vert _1+\frac{{\widetilde{\gamma }}_{1,0}}{2}\Vert x^0\Vert ^2 +\frac{{\widetilde{\gamma }}_{2,0}}{2}\Vert Ax^0\Vert ^2 \\&\quad +\langle {\widetilde{\delta }}^0, x^*\!-\! x^0\rangle +\frac{1}{2}\langle (x^*\!-\!x^0), ({\widetilde{\gamma }}_{1,0}I\!+\!{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A)(x^*\!-\!x^0)\rangle . \end{aligned}$$

From \(\vartheta (z)=\frac{1}{n}\sum _{i=1}^n\theta (z_i)\) and Assumption 2, \(f(x^*)-f(x^0)\le \frac{{\widetilde{\tau }}}{n}\Vert A(x^*\!-\!x^0)\Vert _1\). Notice that \( \Vert x^0\Vert ^2-\Vert x^*\Vert ^2 =\Vert x^0\!-\!x^*\Vert ^2+2\langle x^0-x^*,x^*\rangle \). Together with the last inequality and \(\Vert {\widetilde{\delta }}^0\Vert _\infty \le {\widetilde{\epsilon }}_0\), it follows that

$$\begin{aligned}&\Vert \Delta x^0\Vert _{{\widetilde{\gamma }}_{1,0}I+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A}^2 \le {\widetilde{\lambda }}(\Vert x^*\Vert _1\!-\Vert x^{0}\Vert _1)+n^{-1}{{\widetilde{\tau }}}\Vert A(x^*\!-\!x^0)\Vert _1\\&\qquad \qquad \qquad \qquad \quad +\langle x^0\!-x^*, {\widetilde{\delta }}^0\!+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}Ax^*\!-{\widetilde{\gamma }}_{1,0}x^*\rangle \\&\le \big ({\widetilde{\lambda }} +{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A\!|\!\Vert _1+{\widetilde{\gamma }}_{1,0} \Vert x^*\Vert _{\infty }+{\widetilde{\gamma }}_{2,0}\Vert A^{{\mathbb {T}}}Ax^*\Vert _{\infty }+{\widetilde{\epsilon }}_0\big ) \Vert \Delta x_{S^{*}}^0\Vert _1 \\&\quad +\big ({\widetilde{\tau }}n^{-1}\!|\!\Vert \!A\!|\!\Vert _1+{\widetilde{\gamma }}_{1,0} \Vert x^*\Vert _{\infty }+{\widetilde{\gamma }}_{2,0}\Vert A^{{\mathbb {T}}}Ax^*\Vert _{\infty }+{\widetilde{\epsilon }}_0-{\widetilde{\lambda }}\big ) \Vert \Delta x_{(S^{*})^{c}}^0\Vert _1. \end{aligned}$$

By the assumption on \({\widetilde{\lambda }}\) and the nonnegativity of \(\Vert \Delta x^0\Vert _{{\widetilde{\gamma }}_{1,0}I+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A}^2\), we get \(\Vert \Delta x_{(S^{*})^{c}}^0\Vert _1\le 3\Vert \Delta x_{S^{*}}^0\Vert _1\). Substituting this into the last inequality yields

$$\begin{aligned}&\Vert \Delta x^0\Vert _{{\widetilde{\gamma }}_{1,0}I+{\widetilde{\gamma }}_{2,0}A^{{\mathbb {T}}}A}^2\\&\le \big ({\widetilde{\lambda }} +{\widetilde{\tau }}n^{-1}\!|\!\Vert \!A\!|\!\Vert _1+{\widetilde{\gamma }}_{1,0} \Vert x^*\Vert _{\infty }+{\widetilde{\gamma }}_{2,0}\Vert A^{{\mathbb {T}}}Ax^*\Vert _{\infty }+{\widetilde{\epsilon }}_0\big ) \big \Vert \Delta x_{S^{*}}^0\big \Vert _1\\&\le \frac{3{\widetilde{\lambda }}\sqrt{s^*}}{2}\big \Vert \Delta x^0\big \Vert \end{aligned}$$

which implies that the desired conclusion holds. The proof is completed. \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, D., Pan, S., Bi, S. et al. Zero-norm regularized problems: equivalent surrogates, proximal MM method and statistical error bound. Comput Optim Appl 86, 627–667 (2023). https://doi.org/10.1007/s10589-023-00496-x
