
Low-rank matrix estimation via nonconvex optimization methods in multi-response errors-in-variables regression

Published in: Journal of Global Optimization

Abstract

Noisy and missing data cannot be avoided in real applications such as bioinformatics, economics and remote sensing. Existing methods mainly focus on linear errors-in-variables regression; relatively little attention has been paid to the case of multivariate responses, and how to achieve consistent estimation under corrupted covariates is still an open question. In this paper, a nonconvex error-corrected estimator is proposed for the matrix estimation problem in the multi-response errors-in-variables regression model. Statistical and computational properties of the global solutions of the estimator are analysed. On the statistical side, a nonasymptotic recovery bound is established for all global solutions of the nonconvex estimator. On the computational side, the proximal gradient method is applied to solve the nonconvex optimization problem and is proved to converge linearly to a near-global solution. Sufficient conditions are verified in order to obtain probabilistic consequences for specific types of measurement errors by virtue of random matrix analysis. Finally, simulation results on synthetic and real neuroimaging data demonstrate the theoretical properties and show nice consistency under high-dimensional scaling.


References

  1. Agarwal, A., Negahban, S., Wainwright, M.J.: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5), 2452–2482 (2012)

  2. Agarwal, A., Negahban, S., Wainwright, M.J.: Noisy matrix decomposition via convex relaxation: optimal rates in high dimensions. Ann. Stat. 40(2), 1171–1197 (2012)

  3. Agarwal, A., Negahban, S.N., Wainwright, M.J.: Supplementary material: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5), 31–60 (2012)

  4. Alquier, P., Bertin, K., Doukhan, P., Garnier, R.: High-dimensional VAR with low-rank transition. Stat. Comput. 1–15 (2020)

  5. Annaliza, M., Khalili, A., Stephens, D.A.: Estimating sparse networks with hubs. J. Multivar. Anal. 179, 104655 (2020)

  6. Barch, D.M., Burgess, G.C., Harms, M.P., Petersen, S.E., Consortium, W.M.H.: Function in the human connectome: task-fMRI and individual differences in behavior. Neuroimage 80, 169–189 (2013)

  7. Belloni, A., Rosenbaum, M., Tsybakov, A.B.: An \(\ell _1, \ell _2, \ell _\infty \)-regularization approach to high-dimensional errors-in-variables models. Electron. J. Stat. 10(2), 1729–1750 (2016)

  8. Belloni, A., Rosenbaum, M., Tsybakov, A.B.: Linear and conic programming estimators in high dimensional errors-in-variables models. J. R. Stat. Soc. Ser. B Stat Methodol. 79(3), 939–956 (2017)

  9. Bickel, P.J., Ritov, Y.: Efficient estimation in the errors in variables model. Ann. Stat. 15(2), 513–540 (1987)

  10. Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

  11. Candès, E.J., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35(6), 2313–2351 (2007)

  12. Candès, E.J., Tao, T.: The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inf. Theory 56(5), 2053–2080 (2010)

  13. Carroll, R.J., Ruppert, D., Stefanski, L.A., Crainiceanu, C.M.: Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, Cambridge (2006)

  14. Chen, H., Raskutti, G., Yuan, M.: Non-convex projected gradient descent for generalized low-rank tensor regression. J. Mach. Learn. Res. 20, 1–37 (2019)

  15. Chen, Y., Luo, Z., Kong, L.C.: \(\ell _{2,0}\)-norm based selection and estimation for multivariate generalized linear models. J. Multivar. Anal. 185, 104782 (2021)

  16. Datta, A., Zou, H.: CoCoLasso for high-dimensional error-in-variables regression. Ann. Stat. 45(6), 2400–2426 (2017)

  17. Han, R.G., Willett, R., Zhang, A.R.: An optimal statistical and computational framework for generalized tensor estimation. Ann. Stat. 50(1), 1–29 (2022)

  18. Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning. Springer, New York (2008)

  19. Li, M.Y., Li, R.Z., Ma, Y.Y.: Inference in high dimensional linear measurement error models. J. Multivar. Anal. 184, 104759 (2021)

  20. Li, X., Wu, D.Y., Cui, Y., Liu, B., Walter, H., Schumann, G., Li, C., Jiang, T.Z.: Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies. BMC Bioinform. 20(1), 219 (2019)

  21. Li, X., Wu, D.Y., Li, C., Wang, J.H., Yao, J.C.: Sparse recovery via nonconvex regularized M-estimators over \(\ell _q\)-balls. Comput. Stat. Data Anal. 152, 107047 (2020)

  22. Loh, P.L., Wainwright, M.J.: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1637–1664 (2012)

  23. Loh, P.L., Wainwright, M.J.: Supplementary material: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1–21 (2012)

  24. Loh, P.L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16(1), 559–616 (2015)

  25. Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2), 1069–1097 (2011)

  26. Negahban, S., Wainwright, M.J.: Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res. 13(1), 1665–1697 (2012)

  27. Nesterov, Y.: Gradient methods for minimizing composite objective function. Tech. rep. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE) (2007)

  28. Raskutti, G., Yuan, M., Chen, H.: Convex regularization for high-dimensional multiresponse tensor regression. Ann. Stat. 47(3), 1554–1584 (2019)

  29. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)

  30. Rosenbaum, M., Tsybakov, A.B.: Sparse recovery under matrix uncertainty. Ann. Stat. 38(5), 2620–2651 (2010)

  31. Rosenbaum, M., Tsybakov, A.B.: Improved matrix uncertainty selector. In: From Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in Honor of Jon A. Wellner, pp. 276–290. Institute of Mathematical Statistics (2013)

  32. Sørensen, Ø., Frigessi, A., Thoresen, M.: Measurement error in LASSO: impact and likelihood bias correction. Stat. Sinica 25, 809–829 (2015)

  33. Sørensen, Ø., Hellton, K.H., Frigessi, A., Thoresen, M.: Covariate selection in high-dimensional generalized linear models with measurement error. J. Comput. Graph. Stat. 27(4), 739–749 (2018)

  34. Wainwright, M.J.: High-Dimensional Statistics: A Non-asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019)

  35. Wang, Z., Paterlini, S., Gao, F.C., Yang, Y.H.: Adaptive minimax regression estimation over sparse \(\ell _q\)-hulls. J. Mach. Learn. Res. 15(1), 1675–1711 (2014)

  36. Wu, D.Y., Li, X., Feng, J.: Connectome-based individual prediction of cognitive behaviors via graph propagation network reveals directed brain network topology. J. Neural Eng. 18(4) (2021)

  37. Wu, J., Zheng, Z.M., Li, Y., Zhang, Y.: Scalable interpretable learning for multi-response error-in-variables regression. J. Multivar. Anal. 104644 (2020)

  38. Zhou, H., Li, L., Zhu, H.T.: Tensor regression with applications in neuroimaging data analysis. J. Am. Stat. Assoc. 108(502), 540–552 (2013)

  39. Zhou, H., Li, L.X.: Regularized matrix regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 76(2), 463–483 (2014)

Acknowledgements

Xin Li’s work was supported in part by the National Natural Science Foundation of China (Grant No. 12201496) and the Natural Science Foundation of Shaanxi Province of China (Grant No. 2022JQ-045). Dongya Wu’s work was supported in part by the National Natural Science Foundation of China (Grant No. 62103329).

Author information

Correspondence to Xin Li or Dongya Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Theorem 1

Set \(\hat{\varDelta }:=\hat{\varTheta }-\varTheta ^*\). By the feasibility of \(\varTheta ^*\) and optimality of \(\hat{\varTheta }\), one has that \(\Psi (\hat{\varTheta })\le \Psi (\varTheta ^*)\). Then it follows from elementary algebra and the triangle inequality that

$$\begin{aligned} \frac{1}{2}\langle \langle \hat{\varGamma }\hat{\varDelta },\hat{\varDelta } \rangle \rangle\le & {} \langle \langle \hat{\varUpsilon }-\hat{\varGamma }\varTheta ^*,\hat{\varDelta } \rangle \rangle + \lambda {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*-\lambda {\left| \left| \left| \varTheta ^*+\hat{\varDelta } \right| \right| \right| }_*\\\le & {} \langle \langle \hat{\varUpsilon }-\hat{\varGamma }\varTheta ^*,\hat{\varDelta } \rangle \rangle +\lambda {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*. \end{aligned}$$

Applying Hölder’s inequality together with the deviation condition (14), one has that

$$\begin{aligned} \langle \langle \hat{\varUpsilon }-\hat{\varGamma }\varTheta ^*,\hat{\varDelta } \rangle \rangle \le \phi ({\mathbb {Q}},\sigma _\epsilon )\sqrt{\frac{\max (d_1,d_2)}{n}}{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*. \end{aligned}$$

Combining the above two inequalities, and noting (21), we obtain that

$$\begin{aligned} \langle \langle \hat{\varGamma }\hat{\varDelta },\hat{\varDelta } \rangle \rangle \le 3\lambda {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*. \end{aligned}$$
(A.1)

Applying the RSC condition (12) to the left-hand side of (A.1) yields that

$$\begin{aligned} \alpha _1{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2-\tau {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*^2\le 3\lambda {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*. \end{aligned}$$
(A.2)

On the other hand, by assumptions (20) and (21) and noting the fact that \({\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*\le {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*+{\left| \left| \left| \hat{\varTheta } \right| \right| \right| }_*\le 2\omega \), the left-hand side of (A.2) is lower bounded as

$$\begin{aligned} \alpha _1{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2-\tau {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*^2 \ge \alpha _1{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2-2\tau \omega {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_* \ge \alpha _1{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2-\lambda {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*. \end{aligned}$$

Combining this inequality with (A.2), one has that

$$\begin{aligned} \alpha _1{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2 \le 4\lambda {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*. \end{aligned}$$
(A.3)

Then it follows from Lemma 1 that there exists a matrix \(\hat{\varDelta }'\) such that

$$\begin{aligned} {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*\le 4{\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_*+4\sum _{j=r+1}^d\sigma _j(\varTheta ^*)\le 4\sqrt{2r}{\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_F+4\sum _{j=r+1}^d\sigma _j(\varTheta ^*), \end{aligned}$$
(A.4)

where \(\text {rank}(\hat{\varDelta }')\le 2r\) with r to be chosen later, and the second inequality is due to the fact that \({\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_*\le \sqrt{2r}{\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_F\). Combining (A.3) and (A.4), we obtain that

$$\begin{aligned} \alpha _1{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2 \le 16\lambda \left( \sqrt{2r}{\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_F+\sum _{j=r+1}^d\sigma _j(\varTheta ^*)\right) \le 16\lambda \left( \sqrt{2r}{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F+\sum _{j=r+1}^d\sigma _j(\varTheta ^*)\right) . \end{aligned}$$

Then, since \(16\sqrt{2r}\lambda {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F\le \frac{\alpha _1}{2}{\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2+\frac{256r\lambda ^2}{\alpha _1}\) by Young's inequality, it follows that

$$\begin{aligned} {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F^2\le \frac{512r\lambda ^2+32\alpha _1\lambda \sum _{j=r+1}^d\sigma _j(\varTheta ^*)}{\alpha _1^2}. \end{aligned}$$
(A.5)

Recall the set \(K_\eta \) defined in (16) and set \(r=|K_\eta |\). Combining (A.5) with (18) and setting \(\eta =\frac{\lambda }{\alpha _1}\), we arrive at (22). Moreover, it follows from (A.4) that (23) holds. The proof is complete.
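To make the rate explicit in the simplest case, suppose that \(\varTheta ^*\) is exactly low-rank with \(\text {rank}(\varTheta ^*)=r\). Then the tail sum \(\sum _{j=r+1}^d\sigma _j(\varTheta ^*)\) vanishes and (A.5) specializes to the following worked consequence, using the smallest admissible regularization \(\lambda \asymp \phi ({\mathbb {Q}},\sigma _\epsilon )\sqrt{\max (d_1,d_2)/n}\) permitted by (21):

$$\begin{aligned} {\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_F\le \frac{16\sqrt{2r}\lambda }{\alpha _1}\lesssim \frac{\phi ({\mathbb {Q}},\sigma _\epsilon )}{\alpha _1}\sqrt{\frac{r\max (d_1,d_2)}{n}}. \end{aligned}$$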

B Proof of Theorem 2

Before providing the proof of Theorem 2, we first establish several useful lemmas.

Lemma B.1

Suppose that the conditions of Theorem 2 are satisfied, and that there exists a pair \((\delta , T)\) such that (28) holds. Then for any iteration \(t\ge T\), it holds that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*&\le 4\sqrt{2}\lambda ^{-\frac{q}{2}} R_q^{\frac{1}{2}}{\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F+\bar{\epsilon }_{stat }+\epsilon (\delta ), \end{aligned} \end{aligned}$$

where \(\bar{\epsilon }_{stat }\) and \(\epsilon (\delta )\) are defined respectively in (25) and (29).

Proof

We first prove that if \(\lambda \ge 4\phi ({\mathbb {Q}},\sigma _\epsilon )\sqrt{\frac{\max (d_1,d_2)}{n}}\), then for any \(\varTheta \in {\mathbb {S}}\) satisfying

$$\begin{aligned} \Psi (\varTheta )-\Psi (\varTheta ^*)\le \delta , \end{aligned}$$
(B.1)

it holds that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| \varTheta -\varTheta ^* \right| \right| \right| }_*&\le 4\sqrt{2}\lambda ^{-\frac{q}{2}}R_q^{\frac{1}{2}}{\left| \left| \left| \varTheta -\varTheta ^* \right| \right| \right| }_F+ 4\lambda ^{1-q}R_q+2\min \left( \frac{\delta }{\lambda },\omega \right) . \end{aligned} \end{aligned}$$
(B.2)

Set \(\varDelta :=\varTheta -\varTheta ^*\). From (B.1), one has that

$$\begin{aligned} {\mathcal {L}}(\varTheta ^*+\varDelta )+\lambda {\left| \left| \left| \varTheta ^*+\varDelta \right| \right| \right| }_*\le {\mathcal {L}}(\varTheta ^*)+\lambda {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*+\delta . \end{aligned}$$

Then subtracting \(\langle \langle \nabla {\mathcal {L}}(\varTheta ^*),\varDelta \rangle \rangle \) from both sides of the former inequality and recalling the formulation of \({\mathcal {L}}(\cdot )\), we obtain that

$$\begin{aligned} \frac{1}{2}\langle \langle \hat{\varGamma }\varDelta ,\varDelta \rangle \rangle +\lambda {\left| \left| \left| \varTheta ^*+\varDelta \right| \right| \right| }_*-\lambda {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le -\langle \langle \hat{\varGamma }\varTheta ^*-\hat{\varUpsilon },\varDelta \rangle \rangle +\delta . \end{aligned}$$
(B.3)

We now claim that

$$\begin{aligned} \lambda {\left| \left| \left| \varTheta ^*+\varDelta \right| \right| \right| }_*-\lambda {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le \frac{\lambda }{2}{\left| \left| \left| \varDelta \right| \right| \right| }_*+\delta . \end{aligned}$$
(B.4)

In fact, combining (B.3) with the RSC condition (12) and Hölder’s inequality, one has that

$$\begin{aligned} \frac{1}{2}\{\alpha _1{\left| \left| \left| \varDelta \right| \right| \right| }_F^2-\tau {\left| \left| \left| \varDelta \right| \right| \right| }_*^2\}+\lambda {\left| \left| \left| \varTheta ^*+\varDelta \right| \right| \right| }_*-\lambda {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le {\left| \left| \left| \hat{\varUpsilon }-\hat{\varGamma }\varTheta ^* \right| \right| \right| }_{\text {op}}{\left| \left| \left| \varDelta \right| \right| \right| }_*+\delta . \end{aligned}$$

This inequality, together with the deviation condition (14) and the assumption that \(\lambda \ge 4\phi ({\mathbb {Q}},\sigma _\epsilon )\sqrt{\frac{\max (d_1,d_2)}{n}}\), implies that

$$\begin{aligned} \frac{1}{2}\{\alpha _1{\left| \left| \left| \varDelta \right| \right| \right| }_F^2-\tau {\left| \left| \left| \varDelta \right| \right| \right| }_*^2\} +\lambda {\left| \left| \left| \varTheta ^*+\varDelta \right| \right| \right| }_*-\lambda {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le \frac{\lambda }{4}{\left| \left| \left| \varDelta \right| \right| \right| }_*+\delta . \end{aligned}$$

Noting the facts that \(\alpha _1>0\) and that \({\left| \left| \left| \varDelta \right| \right| \right| }_*\le {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*+{\left| \left| \left| \varTheta \right| \right| \right| }_*\le 2\omega \), one arrives at (B.4) by combining assumptions (30) and (31). On the other hand, it follows from Lemma 1(i) that there exist two matrices \(\varDelta '\) and \(\varDelta ''\) such that \(\varDelta =\varDelta '+\varDelta ''\), where \(\text {rank}(\varDelta ')\le 2r\) with r to be chosen later. Recall the definitions of \({\mathbb {A}}^r\) and \({\mathbb {B}}^r\) given respectively in (15a) and (15b). Then the decomposition \(\varTheta ^*=\Pi _{{\mathbb {A}}^r}(\varTheta ^*)+\Pi _{{\mathbb {B}}^r}(\varTheta ^*)\) holds. This equality, together with the triangle inequality and Lemma 1(i), implies that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| \varTheta \right| \right| \right| }_*&={\left| \left| \left| (\Pi _{{\mathbb {A}}^r}(\varTheta ^*)+\varDelta '')+(\Pi _{{\mathbb {B}}^r}(\varTheta ^*)+\varDelta ') \right| \right| \right| }_*\\&\ge {\left| \left| \left| \Pi _{{\mathbb {A}}^r}(\varTheta ^*)+\varDelta '' \right| \right| \right| }_*-{\left| \left| \left| \Pi _{{\mathbb {B}}^r}(\varTheta ^*)+\varDelta ' \right| \right| \right| }_*\\&\ge {\left| \left| \left| \Pi _{{\mathbb {A}}^r}(\varTheta ^*) \right| \right| \right| }_*+{\left| \left| \left| \varDelta '' \right| \right| \right| }_*-\{{\left| \left| \left| \Pi _{{\mathbb {B}}^r}(\varTheta ^*) \right| \right| \right| }_*+{\left| \left| \left| \varDelta ' \right| \right| \right| }_*\}. \end{aligned} \end{aligned}$$

Consequently, we have

$$\begin{aligned} {\left| \left| \left| {\varTheta }^* \right| \right| \right| }_*-{\left| \left| \left| \varTheta \right| \right| \right| }_*\le & {} {\left| \left| \left| {\varTheta }^* \right| \right| \right| }_*-{\left| \left| \left| \Pi _{{\mathbb {A}}^r}(\varTheta ^*) \right| \right| \right| }_*- {\left| \left| \left| \varDelta '' \right| \right| \right| }_*+\{{\left| \left| \left| \Pi _{{\mathbb {B}}^r}(\varTheta ^*) \right| \right| \right| }_*+{\left| \left| \left| \varDelta ' \right| \right| \right| }_*\}\nonumber \\\le & {} 2{\left| \left| \left| \Pi _{{\mathbb {B}}^r}(\varTheta ^*) \right| \right| \right| }_*+{\left| \left| \left| \varDelta ' \right| \right| \right| }_*-{\left| \left| \left| \varDelta '' \right| \right| \right| }_*. \end{aligned}$$
(B.5)

Combining (B.5) and (B.4) and noting the fact that \({\left| \left| \left| \Pi _{{\mathbb {B}}^r}(\varTheta ^*) \right| \right| \right| }_*=\sum _{j=r+1}^{d}\sigma _j(\varTheta ^*)\), one has that \(0\le \frac{3\lambda }{2}{\left| \left| \left| \varDelta ' \right| \right| \right| }_*-\frac{\lambda }{2}{\left| \left| \left| \varDelta '' \right| \right| \right| }_*+2\lambda \sum _{j=r+1}^{d}\sigma _j(\varTheta ^*)+\delta \), and consequently, \({\left| \left| \left| \varDelta '' \right| \right| \right| }_*\le 3{\left| \left| \left| \varDelta ' \right| \right| \right| }_*+4\sum _{j=r+1}^{d}\sigma _j(\varTheta ^*)+\frac{2\delta }{\lambda }\). Using the trivial bound \({\left| \left| \left| \varDelta \right| \right| \right| }_*\le 2\omega \), one has that

$$\begin{aligned} {\left| \left| \left| \varDelta \right| \right| \right| }_*\le 4\sqrt{2r}{\left| \left| \left| \varDelta \right| \right| \right| }_F+4\sum _{j=r+1}^d\sigma _j(\varTheta ^*)+2\min \left( \frac{\delta }{\lambda },\omega \right) . \end{aligned}$$
(B.6)

Recall the set \(K_\eta \) defined in (16) and set \(r=|K_\eta |\). Combining (B.6) with (18) and setting \(\eta =\lambda \), we arrive at (B.2). We now verify that (B.1) holds for both \(\hat{\varTheta }\) and \(\varTheta ^t\). Since \(\hat{\varTheta }\) is the optimal solution, it holds that \(\Psi (\hat{\varTheta })\le \Psi (\varTheta ^*)\), and by assumption (28), it holds that \(\Psi (\varTheta ^t)\le \Psi (\hat{\varTheta })+\delta \le \Psi (\varTheta ^*)+\delta \). Consequently, it follows from (B.2) that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| \hat{\varTheta }-\varTheta ^* \right| \right| \right| }_*&\le 4\sqrt{2}\lambda ^{-\frac{q}{2}} R_q^{\frac{1}{2}}{\left| \left| \left| \hat{\varTheta }-\varTheta ^* \right| \right| \right| }_F+4\lambda ^{1-q}R_q,\\ {\left| \left| \left| \varTheta ^t-\varTheta ^* \right| \right| \right| }_*&\le 4\sqrt{2}\lambda ^{-\frac{q}{2}}R_q^{\frac{1}{2}}{\left| \left| \left| \varTheta ^t-\varTheta ^* \right| \right| \right| }_F +4\lambda ^{1-q}R_q+2\min \left( \frac{\delta }{\lambda },\omega \right) . \end{aligned} \end{aligned}$$

By the triangle inequality, we then obtain that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*&\le {\left| \left| \left| \hat{\varTheta }-\varTheta ^* \right| \right| \right| }_*+{\left| \left| \left| \varTheta ^t-\varTheta ^* \right| \right| \right| }_*\\&\le 4\sqrt{2}\lambda ^{-\frac{q}{2}}R_q^{\frac{1}{2}}({\left| \left| \left| \hat{\varTheta }-\varTheta ^* \right| \right| \right| }_F +{\left| \left| \left| \varTheta ^t-\varTheta ^* \right| \right| \right| }_F)+8\lambda ^{1-q}R_q+2\min \left( \frac{\delta }{\lambda },\omega \right) \\&\le 4\sqrt{2}\lambda ^{-\frac{q}{2}}R_q^{\frac{1}{2}}{\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F +\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ). \end{aligned} \end{aligned}$$

The proof is complete. \(\square \)

Lemma B.2

Suppose that the conditions of Theorem 2 are satisfied, and that there exists a pair \((\delta , T)\) such that (28) holds. Then for any iteration \(t\ge T\), we have that

$$\begin{aligned}&{\mathcal {L}}(\hat{\varTheta })-{\mathcal {L}}(\varTheta ^t)-\langle \langle \nabla {\mathcal {L}}(\varTheta ^t),\hat{\varTheta }-\varTheta ^t \rangle \rangle \ge -\tau (\bar{\epsilon }_{stat }+\epsilon (\delta ))^2, \end{aligned}$$
(B.7)
$$\begin{aligned}&\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\ge \frac{\alpha _1}{4}{\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F^2 -\tau (\bar{\epsilon }_{stat }+\epsilon (\delta ))^2, \end{aligned}$$
(B.8)
$$\begin{aligned}&\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \kappa ^{t-T}(\Psi (\varTheta ^T)-\Psi (\hat{\varTheta })) +\frac{2\xi }{1-\kappa }(\bar{\epsilon }_{stat }^2+\epsilon ^2(\delta )), \end{aligned}$$
(B.9)

where \(\bar{\epsilon }_{stat }\), \(\epsilon (\delta )\), \(\kappa \) and \(\xi \) are defined in (25), (29), (26) and (27), respectively.

Proof

By the RSC condition (12), one has that

$$\begin{aligned} {\mathcal {L}}(\varTheta ^t)-{\mathcal {L}}(\hat{\varTheta })-\langle \langle \nabla {\mathcal {L}}(\hat{\varTheta }),\varTheta ^t-\hat{\varTheta } \rangle \rangle \ge \frac{1}{2}\left\{ \alpha _1{\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F^2-\tau {\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*^2\right\} .\nonumber \\ \end{aligned}$$
(B.10)

It then follows from Lemma B.1 and the assumption that \(\lambda \ge \left( \frac{128\tau R_q}{\alpha _1}\right) ^{1/q}\) that

$$\begin{aligned} {\mathcal {L}}(\hat{\varTheta })-{\mathcal {L}}(\varTheta ^t)-\langle \langle \nabla {\mathcal {L}}(\varTheta ^t),\hat{\varTheta }-\varTheta ^t \rangle \rangle\ge & {} \frac{1}{2}\left\{ \alpha _1{\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F^2-\tau {\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*^2\right\} \\\ge & {} -\tau (\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ))^2, \end{aligned}$$

which establishes (B.7). Furthermore, let \(g\in \partial {\left| \left| \left| \hat{\varTheta } \right| \right| \right| }_*\). Then by the convexity of \({\left| \left| \left| \cdot \right| \right| \right| }_*\), one has that

$$\begin{aligned} \lambda {\left| \left| \left| \varTheta ^t \right| \right| \right| }_*-\lambda {\left| \left| \left| \hat{\varTheta } \right| \right| \right| }_*-\langle \langle \lambda g,\varTheta ^t-\hat{\varTheta } \rangle \rangle \ge 0, \end{aligned}$$
(B.11)

and by the first-order optimality condition for \(\hat{\varTheta }\), one has that

$$\begin{aligned} \langle \langle \nabla \Psi (\hat{\varTheta }),\varTheta ^t-\hat{\varTheta } \rangle \rangle \ge 0. \end{aligned}$$
(B.12)

Combining (B.10), (B.11) and (B.12), we obtain that

$$\begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\ge \frac{1}{2}\left\{ \alpha _1{\left| \left| \left| \varTheta ^t -\hat{\varTheta } \right| \right| \right| }_F^2-\tau {\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*^2\right\} . \end{aligned}$$

Then applying Lemma B.1 to bound the term \({\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*^2\) and noting the assumption that \(\lambda \ge \left( \frac{128\tau R_q}{\alpha _1}\right) ^{1/q}\), we arrive at (B.8). Now we turn to prove (B.9). Define

$$\begin{aligned} \Psi _t(\varTheta ):={\mathcal {L}}(\varTheta ^t)+\langle \langle \nabla {{\mathcal {L}}(\varTheta ^t)},\varTheta -\varTheta ^t \rangle \rangle +\frac{v}{2}{\left| \left| \left| \varTheta -\varTheta ^t \right| \right| \right| }_F^2+\lambda {\left| \left| \left| \varTheta \right| \right| \right| }_*, \end{aligned}$$

which is the objective function minimized over the feasible region \({\mathbb {S}}=\{\varTheta \big |{\left| \left| \left| \varTheta \right| \right| \right| }_*\le \omega \}\) at iteration count t. For any \(a\in [0,1]\), it is easy to check that the matrix \(\varTheta _a=a\hat{\varTheta }+(1-a)\varTheta ^t\) belongs to \({\mathbb {S}}\) due to the convexity of \({\mathbb {S}}\). Since \(\varTheta ^{t+1}\) is the optimal solution of the optimization problem (24), we have that

$$\begin{aligned} \begin{aligned} \Psi _t(\varTheta ^{t+1})&\le \Psi _t(\varTheta _a)={\mathcal {L}}(\varTheta ^t)+ \langle \langle \nabla {{\mathcal {L}}(\varTheta ^t)},\varTheta _a-\varTheta ^t \rangle \rangle +\frac{v}{2} {\left| \left| \left| \varTheta _a-\varTheta ^t \right| \right| \right| }_F^2+\lambda {\left| \left| \left| \varTheta _a \right| \right| \right| }_*\\&\le {\mathcal {L}}(\varTheta ^t)+ \langle \langle \nabla {\mathcal {L}}(\varTheta ^t),a\hat{\varTheta }- a\varTheta ^t \rangle \rangle \\&\quad +\frac{va^2}{2}{\left| \left| \left| \hat{\varTheta }-\varTheta ^t \right| \right| \right| }_F^2 +a\lambda {\left| \left| \left| \hat{\varTheta } \right| \right| \right| }_*+(1-a)\lambda {\left| \left| \left| \varTheta ^t \right| \right| \right| }_*, \end{aligned} \end{aligned}$$

where the last inequality is from the convexity of \({\left| \left| \left| \cdot \right| \right| \right| }_*\). Then by (B.7) and the fact that \(a\in [0,1]\), one sees that

$$\begin{aligned} \Psi _t(\varTheta ^{t+1})\le \Psi (\varTheta ^t)-a(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta }))+\frac{va^2}{2}{\left| \left| \left| \hat{\varTheta }-\varTheta ^t \right| \right| \right| }_F^2+\tau (\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ))^2. \end{aligned}$$
(B.13)

Applying the RSM condition (13) to the matrix \(\varTheta ^{t+1}-\varTheta ^t\) and using some algebra, we have by the assumption \(v\ge \alpha _2\) that

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(\varTheta ^{t+1})-{\mathcal {L}}(\varTheta ^t)-\langle \langle \nabla {\mathcal {L}}(\varTheta ^t),\varTheta ^{t+1}-\varTheta ^t \rangle \rangle&\le \frac{1}{2}\left\{ \alpha _2{\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_F^2+\tau {\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_*^2\right\} \\&\le \frac{v}{2}{\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_F^2+\frac{\tau }{2}{\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_*^2. \end{aligned} \end{aligned}$$

Adding \(\lambda {\left| \left| \left| \varTheta ^{t+1} \right| \right| \right| }_*\) to both sides of the former inequality, we obtain that

$$\begin{aligned} \begin{aligned} \Psi (\varTheta ^{t+1})&\le {\mathcal {L}}(\varTheta ^t)+\langle \langle \nabla {\mathcal {L}}(\varTheta ^t),\varTheta ^{t+1}-\varTheta ^t \rangle \rangle + \lambda {\left| \left| \left| \varTheta ^{t+1} \right| \right| \right| }_*\\&\quad +\frac{v}{2}{\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_F^2 +\frac{\tau }{2}{\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_*^2\\&= \Psi _t(\varTheta ^{t+1})+\frac{\tau }{2}{\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_*^2. \end{aligned} \end{aligned}$$

This, together with (B.13), implies that

$$\begin{aligned} \Psi (\varTheta ^{t+1})\le & {} \Psi (\varTheta ^t)-a(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })) +\frac{va^2}{2}{\left| \left| \left| \hat{\varTheta }-\varTheta ^t \right| \right| \right| }_F^2+\frac{\tau }{2}{\left| \left| \left| \varTheta ^{t+1} -\varTheta ^t \right| \right| \right| }_*^2\nonumber \\{} & {} +\tau (\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ))^2. \end{aligned}$$
(B.14)

Define \(\varDelta ^t:=\varTheta ^t-\hat{\varTheta }\). Then it follows directly that \({\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_*^2\le ({\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*+{\left| \left| \left| \varDelta ^t \right| \right| \right| }_*)^2 \le 2{\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*^2+2{\left| \left| \left| \varDelta ^t \right| \right| \right| }_*^2\). Combining this inequality with (B.14), one has that

$$\begin{aligned} \Psi (\varTheta ^{t+1})\le & {} \Psi (\varTheta ^t)-a(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta }))+ \frac{va^2}{2}{\left| \left| \left| \hat{\varTheta }-\varTheta ^t \right| \right| \right| }_F^2+\tau ({\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*^2 +{\left| \left| \left| \varDelta ^t \right| \right| \right| }_*^2)\\{} & {} +\tau (\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ))^2. \end{aligned}$$

To simplify the notation, define \(\psi :=\tau (\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ))^2\), \(\zeta :=\tau \lambda ^{-q}R_q\) and \(\delta _t:=\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\). Using Lemma B.1 to bound the terms \({\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*^2\) and \({\left| \left| \left| \varDelta ^t \right| \right| \right| }_*^2\), we arrive at

$$\begin{aligned} \begin{aligned} \Psi (\varTheta ^{t+1})&\le \Psi (\varTheta ^t)-a(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta }))+\frac{va^2}{2}{\left| \left| \left| \varDelta ^t \right| \right| \right| }_F^2\\&\quad +64\tau \lambda ^{-q}R_q({\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_F^2+{\left| \left| \left| \varDelta ^t \right| \right| \right| }_F^2)+5\psi \\&= \Psi (\varTheta ^t)-a(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta }))+\left( \frac{va^2}{2}+64\zeta \right) {\left| \left| \left| \varDelta ^t \right| \right| \right| }_F^2+64\zeta {\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_F^2+5\psi . \end{aligned} \end{aligned}$$
(B.15)

Subtracting \(\Psi (\hat{\varTheta })\) from both sides of (B.15), one has by (B.8) that

$$\begin{aligned} \begin{aligned} \delta _{t+1}&\le (1-a)\delta _t+\frac{2va^2+256\zeta }{\alpha _1}(\delta _t+\psi ) +\frac{256\zeta }{\alpha _1}(\delta _{t+1}+\psi )+5\psi . \end{aligned} \end{aligned}$$

Setting \(a=\frac{\alpha _1}{4v}\in (0,1)\), it follows from the former inequality that

$$\begin{aligned} \begin{aligned} \left( 1-\frac{256\zeta }{\alpha _1}\right) \delta _{t+1}&\le \left( 1-\frac{\alpha _1}{8v}+\frac{256\zeta }{\alpha _1}\right) \delta _t +\left( \frac{\alpha _1}{8v}+\frac{512\zeta }{\alpha _1}+5\right) \psi , \end{aligned} \end{aligned}$$

or equivalently, \(\delta _{t+1}\le \kappa \delta _t+\xi (\bar{\epsilon }_{\text {stat}} +\epsilon (\delta ))^2\), where \(\kappa \) and \(\xi \) were previously defined in (26) and (27), respectively. Finally, we conclude that

$$\begin{aligned} \begin{aligned} \delta _t&\le \kappa ^{t-T}\delta _T+\xi (\bar{\epsilon }_{\text {stat}} +\epsilon (\delta ))^2 (1+\kappa +\kappa ^2+\cdots +\kappa ^{t-T-1})\\&\le \kappa ^{t-T}\delta _T+\frac{\xi }{1-\kappa }(\bar{\epsilon }_{\text {stat}} +\epsilon (\delta ))^2\le \kappa ^{t-T}\delta _T+\frac{2\xi }{1-\kappa }(\bar{\epsilon }_{\text {stat}}^2 +\epsilon ^2(\delta )). \end{aligned} \end{aligned}$$

The proof is complete. \(\square \)

By virtue of the above lemmas, we are now ready to prove Theorem 2. The proof mainly follows the arguments in [1, 21].

We first prove the following inequality:

$$\begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \delta ^*,\quad \forall t\ge T(\delta ^*). \end{aligned}$$
(B.16)

Divide iterations \(t=0,1,\ldots \) into a sequence of disjoint epochs \([T_k,T_{k+1}]\) and define the associated sequence of tolerances \(\delta _0>\delta _1>\cdots \) such that

$$\begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \delta _k,\quad \forall t\ge T_k, \end{aligned}$$

as well as the corresponding error term \(\epsilon _k:=2\min \left\{ \frac{\delta _k}{\lambda },\omega \right\} \). The values of \(\{(\delta _k,T_k)\}_{k\ge 1}\) will be specified later. In the first epoch, Lemma B.2 (cf. (B.9)) is applied with \(\epsilon _0=2\omega \) and \(T_0=0\) to conclude that

$$\begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \kappa ^t(\Psi (\varTheta ^0)-\Psi (\hat{\varTheta }))+ \frac{2\xi }{1-\kappa }(\bar{\epsilon }_{\text {stat}}^2 +4\omega ^2),\quad \forall t\ge T_0. \end{aligned}$$
(B.17)

Set \(\delta _1:=\frac{4\xi }{1-\kappa }(\bar{\epsilon }_{\text {stat}}^2 +4\omega ^2)\). Noting that \(\kappa \in (0,1)\) by assumption, it follows from (B.17) that for \(T_1:=\lceil \frac{\log (2\delta _0/\delta _1)}{\log (1/\kappa )} \rceil \),

$$\begin{aligned} \begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })&\le \frac{\delta _1}{2}+\frac{2\xi }{1-\kappa }\left( \bar{\epsilon }_{\text {stat}}^2 +4\omega ^2\right) =\delta _1\le \frac{8\xi }{1-\kappa }\max \left\{ \bar{\epsilon }_{\text {stat}}^2,4 \omega ^2\right\} ,\quad \forall t\ge T_1. \end{aligned} \end{aligned}$$

For \(k\ge 1\), define

$$\begin{aligned} \delta _{k+1}:=\frac{8\xi }{1-\kappa }\max \{\bar{\epsilon }_{\text {stat}}^2,\epsilon _k^2\},\qquad T_{k+1}:=T_k+\left\lceil \frac{\log (2\delta _k/\delta _{k+1})}{\log (1/\kappa )}\right\rceil . \end{aligned}$$
(B.18)

Then Lemma B.2 (cf. (B.9)) is applied to conclude that for all \(t\ge T_k\),

$$\begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \kappa ^{t-T_k}(\Psi (\varTheta ^{T_k})-\Psi (\hat{\varTheta })) +\frac{2\xi }{1-\kappa } (\bar{\epsilon }_{\text {stat}}^2+\epsilon _k^2), \end{aligned}$$

which further implies that

$$\begin{aligned} \Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \delta _{k+1}\le \frac{8\xi }{1-\kappa }\max \{\bar{\epsilon }_{\text {stat}}^2, \epsilon _k^2\},\quad \forall t\ge T_{k+1}. \end{aligned}$$

From (B.18), one obtains the recursion for \(\{(\delta _k,T_k)\}_{k=0}^\infty \) as follows

$$\begin{aligned} \delta _{k+1}&\le \frac{8\xi }{1-\kappa }\max \{\epsilon _k^2,\bar{\epsilon }_{\text {stat}}^2\}, \end{aligned}$$
(B.19a)
$$\begin{aligned} T_k&\le k+\frac{\log (2^k\delta _0/\delta _k)}{\log (1/\kappa )}. \end{aligned}$$
(B.19b)

Then by [3, Section 7.2], it is easy to see that (B.19a) implies that

$$\begin{aligned} \delta _{k+1}\le \frac{\delta _k}{4^{2^{k+1}}}\quad \text{ and }\quad \frac{\delta _{k+1}}{\lambda }\le \frac{\omega }{4^{2^k}},\quad \forall k\ge 1. \end{aligned}$$
(B.20)

Now let us determine the smallest k such that \(\delta _k\le \delta ^*\) by using (B.20). If we are in the first epoch, (B.16) clearly holds due to (B.19a). Otherwise, from (B.20), one sees that \(\delta _k\le \delta ^*\) holds after at most

$$\begin{aligned} k(\delta ^*):=\frac{\log (\log (\omega \lambda /\delta ^*)/\log 4)}{\log (2)}+1=\log _2\log _2(\omega \lambda /\delta ^*) \end{aligned}$$

epochs. Combining the above bound on \(k(\delta ^*)\) with (B.19b), one obtains that \(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \delta ^*\) holds for all iterations

$$\begin{aligned} t\ge \log _2\log _2\left( \frac{\omega \lambda }{\delta ^*}\right) \left( 1+\frac{\log 2}{\log (1/\kappa )}\right) +\frac{\log (\delta _0/\delta ^*)}{\log (1/\kappa )}, \end{aligned}$$

which establishes (B.16). Finally, one has by (B.16), (B.8) in Lemma B.2 and the assumption that \(\lambda \ge \left( \frac{128\tau R_q}{\alpha _1}\right) ^{1/q}\) that, for any \(t\ge T(\delta ^*)\),

$$\begin{aligned} \frac{\alpha _1}{4}{\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F^2 \le \Psi (\varTheta ^t)-\Psi (\hat{\varTheta }) +\tau \left( \epsilon (\delta ^*)+\bar{\epsilon }_{\text {stat}}\right) ^2 \le \delta ^*+\tau \left( \frac{2\delta ^*}{\lambda }+ \bar{\epsilon }_{\text {stat}}\right) ^2. \end{aligned}$$

Consequently, it follows from (30) and (31) that, for any \(t\ge T(\delta ^*)\),

$$\begin{aligned} {\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_F^2\le \frac{4}{\alpha _1} \left( \delta ^*+\frac{{\delta ^*}^2}{2\tau \omega ^2}+2\tau \bar{\epsilon }_{\text {stat}}^2\right) . \end{aligned}$$

The proof is complete.
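To make the iteration analysed above concrete, the following is a minimal numerical sketch of the composite gradient update (24), assuming the quadratic loss \({\mathcal {L}}(\varTheta )=\frac{1}{2}\langle \langle \hat{\varGamma }\varTheta ,\varTheta \rangle \rangle -\langle \langle \hat{\varUpsilon },\varTheta \rangle \rangle \) so that \(\nabla {\mathcal {L}}(\varTheta )=\hat{\varGamma }\varTheta -\hat{\varUpsilon }\); it is not the authors' implementation, and the function names and the bisection depth are illustrative. The update is a gradient step of length \(1/v\), followed by the exact proximal map of \(\frac{\lambda }{v}{\left| \left| \left| \cdot \right| \right| \right| }_*\) restricted to \({\mathbb {S}}=\{\varTheta \,\big |\,{\left| \left| \left| \varTheta \right| \right| \right| }_*\le \omega \}\), which reduces to soft-thresholding the singular values by \(\lambda /v\) plus an extra shift \(\mu \ge 0\) whenever the thresholded values still violate the nuclear-norm constraint.

```python
import numpy as np

def _shrink(s, t):
    # componentwise soft-thresholding of a nonnegative vector
    return np.maximum(s - t, 0.0)

def prox_grad_step(Theta, Gamma_hat, Upsilon_hat, lam, v, omega):
    """One composite gradient update (24) for the quadratic loss above."""
    grad = Gamma_hat @ Theta - Upsilon_hat
    G = Theta - grad / v                          # gradient step
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_new = _shrink(s, lam / v)                   # singular-value soft-thresholding
    if s_new.sum() > omega:                       # enforce |||Theta|||_* <= omega:
        lo, hi = 0.0, s_new.max()                 # bisection for the extra shift mu
        for _ in range(60):
            mu = 0.5 * (lo + hi)
            if _shrink(s_new, mu).sum() > omega:
                lo = mu
            else:
                hi = mu
        s_new = _shrink(s_new, hi)
    return (U * s_new) @ Vt

def prox_grad(Gamma_hat, Upsilon_hat, lam, v, omega, T=500):
    """Run the proximal gradient method from the zero initial matrix."""
    Theta = np.zeros_like(Upsilon_hat)
    for _ in range(T):
        Theta = prox_grad_step(Theta, Gamma_hat, Upsilon_hat, lam, v, omega)
    return Theta
```

In the errors-in-variables setting, \(\hat{\varGamma }\) and \(\hat{\varUpsilon }\) would be the error-corrected surrogates, e.g. \((\hat{\varGamma }_{\text {add}},\hat{\varUpsilon }_{\text {add}})\) from (10); note that \(\hat{\varGamma }\) may then be indefinite, which is exactly why the analysis only assumes the RSC/RSM conditions rather than convexity.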

C Proofs of Sect. 4

In this section, several important technical lemmas are provided first, which are used to verify the RSC/RSM conditions and the deviation conditions for specific errors-in-variables models (cf. Propositions 1–4). Some notation is needed to ease the exposition. For a symbol \(x\in \{0,*,F\}\) and a positive real number \(r\in {\mathbb {R}}^+\), define \({\mathbb {M}}_x(r):=\{A\in {\mathbb {R}}^{d_1\times d_2}\big |{\left| \left| \left| A \right| \right| \right| }_x\le r\}\), where \({\left| \left| \left| A \right| \right| \right| }_0\) denotes the rank of the matrix A. Then define the sparse set

$$\begin{aligned} {\mathbb {K}}(r):={\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1), \end{aligned}$$
(C.1)

and the cone set

$$\begin{aligned} {\mathbb {C}}(r):=\{A\in {\mathbb {R}}^{d_1\times d_2}\big |{\left| \left| \left| A \right| \right| \right| }_*\le \sqrt{r}{\left| \left| \left| A \right| \right| \right| }_F\}. \end{aligned}$$
(C.2)

The following lemma shows that the intersection of the nuclear-norm ball with the Frobenius-norm ball is contained in twice the closed convex hull of a simpler set of low-rank matrices.

Lemma C.1

For any constant \(r\ge 1\), it holds that

$$\begin{aligned} {\mathbb {M}}_*(\sqrt{r})\cap {\mathbb {M}}_F(1)\subseteq 2cl \{conv \{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\}, \end{aligned}$$

where \(cl \{\cdot \}\) and \(conv \{\cdot \}\) denote the topological closure and convex hull, respectively.

Proof

Note that when \(r> \min \{d_1,d_2\}\), this containment is trivial, since the right-hand set is equal to \({\mathbb {M}}_F(2)\) and the left-hand set is contained in \({\mathbb {M}}_F(1)\). Thus, we will assume \(1\le r\le \min \{d_1,d_2\}\).

Let \(A\in {\mathbb {M}}_*(\sqrt{r})\cap {\mathbb {M}}_F(1)\). Then it follows that \({\left| \left| \left| A \right| \right| \right| }_*\le \sqrt{r}\) and \({\left| \left| \left| A \right| \right| \right| }_F\le 1\). Consider a singular value decomposition of A: \(A=UDV^\top \), where \(U\in {\mathbb {R}}^{d_1\times d_1}\) and \(V\in {\mathbb {R}}^{d_2\times d_2}\) are orthogonal matrices, and \(D\in {\mathbb {R}}^{d_1\times d_2}\) consists of \(\sigma _1(D),\sigma _2(D),\ldots ,\sigma _k(D)\) on the “diagonal” and 0 elsewhere with \(k=\text {rank}(A)\). Write \(D=\text {diag}(\sigma _1(D),\sigma _2(D),\ldots ,\sigma _k(D),0,\ldots ,0)\), and use \(\text {vec}(D)\) to denote the vectorized form of the matrix D. Then it follows that \(\Vert \text {vec}(D)\Vert _1\le \sqrt{r}\) and \(\Vert \text {vec}(D)\Vert _2\le 1\). Partition the support of \(\text {vec}(D)\) into disjoint subsets \(T_1,T_2,\ldots \), such that \(T_1\) is the index set corresponding to the r largest elements in absolute value of \(\text {vec}(D)\), \(T_2\) indexes the next r largest elements, and so on. Write \(D_i=\text {diag}(\text {vec}(D)_{T_i})\) and \(A_i=UD_iV^\top \). Then one has that \({\left| \left| \left| A_i \right| \right| \right| }_0=\text {rank}(A_i)\le r\) and \({\left| \left| \left| A_i \right| \right| \right| }_F\le 1\). Write \(B_i=2A_i/{\left| \left| \left| A_i \right| \right| \right| }_F\) and \(t_i={\left| \left| \left| A_i \right| \right| \right| }_F/2\). Then it holds that \(B_i\in 2\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\) and \(t_i\ge 0\). Now it suffices to check that A can be expressed as a convex combination of matrices in \(2\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\), namely \(A=\sum _{i\ge 1}t_iB_i\). Since the zero matrix is contained in \(2\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\), it suffices to show that \(\sum _{i\ge 1}t_i\le 1\), which is equivalent to \(\sum _{i\ge 1}\Vert \text {vec}(D)_{T_i}\Vert _2\le 2\). To prove this, first note that \(\Vert \text {vec}(D)_{T_1}\Vert _2\le \Vert \text {vec}(D)\Vert _2\le 1\). Second, note that for \(i\ge 2\), each element of \(\text {vec}(D)_{T_i}\) is bounded in magnitude by \(\Vert \text {vec}(D)_{T_{i-1}}\Vert _1/r\), and thus \(\Vert \text {vec}(D)_{T_i}\Vert _2\le \Vert \text {vec}(D)_{T_{i-1}}\Vert _1/\sqrt{r}\). Combining these two facts, one has that

$$\begin{aligned} \sum _{i\ge 1}\Vert \text {vec}(D)_{T_i}\Vert _2\le & {} 1+\sum _{i\ge 2}\Vert \text {vec}(D)_{T_i}\Vert _2\le 1+\sum _{i\ge 2}\Vert \text {vec}(D)_{T_{i-1}}\Vert _1/\sqrt{r}\\\le & {} 1+\Vert \text {vec}(D)\Vert _1/\sqrt{r}\le 2. \end{aligned}$$

The proof is complete. \(\square \)
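The blocking construction used in this proof is easy to check numerically. The following sketch (illustrative only; the surrogate singular values, dimensions and seed are our own choices) groups the sorted values into blocks of size r and verifies that the weights \(t_i=\Vert \text {vec}(D)_{T_i}\Vert _2/2\) sum to at most 1:

```python
import numpy as np

# Numerical check of the convex-combination weights in the proof of Lemma C.1.
rng = np.random.default_rng(0)
d, r = 50, 5
s = np.sort(rng.exponential(size=d))[::-1]    # surrogate singular values, sorted
s /= np.linalg.norm(s)                        # ||s||_2 = 1, i.e. |||A|||_F = 1
s *= min(1.0, np.sqrt(r) / s.sum())           # enforce ||s||_1 <= sqrt(r)

blocks = [s[i:i + r] for i in range(0, d, r)] # the index blocks T_1, T_2, ...
t = [np.linalg.norm(b) / 2 for b in blocks]   # weights t_i = ||vec(D)_{T_i}||_2 / 2
print(f"sum of weights t_i = {sum(t):.4f} (<= 1, as the proof guarantees)")
```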

Lemma C.2

Let \(r\ge 1\), \(\delta >0\) be two constants, and \(\varGamma \in {\mathbb {R}}^{d_1\times d_1}\) be a fixed matrix. Suppose that the following condition holds

$$\begin{aligned} |\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |\le \delta ,\quad \forall \varDelta \in {\mathbb {K}}(2r)\ (\text {cf.}\ (\mathrm{C.1})). \end{aligned}$$
(C.3)

Then we have that

$$\begin{aligned} |\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |\le 12\delta ({\left| \left| \left| \varDelta \right| \right| \right| }_F^2+\frac{1}{r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2),\quad \forall \varDelta \in {\mathbb {R}}^{d_1\times d_2}. \end{aligned}$$
(C.4)

Proof

We begin by establishing the inequalities

$$\begin{aligned} |\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |&\le 12\delta {\left| \left| \left| \varDelta \right| \right| \right| }_F^2,\quad \forall \varDelta \in {\mathbb {C}}(r), \end{aligned}$$
(C.5a)
$$\begin{aligned} |\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |&\le \frac{12\delta }{r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2,\quad \forall \varDelta \notin {\mathbb {C}}(r), \end{aligned}$$
(C.5b)

where \({\mathbb {C}}(r)\) is defined in (C.2). Then (C.4) follows directly.

Now we turn to prove (C.5). By rescaling, (C.5a) holds if one can check that

$$\begin{aligned} |\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |\le 12\delta ,\quad \text {for all}\ \varDelta \ \text {satisfying}\ {\left| \left| \left| \varDelta \right| \right| \right| }_F=1\ \text {and}\ {\left| \left| \left| \varDelta \right| \right| \right| }_*\le \sqrt{r}. \end{aligned}$$
(C.6)

It then follows from Lemma C.1 and continuity that (C.6) can be reduced to the problem of proving that

$$\begin{aligned} |\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |\le 12\delta ,\quad \forall \varDelta \in 2\text {conv}\{{\mathbb {K}}(r)\}=\text {conv}\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(2)\}. \end{aligned}$$

For this purpose, consider a weighted linear combination of the form \(\varDelta =\sum _it_i\varDelta _i\), with weights \(t_i\ge 0\) such that \(\sum _it_i=1\), \({\left| \left| \left| \varDelta _i \right| \right| \right| }_0\le r\), and \({\left| \left| \left| \varDelta _i \right| \right| \right| }_F\le 2\) for each i. Then one has that

$$\begin{aligned} \langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle =\langle \langle \varGamma (\sum _it_i\varDelta _i),(\sum _it_i\varDelta _i) \rangle \rangle =\sum _{i,j}t_it_j\langle \langle \varGamma \varDelta _i,\varDelta _j \rangle \rangle . \end{aligned}$$

On the other hand, it holds that for all \(i,j\),

$$\begin{aligned} |\langle \langle \varGamma \varDelta _i,\varDelta _j \rangle \rangle |=\frac{1}{2}|\langle \langle \varGamma (\varDelta _i+\varDelta _j),(\varDelta _i+\varDelta _j) \rangle \rangle -\langle \langle \varGamma \varDelta _i,\varDelta _i \rangle \rangle -\langle \langle \varGamma \varDelta _j,\varDelta _j \rangle \rangle |.\nonumber \\ \end{aligned}$$
(C.7)

Noting that \(\frac{1}{2}\varDelta _i\), \(\frac{1}{2}\varDelta _j\), \(\frac{1}{4}(\varDelta _i+\varDelta _j)\) all belong to \({\mathbb {K}}(2r)\), and then combining (C.7) with (C.3), we have that

$$\begin{aligned} |\langle \langle \varGamma \varDelta _i,\varDelta _j \rangle \rangle |\le \frac{1}{2} (16\delta +4\delta +4\delta )=12\delta , \end{aligned}$$

for all \(i,j\), and thus \(|\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |\le \sum _{i,j}t_it_j(12\delta )=12\delta (\sum _it_i)^2=12\delta \), which establishes (C.5a). As for inequality (C.5b), note that for \(\varDelta \notin {\mathbb {C}}(r)\), one has that

$$\begin{aligned} \frac{|\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle |}{{\left| \left| \left| \varDelta \right| \right| \right| }_*^2}\le \frac{1}{r}\sup _{{\left| \left| \left| U \right| \right| \right| }_*\le \sqrt{r},{\left| \left| \left| U \right| \right| \right| }_F\le 1}|\langle \langle \varGamma U,U \rangle \rangle |\le \frac{12\delta }{r}, \end{aligned}$$
(C.8)

where the first inequality follows by the substitution \(U=\sqrt{r}\frac{\varDelta }{{\left| \left| \left| \varDelta \right| \right| \right| }_*}\), and the second inequality is due to the same argument used to prove (C.5a), since such a U satisfies \({\left| \left| \left| U \right| \right| \right| }_*\le \sqrt{r}\) and \({\left| \left| \left| U \right| \right| \right| }_F\le 1\). Rearranging (C.8) yields (C.5b). The proof is complete. \(\square \)

Lemma C.3

Let \(r\ge 1\) be any constant. Suppose that \(\hat{\varGamma }\in {\mathbb {R}}^{d_1\times d_1}\) is an estimator of \(\Sigma _x\) satisfying

$$\begin{aligned} |\langle \langle (\hat{\varGamma }-\Sigma _x)\varDelta ,\varDelta \rangle \rangle |\le \frac{\lambda _{\min }(\Sigma _x)}{24},\quad \forall \varDelta \in {\mathbb {K}}(2r) \ (cf. \ (\mathrm{C.1})). \end{aligned}$$

Then we have that

$$\begin{aligned} \langle \langle \hat{\varGamma }\varDelta ,\varDelta \rangle \rangle \ge \frac{\lambda _{\min }(\Sigma _x)}{2}{\left| \left| \left| \varDelta \right| \right| \right| }_F^2-\frac{\lambda _{\min }(\Sigma _x)}{2r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2,\\ \langle \langle \hat{\varGamma }\varDelta ,\varDelta \rangle \rangle \le \frac{3\lambda _{\max }(\Sigma _x)}{2}{\left| \left| \left| \varDelta \right| \right| \right| }_F^2+\frac{\lambda _{\min }(\Sigma _x)}{2r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2. \end{aligned}$$

Proof

Set \(\varGamma =\hat{\varGamma }-\Sigma _x\) and \(\delta =\frac{\lambda _{\min }(\Sigma _x)}{24}\). Then Lemma C.2 is applicable to conclude that

$$\begin{aligned} |\langle \langle (\hat{\varGamma }-\Sigma _x)\varDelta ,\varDelta \rangle \rangle |\le \frac{\lambda _{\min }(\Sigma _x)}{2}({\left| \left| \left| \varDelta \right| \right| \right| }_F^2+\frac{1}{r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2), \end{aligned}$$

which implies that

$$\begin{aligned} \langle \langle \hat{\varGamma }\varDelta ,\varDelta \rangle \rangle\ge & {} \langle \langle \Sigma _x\varDelta ,\varDelta \rangle \rangle -\frac{\lambda _{\min }(\Sigma _x)}{2}({\left| \left| \left| \varDelta \right| \right| \right| }_F^2 +\frac{1}{r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2),\\ \langle \langle \hat{\varGamma }\varDelta ,\varDelta \rangle \rangle\le & {} \langle \langle \Sigma _x\varDelta ,\varDelta \rangle \rangle +\frac{\lambda _{\min }(\Sigma _x)}{2}({\left| \left| \left| \varDelta \right| \right| \right| }_F^2 +\frac{1}{r}{\left| \left| \left| \varDelta \right| \right| \right| }_*^2). \end{aligned}$$

Then the conclusion follows from the fact that \(\lambda _{\min }(\Sigma _x){\left| \left| \left| \varDelta \right| \right| \right| }_F^2\le \langle \langle \Sigma _x\varDelta ,\varDelta \rangle \rangle \le \lambda _{\max }(\Sigma _x){\left| \left| \left| \varDelta \right| \right| \right| }_F^2\). The proof is complete. \(\square \)
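As a quick numerical sanity check of Lemma C.3 (a sketch under the toy assumption \(\Sigma _x={\mathbb {I}}_{d_1}\), so that \(\lambda _{\min }(\Sigma _x)=\lambda _{\max }(\Sigma _x)=1\)): a small symmetric perturbation of \(\Sigma _x\) satisfies the spectral hypothesis, since \(|\langle \langle E\varDelta ,\varDelta \rangle \rangle |\le {\left| \left| \left| E \right| \right| \right| }_{\text {op}}\) on \({\mathbb {K}}(2r)\), and the two curvature bounds can then be tested on arbitrary matrices \(\varDelta \):

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, r = 15, 10, 3
lam_min = lam_max = 1.0                        # spectrum of Sigma_x = I
E = rng.standard_normal((d1, d1)) * 1e-3
Gamma_hat = np.eye(d1) + 0.5 * (E + E.T)       # small symmetric perturbation

for _ in range(100):                           # test the bounds on random Delta
    Delta = rng.standard_normal((d1, d2))
    quad = np.trace(Delta.T @ Gamma_hat @ Delta)
    fro2 = np.linalg.norm(Delta) ** 2
    nuc2 = np.linalg.norm(Delta, ord='nuc') ** 2
    assert quad >= 0.5 * lam_min * fro2 - lam_min / (2 * r) * nuc2
    assert quad <= 1.5 * lam_max * fro2 + lam_min / (2 * r) * nuc2
print("both curvature bounds of Lemma C.3 hold on all test matrices")
```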

Lemma C.4

Let \(t>0\) be any constant, and \(X\in {\mathbb {R}}^{n\times d_1}\) be a zero-mean sub-Gaussian matrix with parameters \((\Sigma _x,\sigma _x^2)\). Then for any fixed matrix \(\varDelta \in {\mathbb {R}}^{d_1\times d_2}\), there exists a universal positive constant c such that

$$\begin{aligned} {\mathbb {P}}\left[ \Big |\frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}-{\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) \Big |\ge t\right] \le 2\exp \left( -cn\min \left( \frac{t^2}{d_2^2\sigma _x^4},\frac{t}{d_2\sigma _x^2}\right) +\log d_2\right) . \end{aligned}$$

Proof

By the definition of the matrix Frobenius norm, one has that

$$\begin{aligned} \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}-{\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) = \sum _{j=1}^{d_2}\left[ \frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n}-{\mathbb {E}}\left( \frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n}\right) \right] . \end{aligned}$$

Then it follows from elementary probability theory that

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\left[ \Big |\frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}-{\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) \Big |\le t\right]&= {\mathbb {P}}\left\{ \Big |\sum _{j=1}^{d_2}\left[ \frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n}-{\mathbb {E}}\left( \frac{\Vert X\varDelta _{\cdot j} \Vert _2^2}{n}\right) \right] \Big |\le t\right\} \\&\ge {\mathbb {P}}\left\{ \bigcap _{j=1}^{d_2}\left\{ \Big |\frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n} -{\mathbb {E}}\left( \frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n}\right) \Big | \le \frac{t}{d_2}\right\} \right\} \\&\ge \sum _{j=1}^{d_2}{\mathbb {P}}\left[ \Big |\frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n}-{\mathbb {E}}\left( \frac{\Vert X\varDelta _{\cdot j}\Vert _2^2}{n}\right) \Big |\le \frac{t}{d_2}\right] -(d_2-1). \end{aligned} \end{aligned}$$

On the other hand, note the assumption that X is a sub-Gaussian matrix with parameters \((\Sigma _x,\sigma _x^2)\). Then [23, Lemma 14] is applicable to conclude that there exists a universal positive constant c such that

$$\begin{aligned} \begin{aligned} {\mathbb {P}}\left[ \Big |\frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}-{\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) \Big |\le t\right]&\ge d_2\left( 1-2\exp \left( -cn\min \left( \frac{t^2}{d_2^2\sigma _x^4}, \frac{t}{d_2\sigma _x^2}\right) \right) \right) -(d_2-1)\\&= 1-2\exp \left( -cn\min \left( \frac{t^2}{d_2^2\sigma _x^4}, \frac{t}{d_2\sigma _x^2}\right) +\log d_2\right) , \end{aligned} \end{aligned}$$

which completes the proof. \(\square \)
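A small Monte Carlo experiment illustrates the concentration behind Lemma C.4 (a sketch under the assumptions \(\Sigma _x={\mathbb {I}}\) and \(\sigma _x=1\) with Gaussian rows; the sample size, dimensions and threshold t are arbitrary choices of ours), where \({\mathbb {E}}({\left| \left| \left| X\varDelta \right| \right| \right| }_F^2/n)={\left| \left| \left| \varDelta \right| \right| \right| }_F^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, d2, trials, t = 200, 40, 5, 500, 0.5
Delta = rng.standard_normal((d1, d2))
Delta /= np.linalg.norm(Delta)                 # |||Delta|||_F = 1, so the mean is 1
hits = 0
for _ in range(trials):
    X = rng.standard_normal((n, d1))           # rows ~ N(0, I), i.e. Sigma_x = I
    hits += abs(np.linalg.norm(X @ Delta) ** 2 / n - 1.0) >= t
print(f"empirical P[|deviation| >= {t}] = {hits / trials:.4f}")  # exponentially small in n
```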

Lemma C.5

Let \(t>0\), \(r\ge 1\) be any constants, and \(X\in {\mathbb {R}}^{n\times d_1}\) be a zero-mean sub-Gaussian matrix with parameters \((\Sigma _x,\sigma _x^2)\). Then there exists a universal positive constant c such that

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}\left[ \sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n} -{\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) \Big |\ge t\right] \\&\quad \le 2\exp \left( -cn\min \left( \frac{t^2}{d_2^2\sigma _x^4},\frac{t}{d_2\sigma _x^2}\right) +\log d_2+2r(2\max (d_1,d_2)+\log (\min (d_1,d_2)))\right) . \end{aligned} \end{aligned}$$

Proof

For an index set \(J\subseteq \{1,2,\ldots ,\min \{d_1,d_2\}\}\), we define the set \(S_J=\{\varDelta \in {\mathbb {R}}^{d_1\times d_2}\big |{\left| \left| \left| \varDelta \right| \right| \right| }_F\le 1, \text {supp}(\text {vec}(\sigma (\varDelta )))\subseteq J\}\), where \(\text {vec}(\sigma (\varDelta ))\) refers to the vector of singular values of the matrix \(\varDelta \). Then it is easy to see that \({\mathbb {K}}(2r)=\cup _{|J|\le 2r}S_J\). Let \(G=\{U_1,U_2,\ldots ,U_m\}\) be a 1/3-cover of \(S_J\); then for every \(\varDelta \in S_J\), there exists some \(U_i\) such that \({\left| \left| \left| \tilde{\varDelta } \right| \right| \right| }_F\le 1/3\), where \(\tilde{\varDelta }=\varDelta -U_i\). It then follows from [4, Section 7.2] that one can construct G with \(|G|\le 27^{4r\max (d_1,d_2)}\). Define \(\Psi (\varDelta _1,\varDelta _2)=\langle \langle (\frac{X^\top X}{n}-\Sigma _x)\varDelta _1,\varDelta _2 \rangle \rangle \). Then one has that

$$\begin{aligned} \sup _{\varDelta \in S_J}|\Psi (\varDelta ,\varDelta )|\le \max _i|\Psi (U_i,U_i)|+2\sup _{\varDelta \in S_J}\max _i|\Psi (\tilde{\varDelta },U_i)|+\sup _{\varDelta \in S_J}|\Psi (\tilde{\varDelta },\tilde{\varDelta })|. \end{aligned}$$

It then follows from the fact that \(3\tilde{\varDelta }\in S_J\) that

$$\begin{aligned} \sup _{\varDelta \in S_J}|\Psi (\varDelta ,\varDelta )|\le \max _i|\Psi (U_i,U_i)|+\sup _{\varDelta \in S_J}(\frac{2}{3}|\Psi (\varDelta ,\varDelta )|+\frac{1}{9}|\Psi (\varDelta ,\varDelta )|), \end{aligned}$$

and hence, \(\sup _{\varDelta \in S_J}|\Psi (\varDelta ,\varDelta )|\le \frac{9}{2}\max _i|\Psi (U_i,U_i)|\). By Lemma C.4 and a union bound, one has that there exists a universal positive constant \(c'\) such that

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}\left[ \sup _{\varDelta \in S_J}\Big |\frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}-{\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) \Big |\ge t\right] \le 27^{4r\max (d_1,d_2)} \\&\quad \times 2\exp \left( -c'n\min \left( \frac{t^2}{d_2^2\sigma _x^4},\frac{t}{d_2\sigma _x^2}\right) +\log d_2\right) . \end{aligned} \end{aligned}$$

Finally, taking a union bound over the \(\left( {\begin{array}{c}\min (d_1,d_2)\\ \lfloor 2r\rfloor \end{array}}\right) \le (\min (d_1,d_2))^{2r}\) choices of the set J yields that there exists a universal positive constant c such that

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}\left[ \sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}- {\mathbb {E}}\left( \frac{{\left| \left| \left| X\varDelta \right| \right| \right| }_F^2}{n}\right) \Big |\ge t\right] \\&\quad \le 2\exp \left( -cn\min \left( \frac{t^2}{d_2^2\sigma _x^4},\frac{t}{d_2\sigma _x^2}\right) +\log d_2+2r(2\max (d_1,d_2)+\log (\min (d_1,d_2)))\right) . \end{aligned} \end{aligned}$$

The proof is complete. \(\square \)

By virtue of the above lemmas, we are now ready to prove Propositions 1–4.

Proof of Proposition 1

Set

$$\begin{aligned} r=\frac{1}{c'}\min \left( \frac{\lambda _{\min }^2(\Sigma _x)}{d_2^2\sigma _z^4}, \frac{\lambda _{\min }(\Sigma _x)}{d_2\sigma _z^2}\right) \frac{n}{2\max (d_1,d_2)+\log (\min (d_1,d_2))}, \end{aligned}$$
(C.9)

with \(c'>0\) being chosen sufficiently small so that \(r\ge 1\). Noting that \(\hat{\varGamma }_\text {add}-\Sigma _x=\frac{Z^\top Z}{n}-\Sigma _z\), one sees by Lemma C.3 that it suffices to show that

$$\begin{aligned} \sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle (\frac{Z^\top Z}{n}-\Sigma _z)\varDelta ,\varDelta \rangle \rangle \Big |\le \frac{\lambda _{\min }(\Sigma _x)}{24} \end{aligned}$$

holds with high probability. Let \(D(r):=\sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle (\frac{Z^\top Z}{n}-\Sigma _z)\varDelta ,\varDelta \rangle \rangle \Big |\) for simplicity. Note that the matrix Z is sub-Gaussian with parameters \((\Sigma _z,\sigma _z^2)\). Then it follows from Lemma C.5 that there exists a universal positive constant \(c''\) such that

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}\left[ D(r)\ge \frac{\lambda _{\min }(\Sigma _x)}{24}\right] \\&\quad \le 2\exp \left( -c''n\min \left( \frac{\lambda _{\min }^2(\Sigma _x)}{576d_2^2\sigma _z^4}, \frac{\lambda _{\min }(\Sigma _x)}{24d_2\sigma _z^2}\right) +\log d_2+2r(2\max (d_1,d_2)+\log (\min (d_1,d_2)))\right) . \end{aligned} \end{aligned}$$

This inequality, together with (C.9), implies that there exist universal positive constants \((c_0,c_1)\) such that, with \(\tau =c_0\tau _\text {add}\),

$$\begin{aligned} {\mathbb {P}}\left[ D(r)\ge \frac{\lambda _{\min }(\Sigma _x)}{24}\right] \le 2\exp \left( -c_1n\min \left( \frac{\lambda _{\min }^2(\Sigma _x)}{d_2^2 \sigma _z^4},\frac{\lambda _{\min }(\Sigma _x)}{d_2\sigma _z^2}\right) +\log d_2\right) , \end{aligned}$$

which completes the proof. \(\square \)
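
As a numerical companion to Proposition 1, the following minimal Python sketch (assuming, for illustration only, an identity \(\Sigma _x\) and isotropic additive noise; all names are ours) checks that the error-corrected Gram matrix \(\hat{\varGamma }_\text {add}=\frac{Z^\top Z}{n}-\Sigma _w\) concentrates around \(\Sigma _x\), which is exactly what the bound on D(r) encodes on \({\mathbb {K}}(2r)\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d1, sigma_w = 2000, 30, 0.5
Sigma_x = np.eye(d1)
Sigma_w = sigma_w ** 2 * np.eye(d1)

X = rng.standard_normal((n, d1))                 # true design, rows i.i.d. N(0, Sigma_x)
Z = X + sigma_w * rng.standard_normal((n, d1))   # observed design with additive noise W

Gamma_add = Z.T @ Z / n - Sigma_w                # error-corrected Gram matrix
print(np.linalg.norm(Gamma_add - Sigma_x, 2))    # operator-norm deviation, small for large n
print(np.linalg.eigvalsh(Gamma_add).min())       # stays bounded away from 0, as the argument requires
```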

Proof of Proposition 2

By the definitions of \(\hat{\varGamma }_{\text {add}}\) and \(\hat{\varUpsilon }_{\text {add}}\) (cf. (10)), one has that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| \hat{\varUpsilon }_{\text {add}}-\hat{\varGamma }_{\text {add}}\varTheta ^* \right| \right| \right| }_{\text {op}}&= {\left| \left| \left| \frac{Z^\top Y}{n}-(\frac{Z^\top Z}{n}-\Sigma _w)\varTheta ^* \right| \right| \right| }_{\text {op}}\\&= {\left| \left| \left| \frac{Z^\top (X\varTheta ^*+\epsilon )}{n}-(\frac{Z^\top Z}{n}-\Sigma _w)\varTheta ^* \right| \right| \right| }_{\text {op}}\\&\le {\left| \left| \left| \frac{Z^\top \epsilon }{n} \right| \right| \right| }_{\text {op}}+{\left| \left| \left| (\Sigma _w-\frac{Z^\top W}{n})\varTheta ^* \right| \right| \right| }_{\text {op}}\\&\le {\left| \left| \left| \frac{Z^\top \epsilon }{n} \right| \right| \right| }_{\text {op}}+\left( {\left| \left| \left| \Sigma _w \right| \right| \right| }_{\text {op}}+{\left| \left| \left| \frac{Z^\top W}{n} \right| \right| \right| }_{\text {op}}\right) {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*, \end{aligned} \end{aligned}$$

where the second equality is due to the fact that \(Y=X\varTheta ^*+\epsilon \), the first inequality follows from \(Z=X+W\) (so that \(Z^\top X=Z^\top Z-Z^\top W\)) together with the triangle inequality, and the second inequality uses the triangle inequality and the bound \({\left| \left| \left| AB \right| \right| \right| }_\text {op}\le {\left| \left| \left| A \right| \right| \right| }_\text {op}{\left| \left| \left| B \right| \right| \right| }_*\). Recall that the matrices X, W and \(\epsilon \) have i.i.d. rows sampled from the Gaussian distributions \({\mathcal {N}}(0,\Sigma _x)\), \({\mathcal {N}}(0,\sigma _w^2{\mathbb {I}}_{d_1})\) and \({\mathcal {N}}(0,\sigma _\epsilon ^2{\mathbb {I}}_{d_2})\), respectively. Then one has that \(\Sigma _w=\sigma _w^2{\mathbb {I}}_{d_1}\) and \({\left| \left| \left| \Sigma _w \right| \right| \right| }_\text {op}=\sigma _w^2\). It then follows from [25, Lemma 3] that there exist universal positive constants \((c_3,c_4,c_5)\) such that

$$\begin{aligned} {\left| \left| \left| \hat{\varUpsilon }_{\text {add}}-\hat{\varGamma }_{\text {add}}\varTheta ^* \right| \right| \right| }_{\text {op}} \le c_3\sigma _\epsilon \sqrt{\lambda _{\max }(\Sigma _z)}\sqrt{\frac{d_1+d_2}{n}} +\left( \sigma _w^2+c_3\sigma _w\sqrt{\lambda _{\max }(\Sigma _z)}\sqrt{\frac{2d_1}{n}}\right) {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*, \end{aligned}$$

with probability at least \(1-c_4\exp (-c_5\log (\max (d_1,d_2)))\). Recall that the nuclear norm of \(\varTheta ^*\) is assumed to be bounded as \({\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le \omega \). Then, up to constant factors, we conclude that there exist universal positive constants \((c_0,c_1,c_2)\) such that

$$\begin{aligned} {\left| \left| \left| \hat{\varUpsilon }_{\text {add}}-\hat{\varGamma }_{\text {add}}\varTheta ^* \right| \right| \right| }_{\text {op}} \le c_0\phi _\text {add}\sqrt{\frac{\max (d_1,d_2)}{n}}, \end{aligned}$$

with probability at least \(1-c_1\exp (-c_2\log (\max (d_1,d_2)))\). The proof is complete. \(\square \)
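
The rate in Proposition 2 can likewise be eyeballed by simulation. In the Python sketch below (a sketch under our own simplifying assumptions: identity covariances, a rank-2 \(\varTheta ^*\), and hypothetical variable names), the operator-norm deviation \({\left| \left| \left| \hat{\varUpsilon }_{\text {add}}-\hat{\varGamma }_{\text {add}}\varTheta ^* \right| \right| \right| }_{\text {op}}\) is compared with \(\sqrt{\max (d_1,d_2)/n}\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d1, d2, rank = 2000, 30, 10, 2
sigma_w, sigma_eps = 0.5, 0.3

Theta_star = rng.standard_normal((d1, rank)) @ rng.standard_normal((rank, d2))
X = rng.standard_normal((n, d1))
Z = X + sigma_w * rng.standard_normal((n, d1))                  # corrupted design
Y = X @ Theta_star + sigma_eps * rng.standard_normal((n, d2))   # multivariate responses

Gamma_add = Z.T @ Z / n - sigma_w ** 2 * np.eye(d1)   # surrogate for Sigma_x, cf. (10)
Upsilon_add = Z.T @ Y / n                             # surrogate for Sigma_x Theta*, cf. (10)

dev = np.linalg.norm(Upsilon_add - Gamma_add @ Theta_star, 2)
print(dev, np.sqrt(max(d1, d2) / n))  # the two quantities are of the same order
```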

Proof of Proposition 3

This proof is similar to that of Proposition 1 in the additive noise case. Set \(\sigma ^2=\frac{{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2}{(1-\rho )^2}\), and

$$\begin{aligned} r=\frac{1}{c'}\min \left( \frac{\lambda _{\min }^2(\Sigma _x)}{d_2^2\sigma ^4}, \frac{\lambda _{\min }(\Sigma _x)}{d_2\sigma ^2}\right) \frac{n}{2\max (d_1,d_2)+\log (\min (d_1,d_2))}, \end{aligned}$$
(C.10)

with \(c'>0\) being chosen sufficiently small so that \(r\ge 1\). Recall that \(\oslash \) denotes element-wise division, and note that

$$\begin{aligned} \hat{\varGamma }_{\text {mis}}=\frac{1}{(1-\rho )^2}\frac{Z^\top Z}{n}-\rho \cdot \text {diag}\left( \frac{1}{(1-\rho )^2}\frac{Z^\top Z}{n}\right) =\frac{Z^\top Z}{n}\oslash M, \end{aligned}$$

and thus

$$\begin{aligned} \hat{\varGamma }_{\text {mis}}-\Sigma _x=\frac{Z^\top Z}{n}\oslash M-\Sigma _x=(\frac{Z^\top Z}{n}-\Sigma _z)\oslash M. \end{aligned}$$

By Lemma C.3, one sees that it suffices to show that

$$\begin{aligned} \sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle ((\frac{Z^\top Z}{n}-\Sigma _z)\oslash M) \varDelta ,\varDelta \rangle \rangle \Big |\le \frac{\lambda _{\min }(\Sigma _x)}{24} \end{aligned}$$

holds with high probability. Let \(D(r):=\sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle ((\frac{Z^\top Z}{n}-\Sigma _z)\oslash M) \varDelta ,\varDelta \rangle \rangle \Big |\) for simplicity. Moreover, one has that

$$\begin{aligned} \Big |\langle \langle ((\frac{Z^\top Z}{n}-\Sigma _z)\oslash M) \varDelta ,\varDelta \rangle \rangle \Big |\le \frac{1}{(1-\rho )^2}\Big | \langle \langle (\frac{Z^\top Z}{n}-\Sigma _z)\varDelta ,\varDelta \rangle \rangle \Big |. \end{aligned}$$

Note that the matrix Z is sub-Gaussian with parameters \((\Sigma _z,{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2)\) [23]. Then it follows from Lemma C.5 that there exists a universal positive constant \(c''\) such that

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}\left[ \sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle (\frac{Z^\top Z}{n}-\Sigma _z)\varDelta ,\varDelta \rangle \rangle \Big |\ge (1-\rho )^2\frac{\lambda _{\min }(\Sigma _x)}{24}\right] \\&\quad \le 2\exp \left( -c''n\min \left( (1-\rho )^4\frac{\lambda _{\min }^2 (\Sigma _x)}{576d_2^2{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^4},(1-\rho )^2 \frac{\lambda _{\min }(\Sigma _x)}{24d_2{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2}\right) \right. \\&\qquad \left. +\log d_2+2r(2\max (d_1,d_2)+\log (\min (d_1,d_2)))\right) . \end{aligned} \end{aligned}$$

This inequality, together with (C.10), implies that there exist universal positive constants \((c_0,c_1)\) such that, with \(\tau =c_0\tau _\text {mis}\),

$$\begin{aligned} \begin{aligned}&{\mathbb {P}}\left[ D(r)\ge \frac{\lambda _{\min }(\Sigma _x)}{24}\right] \\&\quad \le 2\exp \left( -c_1n\min \left( (1-\rho )^4\frac{\lambda _{\min }^2(\Sigma _x)}{d_2^2{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^4},(1-\rho )^2\frac{\lambda _{\min } (\Sigma _x)}{d_2{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2}\right) +\log d_2\right) , \end{aligned} \end{aligned}$$

which completes the proof. \(\square \)
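
For intuition, the correction in the missing-data case can also be checked numerically. The Python sketch below (an illustration under our own assumptions: identity \(\Sigma _x\), independent Bernoulli missingness, hypothetical names) builds M with diagonal entries \(1-\rho \) and off-diagonal entries \((1-\rho )^2\) and verifies that \(\hat{\varGamma }_{\text {mis}}=\frac{Z^\top Z}{n}\oslash M\) is close to \(\Sigma _x\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, rho = 5000, 20, 0.2
Sigma_x = np.eye(d1)

X = rng.standard_normal((n, d1))
Z = X * (rng.random((n, d1)) > rho)     # each entry observed w.p. 1 - rho, zeroed otherwise

M = (1 - rho) ** 2 * np.ones((d1, d1))  # off-diagonal scaling
np.fill_diagonal(M, 1 - rho)            # diagonal scaling

Gamma_mis = (Z.T @ Z / n) / M           # element-wise division (Z^T Z / n) ⊘ M
print(np.linalg.norm(Gamma_mis - Sigma_x, 2))  # small for large n: E[Gamma_mis] = Sigma_x
```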

Proof of Proposition 4

Note that the matrix Z is sub-Gaussian with parameters \((\Sigma _z,{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2)\) [23]. The following discussion is divided into two parts. First consider the quantity \({\left| \left| \left| \hat{\varUpsilon }_{\text {mis}}-\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}\). By the definition of \(\hat{\varUpsilon }_{\text {mis}}\) (cf. (11)) and the fact that \(Y=X\varTheta ^*+\epsilon \), one has that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| {\hat{\varUpsilon }}_{\text {mis}}-\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}&= \frac{1}{1-\rho }{\left| \left| \left| \frac{1}{n}Z^\top Y-(1-\rho )\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}\\&= \frac{1}{1-\rho }{\left| \left| \left| \frac{1}{n}Z^\top (X\varTheta ^*+\epsilon )-(1-\rho )\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}\\&\le \frac{1}{1-\rho }\left( {\left| \left| \left| \left( \frac{1}{n}Z^\top X-(1-\rho )\Sigma _x\right) \varTheta ^* \right| \right| \right| }_\text {op} +{\left| \left| \left| \frac{1}{n}Z^\top \epsilon \right| \right| \right| }_\text {op}\right) . \end{aligned} \end{aligned}$$

It then follows from the assumption that \({\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le \omega \) that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| {\hat{\varUpsilon }}_{\text {mis}}-\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}&\le \frac{1}{1-\rho }\left[ \left( {\left| \left| \left| \frac{1}{n}Z^\top X \right| \right| \right| }_\text {op}+(1-\rho ){\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}\right) {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*+{\left| \left| \left| \frac{1}{n}Z^\top \epsilon \right| \right| \right| }_\text {op}\right] \\&\le \frac{1}{1-\rho }\left[ \left( {\left| \left| \left| \frac{1}{n}Z^\top X \right| \right| \right| }_\text {op}+(1-\rho ){\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}\right) \omega +{\left| \left| \left| \frac{1}{n}Z^\top \epsilon \right| \right| \right| }_\text {op}\right] . \end{aligned} \end{aligned}$$

Recall that the matrices X, W and \(\epsilon \) have i.i.d. rows sampled from the Gaussian distributions \({\mathcal {N}}(0,\Sigma _x)\), \({\mathcal {N}}(0,\sigma _w^2{\mathbb {I}}_{d_1})\) and \({\mathcal {N}}(0,\sigma _\epsilon ^2{\mathbb {I}}_{d_2})\), respectively. Then it follows from [25, Lemma 3] that there exist universal positive constants \((c_3,c_4,c_5)\) such that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| {\hat{\varUpsilon }}_{\text {mis}}-\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}&\le c_3\frac{\sigma _\epsilon }{1-\rho }\sqrt{\lambda _{\max }(\Sigma _z)}\sqrt{\frac{d_1+d_2}{n}}\\&\quad + \left( c_3\frac{{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}}{1-\rho }\sqrt{\lambda _{\max }(\Sigma _z)} \sqrt{\frac{2d_1}{n}}+{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}\right) \omega \end{aligned} \end{aligned}$$
(C.11)

with probability at least \(1-c_4\exp (-c_5\log (\max (d_1,d_2)))\). Now let us consider the quantity \({\left| \left| \left| (\hat{\varGamma }_{\text {mis}}-\Sigma _x)\varTheta ^* \right| \right| \right| }_\text {op}\). By the definition of \(\hat{\varGamma }_{\text {mis}}\) (cf. (11)), one has that

$$\begin{aligned} \begin{aligned} {\left| \left| \left| (\hat{\varGamma }_{\text {mis}}-\Sigma _x)\varTheta ^* \right| \right| \right| }_\text {op}&= {\left| \left| \left| ((\frac{Z^\top Z}{n}-\Sigma _z)\oslash M)\varTheta ^* \right| \right| \right| }_{\text {op}} \le \frac{1}{(1-\rho )^2}{\left| \left| \left| (\frac{Z^\top Z}{n}-\Sigma _z)\varTheta ^* \right| \right| \right| }_{\text {op}}\\&\le \frac{1}{(1-\rho )^2}\left( {\left| \left| \left| \frac{Z^\top Z}{n} \right| \right| \right| }_{\text {op}}+{\left| \left| \left| \Sigma _z \right| \right| \right| }_{\text {op}}\right) \omega . \end{aligned} \end{aligned}$$

This inequality, together with [25, Lemma 3], implies that there exist universal positive constants \((c_6,c_7,c_8)\) such that

$$\begin{aligned} {\left| \left| \left| (\hat{\varGamma }_{\text {mis}}-\Sigma _x)\varTheta ^* \right| \right| \right| }_\text {op}\le c_6\frac{1}{(1-\rho )^2}{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}\sqrt{\lambda _{\max } (\Sigma _z)}\sqrt{\frac{2d_1}{n}}\omega +\frac{1}{(1-\rho )^2}{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}\omega \end{aligned}$$
(C.12)

with probability at least \(1-c_7\exp (-c_8\log (\max (d_1,d_2)))\). Combining (C.11) and (C.12), up to constant factors, we conclude that there exist universal positive constants \((c_0,c_1,c_2)\) such that

$$\begin{aligned} {\left| \left| \left| \hat{\varUpsilon }_{\text {mis}}-\hat{\varGamma }_{\text {mis}} \varTheta ^* \right| \right| \right| }_{\text {op}} \le c_0\frac{\lambda _{\max }(\Sigma _z)}{1-\rho } \left( \frac{\omega }{1-\rho }{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}+ \sigma _\epsilon \right) \sqrt{\frac{\max ( d_1,d_2)}{n}}, \end{aligned}$$

with probability at least \(1-c_1\exp (-c_2\log (\max (d_1,d_2)))\). The proof is complete. \(\square \)
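
Finally, the deviation controlled in Proposition 4 admits the same kind of numerical sanity check. The Python sketch below (again an illustration under our own simplifying assumptions, with hypothetical names; it is not the authors' experiment) forms the missing-data surrogates of (11) and compares the deviation with \(\sqrt{\max (d_1,d_2)/n}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d1, d2, rank, rho, sigma_eps = 5000, 20, 8, 2, 0.2, 0.3

Theta_star = rng.standard_normal((d1, rank)) @ rng.standard_normal((rank, d2))
X = rng.standard_normal((n, d1))
Z = X * (rng.random((n, d1)) > rho)                             # missing-data observations
Y = X @ Theta_star + sigma_eps * rng.standard_normal((n, d2))

M = (1 - rho) ** 2 * np.ones((d1, d1))
np.fill_diagonal(M, 1 - rho)
Gamma_mis = (Z.T @ Z / n) / M              # \hat{Gamma}_mis, cf. (11)
Upsilon_mis = Z.T @ Y / ((1 - rho) * n)    # \hat{Upsilon}_mis, cf. (11)

dev = np.linalg.norm(Upsilon_mis - Gamma_mis @ Theta_star, 2)
print(dev, np.sqrt(max(d1, d2) / n))       # same order as the bound in Proposition 4
```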
