Abstract
Noisy and missing data are unavoidable in real applications, such as bioinformatics, economics and remote sensing. Existing methods mainly focus on linear errors-in-variables regression; relatively little attention has been paid to the case of multivariate responses, and how to achieve consistent estimation under corrupted covariates remains an open question. In this paper, a nonconvex error-corrected estimator is proposed for the matrix estimation problem in the multi-response errors-in-variables regression model. Statistical and computational properties of global solutions of the estimator are analysed. On the statistical side, a nonasymptotic recovery bound is established for all global solutions of the nonconvex estimator. On the computational side, the proximal gradient method is applied to solve the nonconvex optimization problem and is proved to converge linearly to a near-global solution. Sufficient conditions are verified in order to obtain probabilistic consequences for specific types of measurement errors by virtue of random matrix analysis. Finally, simulation results on synthetic and real neuroimaging data demonstrate the theoretical properties and show good consistency under high-dimensional scaling.
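The estimator and the proximal gradient scheme described above can be sketched numerically. The sketch below is illustrative, not the paper's exact formulation: it assumes the additive-noise surrogate \(\hat{\varGamma }_\text {add}=Z^\top Z/n-\Sigma _w\) (consistent with the identity \(\hat{\varGamma }_\text {add}-\Sigma _x=Z^\top Z/n-\Sigma _z\) used in the proof of Proposition 1), a hypothetical surrogate \(\hat{\varUpsilon }_\text {add}=Z^\top Y/n\), a quadratic loss \({\mathcal {L}}(\varTheta )=\frac{1}{2}\langle \langle \varTheta ,\hat{\varGamma }\varTheta \rangle \rangle -\langle \langle \hat{\varUpsilon },\varTheta \rangle \rangle \) in the style of Loh and Wainwright [26], and a nuclear-norm penalty with the side constraint \({\left| \left| \left| \varTheta \right| \right| \right| }_*\le \omega \); all numeric parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def prox_nuclear_ball(s, thresh, omega):
    """Soft-threshold the singular values s at `thresh`, then project the
    result onto {x >= 0 : sum(x) <= omega}; this is the exact prox of
    lam*||.||_* restricted to the nuclear-norm ball of radius omega."""
    w = np.maximum(s - thresh, 0.0)
    if w.sum() <= omega:
        return w
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - omega)[0][-1]
    theta = (css[rho] - omega) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)

def prox_grad(Gamma, Upsilon, lam, omega, step, T=200):
    """Proximal gradient for min 0.5<Theta, Gamma Theta> - <Upsilon, Theta>
    + lam*||Theta||_*  subject to  ||Theta||_* <= omega."""
    Theta = np.zeros(Upsilon.shape)
    for _ in range(T):
        grad = Gamma @ Theta - Upsilon
        U, s, Vt = np.linalg.svd(Theta - step * grad, full_matrices=False)
        Theta = U @ (prox_nuclear_ball(s, step * lam, omega)[:, None] * Vt)
    return Theta

# toy additive-noise data: Z = X + W is observed in place of X, Y = X Theta* + eps
n, d1, d2, sigma_w = 400, 20, 10, 0.2
X = rng.standard_normal((n, d1))
Z = X + sigma_w * rng.standard_normal((n, d1))
Theta_star = rng.standard_normal((d1, 2)) @ rng.standard_normal((2, d2)) / d1  # rank 2
Y = X @ Theta_star + 0.1 * rng.standard_normal((n, d2))

# error-corrected surrogates for the additive-noise case (illustrative formulas)
Gamma_add = Z.T @ Z / n - sigma_w**2 * np.eye(d1)
Upsilon_add = Z.T @ Y / n

omega = 1.5 * np.linalg.norm(Theta_star, "nuc")
Theta_hat = prox_grad(Gamma_add, Upsilon_add, lam=0.05, omega=omega, step=0.2)
rel_err = np.linalg.norm(Theta_hat - Theta_star) / np.linalg.norm(Theta_star)
```

Note that \(\hat{\varGamma }_\text {add}\) may be indefinite in high dimensions, which is what makes the problem nonconvex; the side constraint \({\left| \left| \left| \varTheta \right| \right| \right| }_*\le \omega \) is what keeps the iterates well behaved in that regime.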
References
Agarwal, A., Negahban, S., Wainwright, M.J.: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5), 2452–2482 (2012)
Agarwal, A., Negahban, S., Wainwright, M.J.: Noisy matrix decomposition via convex relaxation: optimal rates in high dimensions. Ann. Stat. 40(2), 1171–1197 (2012)
Agarwal, A., Negahban, S.N., Wainwright, M.J.: Supplementary material: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5), 31–60 (2012)
Alquier, P., Bertin, K., Doukhan, P., Garnier, R.: High-dimensional VAR with low-rank transition. Stat. Comput. 1–15 (2020)
Annaliza, M., Khalili, A., Stephens, D.A.: Estimating sparse networks with hubs. J. Multivar. Anal. 179, 104655 (2020)
Barch, D.M., Burgess, G.C., Harms, M.P., Petersen, S.E., Consortium, W.M.H.: Function in the human connectome: task-fMRI and individual differences in behavior. Neuroimage 80(8), 169–189 (2013)
Belloni, A., Rosenbaum, M., Tsybakov, A.B.: An \(\ell _1, \ell _2, \ell _\infty \)-regularization approach to high-dimensional errors-in-variables models. Electron. J. Stat. 10(2), 1729–1750 (2016)
Belloni, A., Rosenbaum, M., Tsybakov, A.B.: Linear and conic programming estimators in high dimensional errors-in-variables models. J. R. Stat. Soc. Ser. B Stat Methodol. 79(3), 939–956 (2017)
Bickel, P.J., Ritov, Y.: Efficient estimation in the errors in variables model. Ann. Stat. 15(2), 513–540 (1987)
Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)
Candès, E.J., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35(6), 2313–2351 (2007)
Candès, E.J., Tao, T.: The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inf. Theory 56(5), 2053–2080 (2010)
Carroll, R.J., Ruppert, D., Stefanski, L.A., Crainiceanu, C.M.: Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, Cambridge (2006)
Chen, H., Raskutti, G., Yuan, M.: Non-convex projected gradient descent for generalized low-rank tensor regression. J. Mach. Learn. Res. 20, 1–37 (2019)
Chen, Y., Luo, Z., Kong, L.C.: \(\ell _{2,0}\)-norm based selection and estimation for multivariate generalized linear models. J. Multivar. Anal. 185, 104782 (2021)
Datta, A., Zou, H.: CoCoLasso for high-dimensional error-in-variables regression. Ann. Stat. 45(6), 2400–2426 (2017)
Han, R.G., Willett, R., Zhang, A.R.: An optimal statistical and computational framework for generalized tensor estimation. Ann. Stat. 50(1), 1–29 (2022)
Izenman, A.J.: Modern Multivariate Statistical Techniques: Regression, Classification and Manifold Learning. Springer, New York (2008)
Li, M.Y., Li, R.Z., Ma, Y.Y.: Inference in high dimensional linear measurement error models. J. Multivar. Anal. 184, 104759 (2021)
Li, X., Wu, D.Y., Cui, Y., Liu, B., Walter, H., Schumann, G., Li, C., Jiang, T.Z.: Reliable heritability estimation using sparse regularization in ultrahigh dimensional genome-wide association studies. BMC Bioinform. 20(1), 219 (2019)
Li, X., Wu, D.Y., Li, C., Wang, J.H., Yao, J.C.: Sparse recovery via nonconvex regularized M-estimators over \(\ell _q\)-balls. Comput. Stat. Data Anal. 152, 107047 (2020)
Loh, P.L., Wainwright, M.J.: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1637–1664 (2012)
Loh, P.L., Wainwright, M.J.: Supplementary material: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1–21 (2012)
Loh, P.L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16(1), 559–616 (2015)
Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat. 39(2), 1069–1097 (2011)
Negahban, S., Wainwright, M.J.: Restricted strong convexity and weighted matrix completion: optimal bounds with noise. J. Mach. Learn. Res. 13(1), 1665–1697 (2012)
Nesterov, Y.: Gradient methods for minimizing composite objective function. Tech. rep. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE) (2007)
Raskutti, G., Yuan, M., Chen, H.: Convex regularization for high-dimensional multiresponse tensor regression. Ann. Stat. 47(3), 1554–1584 (2019)
Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)
Rosenbaum, M., Tsybakov, A.B.: Sparse recovery under matrix uncertainty. Ann. Stat. 38(5), 2620–2651 (2010)
Rosenbaum, M., Tsybakov, A.B.: Improved matrix uncertainty selector. In: From Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in Honor of Jon A. Wellner, pp. 276–290. Institute of Mathematical Statistics (2013)
Sørensen, Ø., Frigessi, A., Thoresen, M.: Measurement error in LASSO: impact and likelihood bias correction. Stat. Sinica 25, 809–829 (2015)
Sørensen, Ø., Hellton, K.H., Frigessi, A., Thoresen, M.: Covariate selection in high-dimensional generalized linear models with measurement error. J. Comput. Graph. Stat. 27(4), 739–749 (2018)
Wainwright, M.J.: High-Dimensional Statistics: A Non-asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019)
Wang, Z., Paterlini, S., Gao, F.C., Yang, Y.H.: Adaptive minimax regression estimation over sparse \(\ell _q\)-hulls. J. Mach. Learn. Res. 15(1), 1675–1711 (2014)
Wu, D.Y., Li, X., Feng, J.: Connectome-based individual prediction of cognitive behaviors via graph propagation network reveals directed brain network topology. J. Neural Eng. 18(4) (2021)
Wu, J., Zheng, Z.M., Li, Y., Zhang, Y.: Scalable interpretable learning for multi-response error-in-variables regression. J. Multivar. Anal. 104644 (2020)
Zhou, H., Li, L., Zhu, H.T.: Tensor regression with applications in neuroimaging data analysis. J. Am. Stat. Assoc. 108(502), 540–552 (2013)
Zhou, H., Li, L.X.: Regularized matrix regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 76(2), 463–483 (2014)
Acknowledgements
Xin Li’s work was supported in part by the National Natural Science Foundation of China (Grant No. 12201496) and the Natural Science Foundation of Shaanxi Province of China (Grant No. 2022JQ-045). Dongya Wu’s work was supported in part by the National Natural Science Foundation of China (Grant No. 62103329).
Appendices
A Proof of Theorem 1
Set \(\hat{\varDelta }:=\hat{\varTheta }-\varTheta ^*\). By the feasibility of \(\varTheta ^*\) and optimality of \(\hat{\varTheta }\), one has that \(\Psi (\hat{\varTheta })\le \Psi (\varTheta ^*)\). Then it follows from elementary algebra and the triangle inequality that
Applying Hölder’s inequality and by the deviation condition (14), one has that
Combining the above two inequalities, and noting (21), we obtain that
Applying the RSC condition (12) to the left-hand side of (A.1) yields that
On the other hand, by assumptions (20) and (21) and noting the fact that \({\left| \left| \left| \hat{\varDelta } \right| \right| \right| }_*\le {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*+{\left| \left| \left| \hat{\varTheta } \right| \right| \right| }_*\le 2\omega \), the left-hand side of (A.2) is lower bounded as
Combining this inequality with (A.2), one has that
Then it follows from Lemma 1 that there exists a matrix \(\hat{\varDelta }'\) such that
where \(\text {rank}(\hat{\varDelta }')\le 2r\) with r to be chosen later, and the second inequality is due to the fact that \({\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_*\le \sqrt{2r}{\left| \left| \left| \hat{\varDelta }' \right| \right| \right| }_F\). Combining (A.3) and (A.4), we obtain that
Then it follows that
Recall the set \(K_\eta \) defined in (16) and set \(r=|K_\eta |\). Combining (A.5) with (18) and setting \(\eta =\frac{\lambda }{\alpha _1}\), we arrive at (22). Moreover, it follows from (A.4) that (23) holds. The proof is complete.
B Proof of Theorem 2
Before providing the proof of Theorem 2, we first present several useful lemmas.
Lemma B.1
Suppose that the conditions of Theorem 2 are satisfied, and that there exists a pair \((\delta , T)\) such that (28) holds. Then for any iteration \(t\ge T\), it holds that
where \(\bar{\epsilon }_{\text {stat}}\) and \(\epsilon (\delta )\) are defined respectively in (25) and (29).
Proof
We first prove that if \(\lambda \ge 4\phi ({\mathbb {Q}},\sigma _\epsilon )\sqrt{\frac{\max (d_1,d_2)}{n}}\), then for any \(\varTheta \in {\mathbb {S}}\) satisfying
it holds that
Set \(\varDelta :=\varTheta -\varTheta ^*\). From (B.1), one has that
Then subtracting \(\langle \langle \nabla {\mathcal {L}}(\varTheta ^*),\varDelta \rangle \rangle \) from both sides of the former inequality and recalling the formulation of \({\mathcal {L}}(\cdot )\), we obtain that
We now claim that
In fact, combining (B.3) with the RSC condition (12) and Hölder’s inequality, one has that
This inequality, together with the deviation condition (14) and the assumption that \(\lambda \ge 4\phi ({\mathbb {Q}},\sigma _\epsilon )\sqrt{\frac{\max (d_1,d_2)}{n}}\), implies that
Noting the facts that \(\alpha _1>0\) and that \({\left| \left| \left| \varDelta  \right| \right| \right| }_*\le {\left| \left| \left| \varTheta ^* \right| \right| \right| }_*+{\left| \left| \left| \varTheta  \right| \right| \right| }_*\le 2\omega \), one arrives at (B.4) by combining assumptions (30) and (31). On the other hand, it follows from Lemma 1(i) that there exist two matrices \(\varDelta '\) and \(\varDelta ''\) such that \(\varDelta =\varDelta '+\varDelta ''\), where \(\text {rank}(\varDelta ')\le 2r\) with r to be chosen later. Recall the definitions of \({\mathbb {A}}^r\) and \({\mathbb {B}}^r\) given respectively in (15a) and (15b). Then the decomposition \(\varTheta ^*=\Pi _{{\mathbb {A}}^r}(\varTheta ^*)+\Pi _{{\mathbb {B}}^r}(\varTheta ^*)\) holds. This equality, together with the triangle inequality and Lemma 1(i), implies that
Consequently, we have
Combining (B.5) and (B.4) and noting the fact that \({\left| \left| \left| \Pi _{{\mathbb {B}}^r}(\varTheta ^*) \right| \right| \right| }_*=\sum _{j=r+1}^{d}\sigma _j(\varTheta ^*)\), one has that \(0\le \frac{3\lambda }{2}{\left| \left| \left| \varDelta ' \right| \right| \right| }_*-\frac{\lambda }{2}{\left| \left| \left| \varDelta '' \right| \right| \right| }_*+2\lambda \sum _{j=r+1}^{d}\sigma _j(\varTheta ^*)+\delta \), and consequently, \({\left| \left| \left| \varDelta '' \right| \right| \right| }_*\le 3{\left| \left| \left| \varDelta ' \right| \right| \right| }_*+4\sum _{j=r+1}^{d}\sigma _j(\varTheta ^*)+\frac{2\delta }{\lambda }\). Using the trivial bound \({\left| \left| \left| \varDelta \right| \right| \right| }_*\le 2\omega \), one has that
Recall the set \(K_\eta \) defined in (16) and set \(r=|K_\eta |\). Combining (B.6) with (18) and setting \(\eta =\lambda \), we arrive at (B.2). We now verify that (B.1) holds for the matrices \(\hat{\varTheta }\) and \(\varTheta ^t\), respectively. Since \(\hat{\varTheta }\) is the optimal solution, it holds that \(\Psi (\hat{\varTheta })\le \Psi (\varTheta ^*)\), and by assumption (28), it holds that \(\Psi (\varTheta ^t)\le \Psi (\hat{\varTheta })+\delta \le \Psi (\varTheta ^*)+\delta \). Consequently, it follows from (B.2) that
By the triangle inequality, we then obtain that
The proof is complete. \(\square \)
Lemma B.2
Suppose that the conditions of Theorem 2 are satisfied, and that there exists a pair \((\delta , T)\) such that (28) holds. Then for any iteration \(t\ge T\), we have that
where \(\bar{\epsilon }_{\text {stat}}\), \(\epsilon (\delta )\), \(\kappa \) and \(\xi \) are defined in (25), (29), (26) and (27), respectively.
Proof
By the RSC condition (12), one has that
It then follows from Lemma B.1 and the assumption that \(\lambda \ge \left( \frac{128\tau R_q}{\alpha _1}\right) ^{1/q}\) that
which establishes (B.7). Furthermore, by the convexity of \({\left| \left| \left| \cdot \right| \right| \right| }_*\), let \(g\in \partial \left( {\left| \left| \left| \hat{\varTheta } \right| \right| \right| }_*\right) \). Then one has that
and by the first-order optimality condition for \(\hat{\varTheta }\), one has that
Combining (B.10), (B.11) and (B.12), we obtain that
Then applying Lemma B.1 to bound the term \({\left| \left| \left| \varTheta ^t-\hat{\varTheta } \right| \right| \right| }_*^2\) and noting the assumption that \(\lambda \ge \left( \frac{128\tau R_q}{\alpha _1}\right) ^{1/q}\), we arrive at (B.8). Now we turn to prove (B.9). Define
which is the objective function minimized over the feasible region \({\mathbb {S}}=\{\varTheta \big |{\left| \left| \left| \varTheta \right| \right| \right| }_*\le \omega \}\) at iteration count t. For any \(a\in [0,1]\), it is easy to check that the matrix \(\varTheta _a=a\hat{\varTheta }+(1-a)\varTheta ^t\) belongs to \({\mathbb {S}}\) due to the convexity of \({\mathbb {S}}\). Since \(\varTheta ^{t+1}\) is the optimal solution of the optimization problem (24), we have that
where the last inequality is from the convexity of \({\left| \left| \left| \cdot \right| \right| \right| }_*\). Then by (B.7), one sees that
Applying the RSM condition (13) on the matrix \(\varTheta ^{t+1}-\varTheta ^t\) with some algebra, we have by assumption \(v\ge \alpha _2\) that
Adding \(\lambda {\left| \left| \left| \varTheta ^{t+1} \right| \right| \right| }_*\) to both sides of the former inequality, we obtain that
This, together with (B.13), implies that
Define \(\varDelta ^t:=\varTheta ^t-\hat{\varTheta }\). Then it follows directly that \({\left| \left| \left| \varTheta ^{t+1}-\varTheta ^t \right| \right| \right| }_*^2\le ({\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*+{\left| \left| \left| \varDelta ^t \right| \right| \right| }_*)^2 \le 2{\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*^2+2{\left| \left| \left| \varDelta ^t \right| \right| \right| }_*^2\). Combining this inequality with (B.14), one has that
To simplify the notation, define \(\psi :=\tau (\bar{\epsilon }_{\text {stat}}+\epsilon (\delta ))^2\), \(\zeta :=\tau \lambda ^{-q}R_q\) and \(\delta _t:=\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\). Using Lemma B.1 to bound the terms \({\left| \left| \left| \varDelta ^{t+1} \right| \right| \right| }_*^2\) and \({\left| \left| \left| \varDelta ^t \right| \right| \right| }_*^2\), we arrive at
Subtracting \(\Psi (\hat{\varTheta })\) from both sides of (B.15), one has by (B.8) that
Setting \(a=\frac{\alpha _1}{4v}\in (0,1)\), it follows from the former inequality that
or equivalently, \(\delta _{t+1}\le \kappa \delta _t+\xi (\bar{\epsilon }_{\text {stat}} +\epsilon (\delta ))^2\), where \(\kappa \) and \(\xi \) were previously defined in (26) and (27), respectively. Finally, we conclude that
The proof is complete. \(\square \)
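The contraction \(\delta _{t+1}\le \kappa \delta _t+\xi (\bar{\epsilon }_{\text {stat}} +\epsilon (\delta ))^2\) established above is what drives the epoch argument in the proof of Theorem 2: the optimization error decays geometrically down to a floor of the order of the statistical error. A minimal numeric sketch of this recursion, with hypothetical values of \(\kappa \), \(\xi \) and the error term:

```python
# Contraction from Lemma B.2: delta_{t+1} = kappa*delta_t + c, where
# c stands in for xi*(eps_stat + eps(delta))**2. The values kappa = 0.5
# and c = 0.2 are hypothetical.
kappa, c = 0.5, 0.2
delta = 100.0                    # initial optimization error Psi(Theta^0) - Psi(Theta_hat)
history = [delta]
for _ in range(60):
    delta = kappa * delta + c    # one step of the recursion
    history.append(delta)
error_floor = c / (1.0 - kappa)  # geometric convergence down to this floor
```

After 60 steps the iterate sits on the floor \(c/(1-\kappa )\) up to a factor of \(\kappa ^{60}\), which mirrors the statement that the algorithm converges linearly to a near-global solution rather than to the global optimum itself.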
By virtue of the above lemmas, we are now ready to prove Theorem 2. The proof mainly follows the arguments in [1, 21].
We first prove the following inequality:
Divide iterations \(t=0,1,\ldots \) into a sequence of disjoint epochs \([T_k,T_{k+1}]\) and define the associated sequence of tolerances \(\delta _0>\delta _1>\cdots \) such that
as well as the corresponding error term \(\epsilon _k:=2\min \left\{ \frac{\delta _k}{\lambda },\omega \right\} \). The values of \(\{(\delta _k,T_k)\}_{k\ge 1}\) will be decided later. Then at the first iteration, Lemma B.2 (cf. (B.9)) is applied with \(\epsilon _0=2\omega \) and \(T_0=0\) to conclude that
Set \(\delta _1:=\frac{4\xi }{1-\kappa }(\bar{\epsilon }_{\text {stat}}^2 +4\omega ^2)\). Noting that \(\kappa \in (0,1)\) by assumption, it follows from (B.17) that for \(T_1:=\lceil \frac{\log (2\delta _0/\delta _1)}{\log (1/\kappa )} \rceil \),
For \(k\ge 1\), define
Then Lemma B.2 (cf. (B.9)) is applied to conclude that for all \(t\ge T_k\),
which further implies that
From (B.18), one obtains the recursion for \(\{(\delta _k,T_k)\}_{k=0}^\infty \) as follows
Then by [3, Section 7.2], it is easy to see that (B.19a) implies that
Now let us show how to determine the smallest k such that \(\delta _k\le \delta ^*\) by using (B.20). If we are in the first epoch, (B.16) clearly holds due to (B.19a). Otherwise, from (B.19b), one sees that \(\delta _k\le \delta ^*\) holds after at most
epochs. Combining the above bound on \(k(\delta ^*)\) with (B.19b), one obtains that \(\Psi (\varTheta ^t)-\Psi (\hat{\varTheta })\le \delta ^*\) holds for all iterations
which establishes (B.16). Finally, as (B.16) is proved, one has by (B.8) in Lemma B.2 and the assumption that \(\lambda \ge \left( \frac{128\tau R_q}{\alpha _1}\right) ^{1/q}\) that, for any \(t\ge T(\delta ^*)\),
Consequently, it follows from (30) and (31) that, for any \(t\ge T(\delta ^*)\),
The proof is complete.
C Proofs of Sect. 4
In this section, several important technical lemmas are provided first, which are used to verify the RSC/RSM conditions and deviation conditions for specific errors-in-variables models (cf. Propositions 1–4). Some notation is needed to ease the exposition. For a symbol \(x\in \{0,*,F\}\) and a positive real number \(r\in {\mathbb {R}}^+\), define \({\mathbb {M}}_x(r):=\{A\in {\mathbb {R}}^{d_1\times d_2}|{\left| \left| \left| A \right| \right| \right| }_x\le r\}\), where \({\left| \left| \left| A \right| \right| \right| }_0\) denotes the rank of matrix A. Then define the sparse set
and the cone set
The following lemma tells us that the intersection of the matrix \(\ell _1\)-ball with the matrix \(\ell _2\)-ball can be bounded by virtue of a simpler set.
Lemma C.1
For any constant \(r\ge 1\), it holds that
where \(cl \{\cdot \}\) and \(conv \{\cdot \}\) denote the topological closure and convex hull, respectively.
Proof
Note that when \(r> \min \{d_1,d_2\}\), this containment is trivial, since the right-hand set is equal to \({\mathbb {M}}_F(2)\) and the left-hand set is contained in \({\mathbb {M}}_F(1)\). Thus, we will assume \(1\le r\le \min \{d_1,d_2\}\).
Let \(A\in {\mathbb {M}}_*(\sqrt{r})\cap {\mathbb {M}}_F(1)\). Then it follows that \({\left| \left| \left| A \right| \right| \right| }_*\le \sqrt{r}\) and \({\left| \left| \left| A \right| \right| \right| }_F\le 1\). Consider a singular value decomposition of A: \(A=UDV^\top \), where \(U\in {\mathbb {R}}^{d_1\times d_1}\) and \(V\in {\mathbb {R}}^{d_2\times d_2}\) are orthogonal matrices, and \(D\in {\mathbb {R}}^{d_1\times d_2}\) consists of \(\sigma _1(D),\sigma _2(D),\ldots ,\sigma _k(D)\) on the “diagonal” and 0 elsewhere with \(k=\text {rank}(A)\). Write \(D=\text {diag}(\sigma _1(D),\sigma _2(D),\ldots ,\sigma _k(D),0,\ldots ,0)\), and use \(\text {vec}(D)\) to denote the vectorized form of the matrix D. Then it follows that \(\Vert \text {vec}(D)\Vert _1\le \sqrt{r}\) and \(\Vert \text {vec}(D)\Vert _2\le 1\). Partition the support of \(\text {vec}(D)\) into disjoint subsets \(T_1,T_2,\ldots \), such that \(T_1\) is the index set corresponding to the first r largest elements in absolute value of \(\text {vec}(D)\), \(T_2\) indexes the next r largest elements, and so on. Write \(D_i=\text {diag}(\text {vec}(D)_{T_i})\), and \(A_i=UD_iV^\top \). Then one has that \({\left| \left| \left| A_i \right| \right| \right| }_0=\text {rank}(A_i)\le r\) and \({\left| \left| \left| A_i \right| \right| \right| }_F\le 1\). Write \(B_i=2A_i/{\left| \left| \left| A_i \right| \right| \right| }_F\) and \(t_i={\left| \left| \left| A_i \right| \right| \right| }_F/2\). Then it holds that \(B_i\in 2\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\) and \(t_i\ge 0\). Now it suffices to check that A can be expressed as a convex combination of matrices in \(2\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\), namely \(A=\sum _{i\ge 1}t_iB_i\). Since the zero matrix is contained in \(2\{{\mathbb {M}}_0(r)\cap {\mathbb {M}}_F(1)\}\), it suffices to show that \(\sum _{i\ge 1}t_i\le 1\), which is equivalent to \(\sum _{i\ge 1}\Vert \text {vec}(D)_{T_i}\Vert _2\le 2\).
To prove this, first note that \(\Vert \text {vec}(D)_{T_1}\Vert _2\le \Vert \text {vec}(D)\Vert _2\). Second, note that for \(i\ge 2\), each element of \(\text {vec}(D)_{T_i}\) is bounded in magnitude by \(\Vert \text {vec}(D)_{T_{i-1}}\Vert _1/r\), and thus \(\Vert \text {vec}(D)_{T_i}\Vert _2\le \Vert \text {vec}(D)_{T_{i-1}}\Vert _1/\sqrt{r}\). Combining these two facts, one has that
The proof is complete. \(\square \)
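The two elementary facts in the last step of the proof can be checked numerically. The sketch below uses a hypothetical decaying vector in place of \(\text {vec}(D)\) and verifies the resulting bound \(\sum _{i\ge 1}\Vert \text {vec}(D)_{T_i}\Vert _2\le \Vert \text {vec}(D)\Vert _2+\Vert \text {vec}(D)\Vert _1/\sqrt{r}\le 2\):

```python
import numpy as np

# decaying stand-in for the sorted singular values vec(D), normalized so ||v||_2 = 1
v = 1.0 / np.arange(1, 101)
v = v / np.linalg.norm(v)
r = int(np.ceil(np.linalg.norm(v, 1) ** 2))   # chosen so that ||v||_1 <= sqrt(r)

# partition the (already sorted) entries into consecutive blocks T_1, T_2, ... of size r
block_norms = [np.linalg.norm(v[i:i + r]) for i in range(0, len(v), r)]

total = sum(block_norms)
bound = np.linalg.norm(v) + np.linalg.norm(v, 1) / np.sqrt(r)  # <= 1 + 1 = 2
```

The first block contributes at most \(\Vert v\Vert _2\) and the remaining blocks telescope against \(\Vert v\Vert _1/\sqrt{r}\), exactly as in the proof.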
Lemma C.2
Let \(r\ge 1\), \(\delta >0\) be two constants, and \(\varGamma \in {\mathbb {R}}^{d_1\times d_1}\) be a fixed matrix. Suppose that the following condition holds
Then we have that
Proof
We begin by establishing the inequalities
where \({\mathbb {C}}(r)\) is defined in (C.2). Then (C.4) follows directly.
Now we turn to prove (C.5). By rescaling, (C.5a) holds if one can check that
It then follows from Lemma C.1 and continuity that (C.6) can be reduced to the problem of proving that
For this purpose, consider a weighted linear combination of the form \(\varDelta =\sum _it_i\varDelta _i\), with weights \(t_i\ge 0\) such that \(\sum _it_i=1\), \({\left| \left| \left| \varDelta _i \right| \right| \right| }_0\le r\), and \({\left| \left| \left| \varDelta _i \right| \right| \right| }_F\le 2\) for each i. Then one has that
On the other hand, it holds that for all i, j
Noting that \(\frac{1}{2}\varDelta _i\), \(\frac{1}{2}\varDelta _j\), \(\frac{1}{4}(\varDelta _i+\varDelta _j)\) all belong to \({\mathbb {K}}(2r)\), and then combining (C.7) with (C.3), we have that
for all i, j, and thus \(\langle \langle \varGamma \varDelta ,\varDelta \rangle \rangle \le \sum _{i,j}t_it_j(12\delta )=12\delta (\sum _it_i)^2=12\delta \), which establishes (C.5a). As for inequality (C.5b), note that for \(\varDelta \notin {\mathbb {C}}(r)\), one has that
where the first inequality follows by the substitution \(U=\sqrt{r}\frac{\varDelta }{{\left| \left| \left| \varDelta \right| \right| \right| }_*}\), and the second inequality is due to the same argument used to prove (C.5a) as \(U\in {\mathbb {C}}(r)\). Rearranging (C.8) yields (C.5b). The proof is complete. \(\square \)
Lemma C.3
Let \(r\ge 1\) be any constant. Suppose that \(\hat{\varGamma }\in {\mathbb {R}}^{d_1\times d_1}\) is an estimator of \(\Sigma _x\) satisfying
Then we have that
Proof
Set \(\varGamma =\hat{\varGamma }-\Sigma _x\) and \(\delta =\frac{\lambda _{\min }(\Sigma _x)}{24}\). Then Lemma C.2 is applicable, and we conclude that
which implies that
Then the conclusion follows from the fact that \(\lambda _{\min }(\Sigma _x){\left| \left| \left| \varDelta \right| \right| \right| }_F^2\le \langle \langle \Sigma _x\varDelta ,\varDelta \rangle \rangle \le \lambda _{\max }(\Sigma _x){\left| \left| \left| \varDelta \right| \right| \right| }_F^2\). The proof is complete. \(\square \)
Lemma C.4
Let \(t>0\) be any constant, and \(X\in {\mathbb {R}}^{n\times d_1}\) be a zero-mean sub-Gaussian matrix with parameters \((\Sigma _x,\sigma _x^2)\). Then for any fixed matrix \(\varDelta \in {\mathbb {R}}^{d_1\times d_2}\), there exists a universal positive constant c such that
Proof
By the definition of matrix Frobenius norm, one has that
Then it follows from elementary probability theory that
On the other hand, note the assumption that X is a sub-Gaussian matrix with parameters \((\Sigma _x,\sigma _x^2)\). Then [23, Lemma 14] is applicable, and we conclude that there exists a universal positive constant c such that
which completes the proof. \(\square \)
Lemma C.5
Let \(t>0\), \(r\ge 1\) be any constants, and \(X\in {\mathbb {R}}^{n\times d_1}\) be a zero-mean sub-Gaussian matrix with parameters \((\Sigma _x,\sigma _x^2)\). Then there exists a universal positive constant c such that
Proof
For an index set \(J\subseteq \{1,2,\ldots ,\min \{d_1,d_2\}\}\), we define the set \(S_J=\{\varDelta \in {\mathbb {R}}^{d_1\times d_2}\big |{\left| \left| \left| \varDelta  \right| \right| \right| }_F\le 1, \text {supp}(\text {vec}(\sigma (\varDelta )))\subseteq J\}\), where \(\text {vec}(\sigma (\varDelta ))\) denotes the vector of singular values of the matrix \(\varDelta \). Then it is easy to see that \({\mathbb {K}}(2r)=\cup _{|J|\le 2r}S_J\). Let \(G=\{U_1,U_2,\ldots ,U_m\}\) be a 1/3-cover of \(S_J\); then for every \(\varDelta \in S_J\), there exists some \(U_i\) such that \({\left| \left| \left| \tilde{\varDelta } \right| \right| \right| }_F\le 1/3\), where \(\tilde{\varDelta }=\varDelta -U_i\). It then follows from [4, Section 7.2] that one can construct G with \(|G|\le 27^{4r\max (d_1,d_2)}\). Define \(\Psi (\varDelta _1,\varDelta _2)=\langle \langle (\frac{X^\top X}{n}-\Sigma _x)\varDelta _1,\varDelta _2\rangle \rangle \). Then one has that
It then follows from the fact that \(3\tilde{\varDelta }\in S_J\) that
and hence, \(\sup _{\varDelta \in S_J}|\Psi (\varDelta ,\varDelta )|\le \frac{9}{2}\max _i|\Psi (U_i,U_i)|\). By Lemma C.4 and a union bound, one has that there exists a universal positive constant \(c'\) such that
Finally, taking a union bound over the \(\binom{\min (d_1,d_2)}{\lfloor 2r\rfloor }\le (\min (d_1,d_2))^{2r}\) choices of the set J yields that there exists a universal positive constant c such that
The proof is complete. \(\square \)
By virtue of the above lemmas, we are now ready to prove Propositions 1–4.
Proof of Proposition 1
Set
with \(c'>0\) being chosen sufficiently small so that \(r\ge 1\). Then noting that \(\hat{\varGamma }_\text {add}-\Sigma _x=\frac{Z^\top Z}{n}-\Sigma _z\) and by Lemma C.3, one sees that it suffices to show that
holds with high probability. Let \(D(r):=\sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle (\frac{Z^\top Z}{n}-\Sigma _z)\varDelta ,\varDelta \rangle \rangle \Big |\) for simplicity. Note that the matrix Z is sub-Gaussian with parameters \((\Sigma _z,\sigma _z^2)\). Then it follows from Lemma C.5 that there exists a universal positive constant \(c''\) such that
This inequality, together with (C.9), implies that there exist universal positive constants \((c_0,c_1)\) such that \(\tau =c_0\tau _\text {add}\), and
which completes the proof. \(\square \)
Proof of Proposition 2
By the definitions of \(\hat{\varGamma }_{\text {add}}\) and \(\hat{\varUpsilon }_{\text {add}}\) (cf. (10)), one has that
where the second inequality is from the fact that \(Y=X\varTheta ^*+\epsilon \), and the third inequality is due to the triangle inequality. Recall the assumption that the matrices X, W and \(\epsilon \) are assumed to be with i.i.d. rows sampled from Gaussian distributions \({\mathcal {N}}(0,\Sigma _x)\), \({\mathcal {N}}(0,\sigma _w^2{\mathbb {I}}_{d_1})\) and \({\mathcal {N}}(0,\sigma _\epsilon ^2{\mathbb {I}}_{d_2})\), respectively. Then one has that \(\Sigma _w=\sigma _w^2{\mathbb {I}}_{d_1}\) and \({\left| \left| \left| \Sigma _w \right| \right| \right| }_\text {op}=\sigma _w\). It follows from [25, Lemma 3] that there exist universal positive constant \((c_3,c_4,c_5)\) such that
with probability at least \(1-c_4\exp (-c_5\log (\max (d_1,d_2))\). Recall that the nuclear norm of \(\varTheta ^*\) is assumed to be bounded as \({\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le \omega \). Then up to constant factors, we conclude that there exists universal positive constants \((c_0,c_1,c_2)\) such that
with probability at least \(1-c_1\exp (-c_2\log (\max (d_1,d_2))\). The proof is complete. \(\square \)
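Proposition 2 bounds the deviation \({\left| \left| \left| \hat{\varUpsilon }_{\text {add}}-\hat{\varGamma }_{\text {add}}\varTheta ^* \right| \right| \right| }_\text {op}\) by a multiple of \(\sqrt{\max (d_1,d_2)/n}\). The scaling can be checked empirically; the surrogate formulas \(\hat{\varGamma }_{\text {add}}=Z^\top Z/n-\sigma _w^2{\mathbb {I}}_{d_1}\) and \(\hat{\varUpsilon }_{\text {add}}=Z^\top Y/n\), the choice of \(\varTheta ^*\), and all numeric parameters below are illustrative assumptions:

```python
import numpy as np

def deviation(n, rng, d1=15, d2=8, sigma_w=0.3, sigma_e=0.2):
    """One draw of ||Upsilon_add - Gamma_add @ Theta_star||_op for sample size n."""
    X = rng.standard_normal((n, d1))
    Z = X + sigma_w * rng.standard_normal((n, d1))      # additive measurement error
    Theta_star = np.ones((d1, d2)) / d1                 # a fixed rank-one target
    Y = X @ Theta_star + sigma_e * rng.standard_normal((n, d2))
    Gamma = Z.T @ Z / n - sigma_w**2 * np.eye(d1)       # corrected Gram surrogate
    Upsilon = Z.T @ Y / n
    return np.linalg.norm(Upsilon - Gamma @ Theta_star, 2)

rng = np.random.default_rng(2)
dev_small = deviation(200, rng)
dev_large = deviation(20000, rng)  # 100x the samples: deviation should shrink ~10x
```

With \(d_1\) and \(d_2\) fixed, increasing n by a factor of 100 should shrink the operator-norm deviation by roughly \(\sqrt{100}=10\), matching the \(\sqrt{\max (d_1,d_2)/n}\) rate.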
Proof of Proposition 3
This proof is similar to that of Proposition 1 in the additive noise case. Set \(\sigma ^2=\frac{{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2}{(1-\rho )^2}\), and
with \(c'>0\) being chosen sufficiently small so that \(r\ge 1\). Recall that \(\oslash \) denotes element-wise division and note that
and thus
By Lemma C.3, one sees that it suffices to show that
holds with high probability. Let \(D(r):=\sup _{\varDelta \in {\mathbb {K}}(2r)}\Big |\langle \langle ((\frac{Z^\top Z}{n}-\Sigma _z)\oslash M) \varDelta ,\varDelta \rangle \rangle \Big |\) for simplicity. On the other hand, one has that
Note that the matrix Z is sub-Gaussian with parameters \((\Sigma _z,{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2)\) [23]. Then it follows from Lemma C.5 that there exists a universal positive constant \(c''\) such that
This inequality, together with (C.10), implies that there exist universal positive constants \((c_0,c_1)\) such that \(\tau =c_0\tau _\text {mis}\), and
which completes the proof. \(\square \)
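For the missing-data model of Proposition 3, the exact definition (11) is not reproduced here, so the construction below is an assumption in the spirit of [26]: zero-fill the unobserved entries of Z and rescale the Gram matrix element-wise by a matrix M with \(M_{jj}=1-\rho \) and \(M_{jk}=(1-\rho )^2\) for \(j\ne k\), which makes \(\hat{\varGamma }_{\text {mis}}=(Z^\top Z/n)\oslash M\) an unbiased estimate of \(\Sigma _x\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, rho = 20000, 5, 0.3

Sigma_x = 0.5 * np.eye(d) + 0.5            # equicorrelated true covariance (hypothetical)
X = rng.multivariate_normal(np.zeros(d), Sigma_x, size=n)
Z = X * (rng.random((n, d)) > rho)         # entries missing independently w.p. rho, zero-filled

# element-wise rescaling matrix M: (1 - rho) on the diagonal, (1 - rho)^2 off it
M = (1 - rho) ** 2 * np.ones((d, d))
np.fill_diagonal(M, 1 - rho)

Gamma_mis = (Z.T @ Z / n) / M              # the element-wise division written as ⊘ above
dev = np.linalg.norm(Gamma_mis - Sigma_x, 2)
dev_naive = np.linalg.norm(Z.T @ Z / n - Sigma_x, 2)   # uncorrected Gram matrix
```

The uncorrected Gram matrix is biased by a factor of \((1-\rho )\) per observed coordinate, so its deviation from \(\Sigma _x\) does not vanish as \(n\) grows, while the rescaled surrogate's does.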
Proof of Proposition 4
Note that the matrix Z is sub-Gaussian with parameters \((\Sigma _z,{\left| \left| \left| \Sigma _x \right| \right| \right| }_\text {op}^2)\) [23]. The following discussion is divided into two parts. First consider the quantity \({\left| \left| \left| \hat{\varUpsilon }_{\text {mis}}-\Sigma _x\varTheta ^* \right| \right| \right| }_\text {op}\). By the definition of \(\hat{\varUpsilon }_{\text {mis}}\) (cf. (11)) and the fact that \(Y=X\varTheta ^*+\epsilon \), one has that
It then follows from the assumption that \({\left| \left| \left| \varTheta ^* \right| \right| \right| }_*\le \omega \) that
Recall the assumption that the matrices X, W and \(\epsilon \) are assumed to be with i.i.d. rows sampled from Gaussian distributions \({\mathcal {N}}(0,\Sigma _x)\), \({\mathcal {N}}(0,\sigma _w^2{\mathbb {I}}_{d_1})\) and \({\mathcal {N}}(0,\sigma _\epsilon ^2{\mathbb {I}}_{d_2})\), respectively. Then it follows from [25, Lemma 3] that there exist universal positive constants \((c_3,c_4,c_5)\) such that
with probability at least \(1-c_4\exp (-c_5\log (\max (d_1,d_2)))\). Now let us consider the quantity \({\left| \left| \left| (\hat{\varGamma }_{\text {mis}}-\Sigma _x)\varTheta ^* \right| \right| \right| }_\text {op}\). By the definition of \(\hat{\varGamma }_{\text {mis}}\) (cf. (11)), one has that
This inequality, together with [25, Lemma 3], implies that there exist universal positive constants \((c_6,c_7,c_8)\) such that
with probability at least \(1-c_7\exp (-c_8\log (\max (d_1,d_2)))\). Combining (C.11) and (C.12), up to constant factors, we conclude that there exist universal positive constants \((c_0,c_1,c_2)\) such that
with probability at least \(1-c_1\exp (-c_2\log (\max (d_1,d_2)))\). The proof is complete. \(\square \)
Cite this article
Li, X., Wu, D. Low-rank matrix estimation via nonconvex optimization methods in multi-response errors-in-variables regression. J Glob Optim 88, 79–114 (2024). https://doi.org/10.1007/s10898-023-01293-w