Efficient and provable online reduced rank regression via online gradient descent

Liu, Xiao; Liu, Weidong; Mao, Xiaojun

doi:10.1007/s10994-024-06622-y

Efficient and provable online reduced rank regression via online gradient descent

Published: 23 September 2024

Volume 113, pages 8711–8748, (2024)
Cite this article

Machine Learning Aims and scope Submit manuscript

373 Accesses
1 Altmetric
Explore all metrics

Abstract

The Reduced Rank Regression (RRR) model is frequently employed in machine learning. It increases efficiency and interpretability by adding a low-rank restriction to the coefficient matrix, which can be used to cut down on the number of parameters. In this paper, we study the RRR issue in an online setting. Only a small batch of data can be utilized each time, arriving in a stream. Previous analogous methods have relied on conventional least squares estimation, which is inefficient and does not theoretically guarantee convergence rate or build connections with offline strategies. We proposed an efficient online RRR algorithm based on non-convex online gradient descent. More importantly, based on a constant order batch size and appropriate initialization, we theoretically prove the convergence result of the mean estimation error generated by our algorithm. Our result achieves an optimal rate of up to a logarithmic factor. We also propose an accelerated version of our algorithm. Our methods compete with the existing method in terms of accuracy and calculation speed in numerical simulations and real applications.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Provable accelerated gradient method for nonconvex low rank optimization

Article 26 June 2019

Online optimization for max-norm regularization

Article 07 February 2017

Provably Accelerating Ill-Conditioned Low-Rank Estimation via Scaled Gradient Descent, Even with Overparameterization

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Availability of data and materials

Only public datasets, in https://stats.oecd.org/ and https://www.nytimes.com/article/coronavirus-county-data-us.html, are used.

Code availability

The demo codes of our simulations and real applications can be found at https://github.com/shawstat/ORRR_code_and_data.git.

Notes

References

Arce, P., Antognini, J., Kristjanpoller, W., & Salinas, L. (2015). An online vector error correction model for exchange rates forecasting. In: Proceedings of the international conference on pattern recognition applications and methods (pp. 193–200). https://doi.org/10.5220/0005205901930200
Balzano, L., Nowak, R., & Recht, B. (2010). Online identification and tracking of subspaces from highly incomplete information. In: 2010 48th annual Allerton conference on communication, control, and computing (Allerton) (pp. 704–711). IEEE. https://doi.org/10.1109/ALLERTON.2010.5706976
Bunea, F., She, Y., & Wegkamp, M. H. (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, 39(2), 1282–1309. https://doi.org/10.1214/11-AOS876
Article MathSciNet MATH Google Scholar
Bunea, F., She, Y., & Wegkamp, M. H. (2012). Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. The Annals of Statistics, 40(5), 2359–2388. https://doi.org/10.1214/12-AOS1039
Article MathSciNet MATH Google Scholar
Candes, E. J., & Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4), 2342–2359. https://doi.org/10.1109/TIT.2011.2111771
Article MathSciNet MATH Google Scholar
Chen, J., Liu, D., & Li, X. (2020). Nonconvex rectangular matrix completion via gradient descent without $l_{2,\infty }$ regularization. IEEE Transactions on Information Theory, 66(9), 5806–5841. https://doi.org/10.1109/TIT.2020.2992234
Article MathSciNet MATH Google Scholar
Chen, K., Dong, H., & Chan, K. S. (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4), 901–920. https://doi.org/10.1093/biomet/ast036
Article MathSciNet MATH Google Scholar
Chen, L., & Huang, J. Z. (2012). Sparse reduced-rank regression for simultaneous dimension reduction and variable selection. Journal of the American Statistical Association, 107(500), 1533–1545. https://doi.org/10.1080/01621459.2012.734178
Article MathSciNet MATH Google Scholar
Chen, X., Lai, Z., Li, H., & Zhang, Y. (2024). Online statistical inference for stochastic optimization via Kiefer–Wolfowitz methods. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.1933498
Article MathSciNet MATH Google Scholar
Chen, X., Liu, W., & Mao, X. (2022). Robust reduced rank regression in a distributed setting. Science China Mathematics, 65, 1707–1730. https://doi.org/10.1007/s11425-020-1785-0
Article MathSciNet MATH Google Scholar
Chen, Y., Chi, Y., Fan, J., Ma, C., & Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal on Optimization, 30(4), 3098–3121. https://doi.org/10.1137/19M1290000
Article MathSciNet MATH Google Scholar
De Lamare, R. C., & Sampaio-Neto, R. (2007). Reduced-rank adaptive filtering based on joint iterative optimization of adaptive filters. IEEE Signal Processing Letters, 14(12), 980–983. https://doi.org/10.1109/LSP.2007.907995
Article MATH Google Scholar
De Lamare, R. C., & Sampaio-Neto, R. (2009). Adaptive reduced-rank processing based on joint and iterative interpolation, decimation, and filtering. IEEE Transactions on Signal Processing, 57, 2503–2514. https://doi.org/10.1109/TSP.2009.2018641
Article MathSciNet MATH Google Scholar
De Lamare, R. C., & Sampaio-Neto, R. (2009). Reduced-rank space-time adaptive interference suppression with joint iterative least squares algorithms for spread-spectrum systems. IEEE Transactions on Vehicular Technology, 59, 1217–1228. https://doi.org/10.1109/TVT.2009.2038391
Article MATH Google Scholar
Dubois, B., Delmas, J. F., & Obozinski, G. (2019). Fast algorithms for sparse reduced-rank regression. In: The 22nd international conference on artificial intelligence and statistics (pp. 2415–2424). PMLR.
Ghadimi, E., Feyzmahdavian, H. R., & Johansson, M. (2015). Global convergence of the heavy-ball method for convex optimization. In 2015 European control conference (ECC) (pp. 310–315). IEEE. https://doi.org/10.1109/ECC.2015.7330562
Hazan, E. (2016). Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4), 157–325. https://doi.org/10.1561/9781680831719
Article MATH Google Scholar
Hazan, E., Rakhlin, A., & Bartlett, P. (2007). Adaptive online gradient descent. In Advances in neural information processing systems (Vol. 20, pp 1–8). Curran Associates, Inc.
Herbster, M., Pasteris, S., & Tse, L. (2020). Online matrix completion with side information. Advances in Neural Information Processing Systems, 33, 20402–20414.
Article MATH Google Scholar
Honig, M. L., & Goldstein, J. S. (2002). Adaptive reduced-rank interference suppression based on the multistage wiener filter. IEEE Transactions on Communications, 50, 986–994. https://doi.org/10.1109/TCOMM.2002.1010618
Article MATH Google Scholar
Hua, Y., Nikpour, M., & Stoica, P. (2001). Optimal reduced-rank estimation and filtering. IEEE Transactions on Signal Processing, 49(3), 457–469. https://doi.org/10.1109/78.905856
Article MATH Google Scholar
Huang, D., & Torre, F. D. l. (2010). Bilinear kernel reduced rank regression for facial expression synthesis. In European conference on computer vision (pp. 364–377). Springer. https://doi.org/10.1007/978-3-642-15552-9_27
Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2), 248–264. https://doi.org/10.1016/0047-259X(75)90042-1
Article MathSciNet MATH Google Scholar
Jin, C., Kakade, S. M., & Netrapalli, P. (2016). Provable efficient online matrix completion via non-convex stochastic gradient descent. Advances in Neural Information Processing Systems, 29, 4520–4528. https://doi.org/10.5555/3157382.3157603
Article MATH Google Scholar
Kidambi, R., Netrapalli, P., Jain, P., & Kakade, S. (2018). On the insufficiency of existing momentum schemes for stochastic optimization. In: 2018 Information Theory and Applications workshop (ITA) (pp. 1–9). IEEE. https://doi.org/10.1109/ITA.2018.8503173
Kingma, D. P., & Ba, J. (2015). International Conference on Learning Representations (ICLR) (pp. 1–13).
Kushner, H., & Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications (Vol. 35). Springer.
MATH Google Scholar
Liu, W., Liu, G., & Tang, Y. (2022). Robust sparse reduced-rank regression with response dependency. Symmetry, 14(8), 1617–1629. https://doi.org/10.3390/sym14081617
Article MATH Google Scholar
Liu, Y., Gao, Y., & Yin, W. (2020). An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems, 33, 18261–18271. https://doi.org/10.48550/arXiv.2007.07989
Article MATH Google Scholar
Lois, B., & Vaswani, N. (2015). Online matrix completion and online robust PCA. In 2015 IEEE International Symposium on Information Theory (ISIT) (pp. 1826–1830). IEEE. https://doi.org/10.1109/ISIT.2015.7282771
Ma, C., Wang, K., Chi, Y., & Chen, Y. (2018). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In: International conference on machine learning (pp. 3345–3354). PMLR. https://doi.org/10.1007/s10208-019-09429-9
Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1), 19–60. https://doi.org/10.1145/1756006.1756008
Article MathSciNet MATH Google Scholar
Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. https://doi.org/10.1007/s10107-012-0629-5
Article MathSciNet MATH Google Scholar
Nicoli, M., & Spagnolini, U. (2005). Reduced-rank channel estimation for time-slotted mobile communication systems. IEEE Transactions on Signal Processing, 53(3), 926–944. https://doi.org/10.1109/TSP.2004.842191
Article MathSciNet MATH Google Scholar
Park, D., Kyrillidis, A., Caramanis, C., & Sanghavi, S. (2018). Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4), 2165–2204. https://doi.org/10.1137/17M1150189
Article MathSciNet MATH Google Scholar
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. Ussr Computational Mathematics and Mathematical Physics, 4(5), 1–17. https://doi.org/10.1016/0041-5553(64)90137-5
Article MATH Google Scholar
Qian, H., & Batalama, S. N. (2003). Data record-based criteria for the selection of an auxiliary vector estimator of the MMSE/MVDR filter. IEEE Transactions on Communications, 51, 1700–1708. https://doi.org/10.1109/TCOMM.2003.818089
Article MATH Google Scholar
Qiu, C., Vaswani, N., Lois, B., et al. (2014). Recursive robust PCA or recursive sparse recovery in large but structured noise. IEEE Transactions on Information Theory, 60(8), 5007–5039. https://doi.org/10.1109/ICASSP.2013.6638807
Article MathSciNet MATH Google Scholar
Robinson, P. (1974). Identification, estimation and large-sample theory for regressions containing unobservable variables. International Economic Review, 680–692.
Scharf, L. L. (1991). The SVD and reduced rank signal processing. Signal Processing, 25, 113–133. https://doi.org/10.1016/0165-1684(91)90058-Q
Article MATH Google Scholar
Shalev-Shwartz, S. (2012). Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2), 107–194. https://doi.org/10.1561/9781601985477
Article MATH Google Scholar
She, Y. (2017). Selective factor extraction in high dimensions. Biometrika, 104(1), 97–110.
MathSciNet MATH Google Scholar
She, Y., & Tran, H. (2019). On cross-validation for sparse reduced rank regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 81(1), 145–161. https://doi.org/10.1111/rssb.12295
Article MathSciNet MATH Google Scholar
Tan, K. M., Sun, Q., & Witten, D. (2022). Sparse reduced-rank Huber regression in high dimensions. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2022.2050243
Article MATH Google Scholar
Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., & Recht, B. (2016) Low-rank solutions of linear matrix equations via procrustes flow. In International conference on machine learning (pp. 964–973). PMLR.
Velu, R., & Reinsel, G. C. (2013). Multivariate reduced-rank regression: Theory and applications (Vol. 136). Springer. https://doi.org/10.1007/978-1-4757-2853-8
Book MATH Google Scholar
Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed sensing: Theory and applications (pp. 210—268). Cambridge University Press. https://doi.org/10.1017/CBO9780511794308.006
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint (Vol. 48). Cambridge University Press. https://doi.org/10.1017/9781108627771
Book MATH Google Scholar
Wang, L., Zhang, X., & Gu, Q. (2017). A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial intelligence and statistics (pp. 981–990). PMLR.
Yang, Y. F., & Zhao, Z. (2020). Online robust reduced-rank regression. In 2020 IEEE 11th sensor array and multichannel signal processing workshop (SAM) (pp. 1–5). IEEE. https://doi.org/10.1109/SAM48682.2020.9104268
Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
Article MathSciNet MATH Google Scholar
Zhao, R., Tan, V., & Xu, H. (2017). Online nonnegative matrix factorization with general divergences. In Artificial intelligence and statistics (pp. 37–45). PMLR.
Zhao, Z., & Palomar, D. P. (2018). Mean-reverting portfolio with budget constraint. IEEE Transactions on Signal Processing, 66(9), 2342–2357. https://doi.org/10.1109/TSP.2018.2799193
Article MathSciNet MATH Google Scholar
Zhao, Z., & Palomar, D. P. (2018b). Sparse reduced rank regression with nonconvex regularization. In 2018 IEEE statistical signal processing workshop (SSP) (pp. 811–815). IEEE. https://doi.org/10.1109/SSP.2018.8450724
Zhao, Z., Zhou, R., & Palomar, D. P. (2019). Optimal mean-reverting portfolio with leverage constraint for statistical arbitrage in finance. IEEE Transactions on Signal Processing, 67(7), 1681–1695. https://doi.org/10.1109/TSP.2019.2893862
Article MathSciNet MATH Google Scholar
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (pp. 928–936). PMLR.

Download references

Funding

Weidong Liu’s research is supported by NSFC Grant No. 11825104. Xiaojun Mao’s research is supported by NSFC Grant No. 12422111 and 12371273, the Shanghai Rising-Star Program 23QA1404600 and Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).

Author information

Authors and Affiliations

School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
Xiao Liu, Weidong Liu & Xiaojun Mao
MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China
Weidong Liu
Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai, 200240, China
Xiaojun Mao

Authors

Xiao Liu
View author publications
You can also search for this author in PubMed Google Scholar
Weidong Liu
View author publications
You can also search for this author in PubMed Google Scholar
Xiaojun Mao
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Xiao Liu developed the theory, performed the computations, wrote the original preparation draft, and edited the writing. Weidong Liu and Xiaojun Mao conceived the presented idea, verified the analytical methods, supervised the findings of this work, and reviewed and edited the writing. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Weidong Liu or Xiaojun Mao.

Ethics declarations

Conflict of interest

There is no Conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Paolo Frasconi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: The Proof of Lemma 1

For convenience, we divide the direction matrix ${\textbf{V}}$ into two parts by rows. Let ${\textbf{V}}_{A}\in {\mathbb {R}}^{p\times r}$ contain the first p rows of ${\textbf{V}}$ and ${\textbf{V}}_{B}\in {\mathbb {R}}^{q\times r}$ be consist of the rest q rows. By algebraic calculation, the Hessian of $f^{(t)}$ with respect to $[{\textbf{A}}^{\top },{\textbf{B}}^{\top }]^{\top }$ can be expressed as:

$$\begin{aligned} \begin{aligned}&\mathbf {(}{\textbf{V}})^{\top }\nabla ^2_{{\textbf{F}}}f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\mathbf {(}{\textbf{V}})\\&=\Vert {\textbf{V}}_A{\textbf{B}}^{\top }{\textbf{X}}_t+{\textbf{A}}{\textbf{V}}_B^{\top }{\textbf{X}}_t\Vert _{F}^2+\langle {\textbf{A}}{\textbf{B}}^{\top }{\textbf{X}}_t-{\textbf{Y}}_t,{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle +\langle {\textbf{A}}{\textbf{B}}^{\top }{\textbf{X}}_t-{\textbf{Y}}_t,{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle \\&\quad +\langle (\varvec{\mu }-\varvec{\mu }_{*}){\varvec{1}}_m^{\top },{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle +\langle (\varvec{\mu }-\varvec{\mu }_{*}){\varvec{1}}_m^{\top },{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle \\&\quad +\frac{1}{2}\langle {\textbf{A}}^{\top }{\textbf{A}}-{\textbf{B}}^{\top }{\textbf{B}},{\textbf{V}}_A^{\top }{\textbf{V}}_A-{\textbf{V}}_B^{\top }{\textbf{V}}_B\rangle +\frac{1}{4}\Vert {\textbf{V}}_A^{\top }{\textbf{A}}+{\textbf{A}}^{\top }{\textbf{V}}_A-{\textbf{B}}^{\top }{\textbf{V}}_{B}-{\textbf{V}}_{B}^{\top }{\textbf{B}}\Vert _{F}^2. \end{aligned} \end{aligned}$$

(15)

Then we can prove Lemma 1.

Proof

Define $\varvec{\Delta }_A = {\textbf{A}}-{\textbf{A}}_{*}, \varvec{\Delta }_B={\textbf{B}}-{\textbf{B}}_{*}$. The Hessian (15) can be reformulated as following:

$$\begin{aligned} \begin{aligned}&\mathbf {(}{\textbf{V}})^{\top }\nabla ^2_{{\textbf{F}}}f^{(t)}({\textbf{A}},{\textbf{B}})\mathbf {(}{\textbf{V}})\\&=\Vert {\textbf{V}}_A\varvec{\Delta }_{B}^{\top }{\textbf{X}}_t+{\textbf{V}}_A{\textbf{B}}_{*}^{\top }{\textbf{X}}_t+\varvec{\Delta }_{A}{\textbf{V}}_B^{\top }{\textbf{X}}_t+{\textbf{A}}_{*}{\textbf{V}}_B^{\top }{\textbf{X}}_t\Vert _{F}^2\\&\quad +\langle ({\textbf{A}}_{*}\varvec{\Delta }_{B}^{\top }+\varvec{\Delta }_{A}{\textbf{B}}_{*}^{\top }+\varvec{\Delta }_{A}\varvec{\Delta }_{B}^{\top }){\textbf{X}}_t,{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle +\langle ({\textbf{A}}_{*}\varvec{\Delta }_{B}^{\top }+\varvec{\Delta }_{A}{\textbf{B}}_{*}^{\top }+\varvec{\Delta }_{A}\varvec{\Delta }_{B}^{\top }),{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle \\&\quad +\frac{1}{2}\langle {\textbf{A}}_{*}^{\top }\varvec{\Delta }_A+\varvec{\Delta }_A^{\top }\varvec{\Delta }_A+\varvec{\Delta }_A^{\top }{\textbf{A}}_{*}-{\textbf{B}}_{*}^{\top }\varvec{\Delta }_B-\varvec{\Delta }_B^{\top }\varvec{\Delta }_B-\varvec{\Delta }_B^{\top }{\textbf{B}}_{*},{\textbf{V}}_A^{\top }{\textbf{V}}_A-{\textbf{V}}_B^{\top }{\textbf{V}}_B\rangle \\&\quad +\frac{1}{4}\Vert {\textbf{V}}_A^{\top }{\textbf{A}}_{*}+{\textbf{V}}_A^{\top }\varvec{\Delta }_A+{\textbf{A}}_{*}^{\top }{\textbf{V}}_{A}+\varvec{\Delta }_{A}^{\top }{\textbf{V}}_{A}-{\textbf{V}}_B^{\top }{\textbf{B}}_{*}-{\textbf{V}}_B^{\top }\varvec{\Delta }_B-{\textbf{B}}_{*}^{\top }{\textbf{V}}_{B}-\varvec{\Delta }_{B}^{\top }{\textbf{V}}_{B}\Vert _{F}^2\\&\quad +\underbrace{\langle (\varvec{\mu }-\varvec{\mu }_{*}){\varvec{1}}_m^{\top },{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle +\langle (\varvec{\mu }-\varvec{\mu }_{*}){\varvec{1}}_m^{\top },{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle }_{\xi _4}\\&\quad +\underbrace{\langle {\mathcal {E}}_t, {\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle +\langle {\mathcal {E}}_t, {\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle }_{\xi _5}\\&=\Vert {\textbf{V}}_{A}{\textbf{B}}_{*}^{\top }{\textbf{X}}_t+{\textbf{A}}_{*}{\textbf{V}}_{B}^{\top }{\textbf{X}}_t\Vert _{F}^2+\frac{1}{4}\Vert {\textbf{V}}_A^{\top }{\textbf{A}}_{*}+{\textbf{A}}_{*}^{\top }{\textbf{V}}_{A}-{\textbf{V}}_B^{\top }{\textbf{B}}_{*}-{\textbf{B}}_{*}^{\top }{\textbf{V}}_{B}\Vert _{F}^2+\xi _1+\xi _4+\xi _5. \end{aligned} \end{aligned}$$

Here $\xi _1$ is the term containing $(\varvec{\Delta }_A,\varvec{\Delta }_B)$:

$$\begin{aligned} \begin{aligned} \xi _1&=\Vert {\textbf{V}}_{A}\varvec{\Delta }_B^{\top }{\textbf{X}}_t+\varvec{\Delta }_{B}{\textbf{V}}_{B}^{\top }{\textbf{X}}_t\Vert _{F}^2+2\langle {\textbf{V}}_A\varvec{\Delta }_{B}^{\top }{\textbf{X}}_t+\varvec{\Delta }_A{\textbf{V}}_B^{\top }{\textbf{X}}_t,{\textbf{V}}_A{\textbf{B}}_{*}^{\top }{\textbf{X}}_t+{\textbf{A}}_{*}{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle \\&\quad +\langle ({\textbf{A}}_{*}\varvec{\Delta }_{B}^{\top }+\varvec{\Delta }_{A}{\textbf{B}}_{*}^{\top }+\varvec{\Delta }_{A}\varvec{\Delta }_{B}^{\top }){\textbf{X}}_t,{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle +\langle ({\textbf{A}}_{*}\varvec{\Delta }_{B}^{\top }+\varvec{\Delta }_{A}{\textbf{B}}_{*}^{\top }+\varvec{\Delta }_{A}\varvec{\Delta }_{B}^{\top }),{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle \\&\quad +\frac{1}{2}\langle {\textbf{A}}_{*}^{\top }\varvec{\Delta }_A+\varvec{\Delta }_A^{\top }\varvec{\Delta }_A+\varvec{\Delta }_A^{\top }{\textbf{A}}_{*}-{\textbf{B}}_{*}^{\top }\varvec{\Delta }_B-\varvec{\Delta }_B^{\top }\varvec{\Delta }_B-\varvec{\Delta }_B^{\top }{\textbf{B}}_{*},{\textbf{V}}_A^{\top }{\textbf{V}}_A-{\textbf{V}}_B^{\top }{\textbf{V}}_B\rangle \\&\quad +\frac{1}{4}\Vert {\textbf{V}}_A^{\top }\varvec{\Delta }_A+\varvec{\Delta }_A^{\top }{\textbf{V}}_A-{\textbf{V}}_B^{\top }\varvec{\Delta }_B-\varvec{\Delta }_B^{\top }{\textbf{V}}_{B}\Vert _{F}^2\\&\quad +\frac{1}{2}\langle {\textbf{V}}_A^{\top }\varvec{\Delta }_A+\varvec{\Delta }_A^{\top }{\textbf{V}}_A-{\textbf{V}}_B^{\top }\varvec{\Delta }_B-\varvec{\Delta }_B^{\top }{\textbf{V}}_{B},{\textbf{V}}_A^{\top }{\textbf{A}}_{*}+{\textbf{A}}_{*}^{\top }{\textbf{V}}_A-{\textbf{V}}_B^{\top }{\textbf{B}}_{*}-{\textbf{B}}_{*}^{\top }{\textbf{V}}_{B}\rangle . \end{aligned} \end{aligned}$$

We then replace ${\textbf{A}}_{*},{\textbf{B}}_{*}$ with ${\textbf{A}}_{*}-{\textbf{A}}_2+{\textbf{A}}_2,{\textbf{B}}_{*}-{\textbf{B}}_2+{\textbf{B}}_2$. The Hessian of $f^{(t)}$ can be written as:

$$\begin{aligned} \begin{aligned}&\mathbf {(}{\textbf{V}})^{\top }\nabla ^2_{{\textbf{F}}}f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\mathbf {(}{\textbf{V}})\\&=\Vert {\textbf{V}}_{A}{\textbf{B}}_{*}^{\top }{\textbf{X}}_{t}\Vert _{F}^{2}+\Vert {\textbf{A}}_{*}{\textbf{V}}_{B}^{\top }{\textbf{X}}_t\Vert _{F}^2+\frac{1}{2}\Vert {\textbf{V}}_{A}^{\top }{\textbf{A}}_{*}-{\textbf{V}}_{B}^{\top }{\textbf{B}}_{*}\Vert _{F}^2\\&\quad +\frac{1}{2}\langle {\textbf{A}}_2^{\top }{\textbf{V}}_A+{\textbf{B}}_2^{\top }{\textbf{V}}_{B},{\textbf{V}}_A^{\top }{\textbf{A}}_2+{\textbf{V}}_B^{\top }{\textbf{B}}_2\rangle +\xi _1+\xi _2+\xi _3+\xi _4+\xi _5\\&=\Vert {\textbf{V}}_{A}{\textbf{B}}_{*}^{\top }{\textbf{X}}_{t}\Vert _{F}^{2}+\Vert {\textbf{A}}_{*}{\textbf{V}}_{B}^{\top }{\textbf{X}}_t\Vert _{F}^2+\frac{1}{2}\Vert {\textbf{V}}_{A}^{\top }{\textbf{A}}_{*}-{\textbf{V}}_{B}^{\top }{\textbf{B}}_{*}\Vert _{F}^2+\frac{1}{2}\Vert {\textbf{A}}_2^{\top }{\textbf{V}}_A+{\textbf{B}}_2^{\top }{\textbf{V}}_{B}\Vert _{F}^2\\&\quad +\xi _1+\xi _2+\xi _3+\xi _4+\xi _5. \end{aligned} \end{aligned}$$

(16)

Here

$$\begin{aligned} \begin{aligned} \xi _2&=2\left[ \langle {\textbf{V}}_A{\textbf{B}}_{*}^{\top }{\textbf{X}}_t,{\textbf{A}}_{*}{\textbf{V}}_{B}^{\top }{\textbf{X}}_t\rangle -\langle {\textbf{V}}_A{\textbf{B}}_{*}^{\top },{\textbf{A}}_{*}{\textbf{V}}_{B}^{\top }\rangle \right] ,\\ \xi _3&= \frac{1}{2}\langle ({\textbf{A}}_2-{\textbf{A}}_{*})^{\top }{\textbf{V}}_{A}+({\textbf{B}}_2-{\textbf{B}}_{*})^{\top }{\textbf{V}}_{B},{\textbf{V}}_{A}^{\top }{\textbf{A}}_2+{\textbf{V}}_{B}^{\top }{\textbf{B}}_2\rangle \\&\quad +\frac{1}{2}\langle {\textbf{A}}_2^{\top }{\textbf{V}}_{A}+{\textbf{B}}_2^{\top }{\textbf{V}}_{B},{\textbf{V}}_{A}^{\top }({\textbf{A}}_{*}-{\textbf{A}}_2)+{\textbf{V}}_{B}^{\top }({\textbf{B}}_{*}-{\textbf{B}}_2)\rangle \\&\quad +\frac{1}{2}\langle ({\textbf{A}}_2-{\textbf{A}}_{*})^{\top }{\textbf{V}}_{A}+({\textbf{B}}_2-{\textbf{B}}_{*})^{\top }{\textbf{V}}_{B},{\textbf{V}}_{A}^{\top }({\textbf{A}}_{*}-{\textbf{A}}_2)+{\textbf{V}}_{B}^{\top }({\textbf{B}}_{*}-{\textbf{B}}_2)\rangle . \end{aligned} \end{aligned}$$

According to the form of ${\textbf{V}}$ in (8) and Lemma 35 in Ma et al. (2018), we can conclude that ${\textbf{A}}_2^{\top }{\textbf{V}}_A+{\textbf{B}}_2^{\top }{\textbf{V}}_{B}$ is symmetric, which is why the last equality of (16) holds. By Cauchy-Schwarz inequality and basic inequalities of the spectral norm, we have:

$$\begin{aligned} \begin{aligned}&|\xi _1+\xi _2+\xi _3|\le 10\left[ (\Vert \varvec{\Delta }_A\Vert +\Vert \varvec{\Delta }_{B}\Vert )\Vert {\textbf{X}}_t\Vert (\Vert {\textbf{A}}_{*}\Vert +\Vert {\textbf{B}}_{*}\Vert )+(\Vert \varvec{\Delta }_A\Vert +\varvec{\Delta }_{B})\Vert )^2\Vert {\textbf{X}}_t\Vert ^2\right] \Vert {\textbf{V}}\Vert _{F}^2\\&\quad +10\left[ (\Vert {\textbf{A}}_2-{\textbf{A}}_{*}\Vert +\Vert {\textbf{B}}_2-{\textbf{B}}_{*}\Vert )(\Vert {\textbf{A}}_{*}\Vert +\Vert {\textbf{B}}_{*}\Vert )+(\Vert {\textbf{A}}_2-{\textbf{A}}_{*}\Vert +\Vert {\textbf{B}}_2-{\textbf{B}}_{*}\Vert )^2\right] \Vert {\textbf{V}}\Vert _{F}^2,\\&|\xi _4|\le |\langle (\varvec{\mu }-\varvec{\mu }_{*}){\varvec{1}}_m^{\top },{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle |+|\langle (\varvec{\mu }-\varvec{\mu }_{*}){\varvec{1}}_m^{\top },{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle |\\&\le 2\Vert \varvec{\mu }-\varvec{\mu }_*\Vert _2\times \sqrt{m}\Vert {\textbf{X}}_t\Vert \Vert {\textbf{V}}\Vert _{F}^2,\\&|\xi _5|\le |\langle {\mathcal {E}}_t,{\textbf{V}}_A{\textbf{V}}_B^{\top }\rangle |+|\langle {\mathcal {E}}_t,{\textbf{V}}_A{\textbf{V}}_B^{\top }{\textbf{X}}_t\rangle |\le 2\Vert {\mathcal {E}}_t\Vert \times \Vert {\textbf{X}}_t\Vert \Vert {\textbf{V}}\Vert _{F}^2.\\ \end{aligned} \end{aligned}$$

According to the Theorem 6.1 (Theorem 6.5 for the sub-Gaussian case) in Wainwright (2019) and the restrictions of our region i.e. (9), with probability $1-O(e^{-m})$ we have:

$$\begin{aligned} \begin{aligned} \Vert {\textbf{X}}_t\Vert&\le 4\sqrt{m\Vert \varvec{\Sigma }_x\Vert }, \\ \left\| \left[ \begin{array}{c}{\textbf{A}}-{\textbf{A}}_{*}\\ {\textbf{B}}-{\textbf{B}}_{*}\end{array}\right] \right\|&\le \frac{\sqrt{\sigma _{\max }({\textbf{C}}_{*})}}{400\kappa _{c}\kappa _{x}},\left\| \left[ \begin{array}{c}{\textbf{A}}_2-{\textbf{A}}_{*}\\ {\textbf{B}}_2-{\textbf{B}}_{*}\end{array}\right] \right\| \le \frac{\sqrt{\sigma _{\max }({\textbf{C}}_{*})}}{400\kappa _{c}\kappa _{x}}. \end{aligned} \end{aligned}$$

Under Assumption 4 we can conclude:

$$\begin{aligned} |\xi _1+\xi _2+\xi _3+\xi _4+\xi _5|\le \frac{\sigma _{\max }({\textbf{C}}_{*})\sigma _{\max }(\varvec{\Sigma }_x)m}{10\kappa _c\kappa _x}\Vert {\textbf{V}}\Vert _{F}^2 = \frac{m}{10}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_x)\Vert {\textbf{V}}\Vert _{F}^2. \end{aligned}$$

(17)

For the lower bound, notice that:

$$\begin{aligned} \begin{aligned} \Vert {\textbf{V}}_{A}{\textbf{B}}_{*}^{\top }{\textbf{X}}_t\Vert _{F}^2&\ge \sigma _{\min }({\textbf{C}}_{*})^2\sigma _{\min }({\textbf{X}}_t)^2\Vert {\textbf{V}}_{A}\Vert _{F}^2\ge \frac{m}{2}\sigma _{\min }({\textbf{C}}_{*})^2\sigma _{\min }(\varvec{\Sigma }_x)^2\Vert {\textbf{V}}_{A}\Vert _{F}^2,\\ \Vert {\textbf{A}}_{*}{\textbf{V}}_{B}^{\top }{\textbf{X}}_t\Vert _{F}^2&\ge \sigma _{\min }({\textbf{C}}_{*})^2\sigma _{\min }({\textbf{X}}_t)^2\Vert {\textbf{V}}_{B}\Vert _{F}^2\ge \frac{m}{2}\sigma _{\min }({\textbf{C}}_{*})^2\sigma _{\min }(\varvec{\Sigma }_x)^2\Vert {\textbf{V}}_{B}\Vert _{F}^2. \end{aligned} \end{aligned}$$

(18)

Under the Assumptions 1 and 2 the inequalities in (18) hold with probability $1-O(e^{-m})$ because Theorem 6.1 (Theorem 6.5 for the sub-Gaussian case) in Wainwright (2019) and the fact that $m\gtrsim {\text {tr}}(\varvec{\Sigma }_x)/\sigma _{\min }(\varvec{\Sigma }_x)$. When we focus on the small-batch-size regime, instead, we need to use Theorem 5.58 in Vershynin (2012) under Assumptions 1’ and 2’. Combine (16), (17) and (18), we then have:

$$\begin{aligned} \mathbf {(}{\textbf{V}})^{\top }\nabla ^2_{{\textbf{F}}} f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\mathbf {(}{\textbf{V}})\ge \frac{m}{5}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})\Vert {\textbf{V}}\Vert _{F}^2. \end{aligned}$$

In a similar way, we can prove $\Vert \nabla ^2_{{\textbf{F}}} f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\Vert \le 5\,m\sigma _{\max }({\textbf{C}}_{*})\sigma _{\max }(\varvec{\Sigma }_{x})$ by the upper bound of $\mathbf {(}{\textbf{V}})^{\top }\nabla ^2_{{\textbf{F}}}f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\mathbf {(}{\textbf{V}})$ and the definition of spectral norm. $\square$

Appendix 2: The Proof of Lemma 2

The proof of Lemma 2 is completed in an induction manner and can be decomposed into three steps: the error contraction of ${\textbf{F}}_{*}=[{\textbf{A}}_{*}^{\top },{\textbf{B}}_{*}^{\top }]$, the error contraction of $\varvec{\mu }_{*}$, and the properties of initial values.

1.1 Step 1: The error contraction of ${\textbf{A}}_{}$ and ${\textbf{B}}_{}$

Proposition 1

Under Assumptions 1–5, there exists an event which is independent of t and has probability $1-O(e^{-m})$, such that when

$$\begin{aligned} \begin{aligned} \Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}&\le \rho _1^{(t)}\Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert _{F}+C_0\frac{\sigma \sqrt{r\sigma _{\max }(\varvec{\Sigma }_x)}}{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_x)}\Vert {\textbf{F}}_{*}\Vert _{F},\\ \Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert&\lesssim \frac{1}{\kappa _c\kappa _x}\Vert {\textbf{F}}_{*}\Vert ,\Vert \varvec{\mu }_t-\varvec{\mu }_{*}\Vert _2\lesssim \frac{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})}{\sqrt{\sigma _{\max }(\varvec{\Sigma }_x)}}, \end{aligned} \end{aligned}$$

(19)

hold for the t th iteration, then we have:

$$\begin{aligned} \begin{aligned} \Vert {\textbf{F}}_{t+1}{\textbf{H}}_{t+1}-{\textbf{F}}_{*}\Vert _{F}&\le \rho _1^{(t+1)}\Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert _{F}+C_0\frac{\sigma \sqrt{r\sigma _{\max }(\varvec{\Sigma }_x)}}{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_x)}\Vert {\textbf{F}}_{*}\Vert _{F}. \end{aligned} \end{aligned}$$

provided $\rho _1^{(t)}=\prod _{i=1}^{t}[1-(1/10)m\eta _c^{(i)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]<1$.

Proof

For convenience, we denote:

$$\begin{aligned} {\textbf{F}}_t = \left[ \begin{array}{c}{\textbf{A}}_t\\ {\textbf{B}}_t\end{array}\right] ,{\textbf{F}}_* = \left[ \begin{array}{c}{\textbf{A}}_*\\ {\textbf{B}}_*\end{array}\right] ,\nabla _{{\textbf{F}}}f^{(t)}({\textbf{F}}_t,\varvec{\mu }_t)= \left[ \begin{array}{c}\nabla _{{\textbf{A}}} f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t)\\ \nabla _{{\textbf{B}}} f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t) \end{array}\right] . \end{aligned}$$

By definition of ${\textbf{H}}_t$ and the updating rule in Algorithm 1, we have:

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{F}}_{t+1}{\textbf{H}}_{t+1}-{\textbf{F}}_{*}\Vert _{F} \le \Vert {\textbf{F}}_{t+1}{\textbf{H}}_{t}-{\textbf{F}}_{*}\Vert _{F}= \Vert ({\textbf{F}}_t-\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t)){\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}\\&=\Vert {\textbf{F}}_t{\textbf{H}}_t-\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)-{\textbf{F}}_{*}\Vert _{F}\\&\le \underbrace{\Vert {\textbf{F}}_t{\textbf{H}}_t-\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)+\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}({\textbf{F}}_*,\varvec{\mu }_{*})-{\textbf{F}}_{*}\Vert _{F}}_{\alpha _1}\\ &+\underbrace{\eta _c^{(t)}\Vert \nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_{*},{\textbf{F}}_{*})\Vert _{F}}_{\alpha _2}. \end{aligned} \end{aligned}$$

The second equality holds because ${\textbf{H}}_t$ is orthogonal. For $\alpha _1$, notice that

$$\begin{aligned} \begin{aligned}&\mathbf {(}{\textbf{F}}_t{\textbf{H}}_t-\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)+\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_{*},{\textbf{F}}_*)-{\textbf{F}}_{*})\\ &= \left[ {\textbf{I}}_{p+ q}-\eta _c^{(t)}\underbrace{\int _{0}^{1}\nabla ^2_{F} f^{(t)}(\varvec{\mu }(\theta ),{\textbf{F}}(\theta ))d\theta }_{={\textbf{M}}}\right] \mathbf {(}{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}), \end{aligned} \end{aligned}$$

here ${\textbf{I}}_{p+q}$ is $(p+q)$ dimensional identity matrix and ${\textbf{F}}(\theta )={\textbf{F}}_{*}+\theta ({\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}),\varvec{\mu }(\theta )=\varvec{\mu }_{*}+\theta (\varvec{\mu }_t-\varvec{\mu }_{*})\text { for }\theta \in [0,1]$. Then we have:

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{F}}_t{\textbf{H}}_t-\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)+\eta _c^{(t)}\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_{*},{\textbf{F}}_*)-{\textbf{F}}_{*}\Vert _{F}^2\\&=\mathbf {(}{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*})^{\top }({\textbf{I}}_{p+q}-\eta _c^{(t)}{\textbf{M}})^2\mathbf {(}{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*})\\ \le&\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2+\eta _c^{(t)2}\Vert {\textbf{M}}\Vert ^2\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2-\frac{2m\eta _c^{(t)}}{5}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2\\ \le&\left[ 1+25\eta _c^{(t)2} m^2\sigma _{\max }({\textbf{C}}_{*})^2\sigma _{\max }(\varvec{\Sigma }_{x})^2-\frac{2m}{5}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})\right] \Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2. \end{aligned} \end{aligned}$$

The final inequality holds when we let ${\textbf{V}}={\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*},{\textbf{F}}={\textbf{F}}(\theta ),\varvec{\mu }=\varvec{\mu }(\theta )$ in Lemma 1. Provided $\eta _c^{(t)}\lesssim 1/(m\sigma _{\max }({\textbf{C}}_{*})\sigma _{\max }(\varvec{\Sigma }_{x})\kappa _x\kappa _c)$, we have $\alpha _1\le [1-(1/10)m\eta _c^{(t)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}$.

For $\alpha _2$, we denote ${\mathcal {E}}_t = [\varvec{\epsilon }_{t_1},\cdots ,\varvec{\epsilon }_{t_m}]$ and notice that:

$$\begin{aligned} \begin{aligned} \nabla _{{\textbf{F}}}f^{(t)}({\textbf{F}}_{*},\varvec{\mu }_{*})&=\left[ \begin{array}{c}{\mathcal {E}}_t{\textbf{X}}_t^{\top }{\textbf{B}}_{*}\\ {\textbf{X}}_t{\mathcal {E}}_t^{\top }{\textbf{A}}_{*}\end{array}\right] ,\\ \Vert \nabla _{{\textbf{F}}}f^{(t)}({\textbf{F}}_{*},\varvec{\mu }_{*})\Vert&\le \Vert {\mathcal {E}}_t{\textbf{X}}_t{\textbf{B}}_{*}\Vert +\Vert {\textbf{X}}_t{\mathcal {E}}_t^{\top }{\textbf{A}}_{*}\Vert \\&\le 2\Vert {\mathcal {E}}_t\Vert \cdot \Vert {\textbf{X}}_t\Vert \Vert {\textbf{F}}_{*}\Vert _{F}\\&\lesssim \sigma \sqrt{m}\cdot \sqrt{m}\sqrt{\sigma _{\max }(\varvec{\Sigma }_x)}\Vert {\textbf{F}}_{*}\Vert _{F}=m\sigma \sqrt{\sigma _{\max }(\varvec{\Sigma }_x)}\Vert {\textbf{F}}_{*}\Vert _{F}. \end{aligned} \end{aligned}$$

The final inequality holds because of the Assumption 3 and the basic properties for sub-Gaussian random variables. In conclusion, there exists constant $C_0,{\widetilde{C}}$ such that:

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{F}}_{t+1}{\textbf{H}}_{t+1}-{\textbf{F}}_{*}\Vert _{F}\\&\le [1-\frac{1}{10}m\eta _c^{(t)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]\Vert {\textbf{F}}_{t}{\textbf{H}}_{t}-{\textbf{F}}_{*}\Vert _{F}+{\widetilde{C}}\eta _c^{(t)} m\sigma \sqrt{r\sigma _{\max }(C_{*})}\Vert {\textbf{F}}_{*}\Vert _{F}\\&\le [1-\frac{1}{10}m\eta _c^{(t)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]\left[ \rho _{1}^{(t)}\Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert _{F}+C_0\frac{\sigma \sqrt{r\sigma _{\max }(\varvec{\Sigma }_x)}}{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})}\Vert {\textbf{F}}_{*}\Vert _{F}\right] \\&+{\widetilde{C}}\eta _c^{(t)} m\sigma \sqrt{r\sigma _{\max }(\varvec{\Sigma }_x)}\Vert {\textbf{F}}_{*}\Vert _{F}\\&\le \rho _1^{(t+1)}\Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert _{F}+C_0\frac{\sigma \sqrt{r\sigma _{\max }(\varvec{\Sigma }_x)}}{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})}\Vert {\textbf{F}}_{*}\Vert _{F}, \end{aligned} \end{aligned}$$

holds with probability $1-O(e^{-m})$, provided $C_0>>10{\widetilde{C}}$ and $\rho _1^{(t)}=\prod _{i=1}^{t}[1-(1/10)m\eta _c^{(i)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]<1$. $\square$

1.2 Step 2: The error contraction of $\varvec{\mu }_{*}$

Because $f^{(t)}$ is a strongly convex function with respect to $\varvec{\mu }$, the error contraction of $\varvec{\mu }_{*}$ is relatively easy. We proposed the following proposition utilizing traditional analysis for the gradient descent algorithm.

Proposition 2

Under Assumptions 3 to 5, there exists an event which is independent of t and has probability $1-O(e^{-m})$, such that when

$$\begin{aligned} \Vert \varvec{\mu }_t-\varvec{\mu }_{*}\Vert _2\le \prod _{i=1}^{t}(1-\frac{1}{2}\eta _{\mu }^{(i-1)} m)\Vert \varvec{\mu }_1-\varvec{\mu }_{*}\Vert _2+C_1\sigma . \end{aligned}$$

holds for the t th iteration, then we have:

$$\begin{aligned} \Vert \varvec{\mu }_{t+1}-\varvec{\mu }_{*}\Vert _2\le \prod _{i=1}^{t+1}(1-\frac{1}{2}\eta _{\mu }^{(i-1)} m)\Vert \varvec{\mu }_1-\varvec{\mu }_{*}\Vert _2+C_1\sigma . \end{aligned}$$

Proof

Notice that $f^{(t)}$ is strongly convex with respect to $\varvec{\mu }$. For any $\varvec{\mu }_1,\varvec{\mu }_2,\varvec{\mu },\in {\mathbb {R}}^{p}$ and ${\textbf{F}}$ we have

$$\begin{aligned} \begin{aligned} (\varvec{\mu }_1-\varvec{\mu }_2)^{\top }\nabla ^2_\mu f^{(t)}(\varvec{\mu },{\textbf{F}})(\varvec{\mu }_1-\varvec{\mu }_2)&=m\Vert \varvec{\mu }_1-\varvec{\mu }_2\Vert _2^2,\\ \Vert \nabla ^2_\mu f^{(t)}(\varvec{\mu },{\textbf{F}})\Vert&=m. \end{aligned} \end{aligned}$$

According to the update rule in Algorithm 1, we can derive that

$$\begin{aligned} \begin{aligned}&\Vert \varvec{\mu }_{t+1}-\varvec{\mu }_{*}\Vert _2 = \Vert \varvec{\mu }_t-\eta _{\mu }^{(t)}\nabla _{\varvec{\mu }}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t)-\varvec{\mu }_{*}\Vert _{2}\\&\le \Vert \varvec{\mu }_t-\eta _{\mu }^{(t)}\nabla _{\varvec{\mu }}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t)+\eta _{\mu }^{(t)}\nabla _{\varvec{\mu }}f^{(t)}(\varvec{\mu }_{*},{\textbf{F}}_{*})-\varvec{\mu }_{*}\Vert _{2}+\Vert \eta _{\mu }^{(t)}\nabla _{\varvec{\mu }}f^{(t)}(\varvec{\mu }_{*},{\textbf{F}}_{*})\Vert _2\\&\le (1-\frac{1}{2}\eta _{\mu }^{(t)} m)\Vert \varvec{\mu }_t-\varvec{\mu }_{*}\Vert _2+\Vert {\mathcal {E}}_t{\varvec{1}}_{m}^{\top }\Vert _2. \end{aligned} \end{aligned}$$

The last inequality holds because of the strong convexity and smoothness. It can be proved similarly to what we have discussed in Step 1. Besides, $\Vert {\mathcal {E}}_t{\varvec{1}}_{m}\Vert _2\lesssim \sigma m$ with probability $1-O(e^{-m})$ under Assumption 3. The proof is completed by combining all the bounds and choosing $C_1$ big enough. $\square$

1.3 Step 3: The spectral initialization

In order to make the induction proceed correctly, we should first make the starting value fall into the temperate region defined in Lemma 1 through a reasonable initial method. In Algorithm 2, Denote ${\widehat{{\textbf{C}}}}_r = {\textbf{A}}_1{\textbf{B}}_1^{\top }$. According to Theorem 2.2 in Velu and Reinsel (2013), for any $\varvec{\mu }\in {\mathbb {R}}^{p}$, the matrix ${\widehat{{\textbf{C}}}}_r$ is the “best rank r least square estimation” of ${\textbf{C}}_{*}$ i.e.

$$\begin{aligned} {\widehat{{\textbf{C}}}}_r = \underset{\text {rank}({\textbf{C}})=r}{\arg \min }\Vert {\textbf{Y}}_0-{\textbf{C}}{\textbf{X}}_0-\varvec{\mu }{\varvec{1}}_{m_0}^{\top }\Vert _{F}^2. \end{aligned}$$

(20)

Then we can control the error of initial value by the next proposition.

Proposition 3

Let $\varvec{\mu }_1,{\textbf{A}}_1, \text { and }{\textbf{B}}_1$ be the estimations generated by Algorithm 2. Denote ${\textbf{F}}_1 = [{\textbf{A}}_1^{\top },{\textbf{B}}_1^{\top }]^{\top }$. Under Assumptions 1 to 5, we have:

$$\begin{aligned} \begin{aligned} \Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert&\lesssim \frac{1}{\kappa _c\kappa _x}\Vert {\textbf{F}}_{*}\Vert ,\\ \Vert \varvec{\mu }_1-\varvec{\mu }\Vert _2&\lesssim \frac{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})}{\sqrt{\sigma _{\max }(\varvec{\Sigma }_x)}}, \end{aligned} \end{aligned}$$

(21)

with probability $1-O(e^{-m})$.

We can divide the proof of Proposition 3 into the proof of the following three sub-propositions.

Proposition 4

Denote ${\textbf{C}}_{*}={\textbf{A}}_{*}{\textbf{B}}_{*}^{\top },{\hat{{\textbf{C}}}}_r = {\textbf{A}}_1{\textbf{B}}_1^{\top }$. Under Assumptions 1,3 and 4, with probability $1-O(m^{-10})$, we have:

$$\begin{aligned} \Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert _{F}\lesssim \sqrt{\frac{r}{\sigma _{\text{min}}(\varvec{\Sigma }_x)}}\sigma . \end{aligned}$$

Proof

Recall that ${\textbf{Y}}_0={\textbf{C}}_{*}{\textbf{X}}_0+{\mathcal {E}}_0$. From (20) we know that, for every matrix ${\textbf{C}}$ with rank r, we have:

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{Y}}_0-{\hat{{\textbf{C}}}}_r{\textbf{X}}_0+\varvec{\mu }_{*}{\varvec{1}}_{m_0}\Vert _{F}^2\le \Vert {\textbf{Y}}_0-{\textbf{C}}{\textbf{X}}_0+\varvec{\mu }_{*}{\varvec{1}}_{m_0}\Vert _{F}^2\\ \Rightarrow&\Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2\le \Vert {\textbf{C}}{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2+2\langle {\mathcal {E}}_0, {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}{\textbf{X}}_0\rangle \\ \le&\Vert {\textbf{C}}{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2+2\sqrt{2r}\Vert {\mathcal {E}}_0\Vert \left\{ \Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}+\Vert {\textbf{C}}{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}\right\} \\ \le&\frac{b+1}{b}\Vert {\textbf{C}}{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2+\frac{1}{a}\Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2+2r(a+b)\Vert {\mathcal {E}}_0\Vert ^2, \end{aligned} \end{aligned}$$

for every $a,b>0$. The last line holds because of the basic algebraic inequality. Notice that $\Vert {\mathcal {E}}\Vert \le 2\sigma \sqrt{m_0}$ with probability $1-O(\exp \{-m_0\})$ (ref. Lemma 15 in Bunea et al. 2011). Choose $a=2,b=1$ we have $\Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2\le 6\{\Vert {\textbf{C}}{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2+4rm\sigma ^2\}$. Notice that the former inequality holds for every matrix ${\textbf{C}}$ with rank r. In particular, we choose ${\textbf{C}}={\textbf{C}}_{*}$ then obtain:

$$\begin{aligned} \Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2\lesssim r m_0 \sigma ^2. \end{aligned}$$

By Theorem 6.1 in Wainwright (2019), with probability $1-O(e^{-m})$, we have:

$$\begin{aligned} \Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}\ge \sigma _{\min }({\textbf{X}}_0)\Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert _{F}\ge \frac{1}{2}\sqrt{m_0\sigma _{\min }(\varvec{\Sigma }_x)}\Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert _{F}. \end{aligned}$$

Finally,the proof is completed by combining two above inequalities. $\square$

Proposition 5

Let $\varvec{\mu }_1,{\textbf{A}}_1, \text { and }{\textbf{B}}_1$ be the estimations generated by Algorithm 2. Denote ${\textbf{F}}_1 = [{\textbf{A}}_1^{\top },{\textbf{B}}_1^{\top }]^{\top }$. Then we have:

$$\begin{aligned} \Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert \lesssim \sqrt{\kappa _c^{5/2}}\Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert _{F}. \end{aligned}$$

(22)

Here $\kappa _c$ is the condition number of ${\textbf{C}}_{*}$.

Proof

Denote ${\widetilde{{\textbf{C}}}}_{*} = \left[ \begin{array}{cc} {\textbf{O}} & {\textbf{C}}_{*}\\ {\textbf{C}}_{*}^{\top }& {\textbf{O}}\end{array}\right] ,{\widetilde{{\textbf{C}}}}_r= \left[ \begin{array}{cc} {\textbf{O}} & {\hat{{\textbf{C}}}}_r\\ {\hat{{\textbf{C}}}}_r^{\top }& {\textbf{O}}\end{array}\right] \in {\mathbb {R}}^{(p+q)\times (p+q)}$. We then reformulate

$$\begin{aligned} \begin{aligned}&({\textbf{U}}_r,\varvec{\Sigma }_r,{\textbf{V}}_r) = {\text {SVD}}({\hat{{\textbf{C}}}}_r), ({\textbf{U}}_*,\varvec{\Sigma }_*,{\textbf{V}}_*) = {\text {SVD}}({\textbf{C}}_*),\\&{\textbf{F}}_1 = \frac{1}{\sqrt{2}}\left[ \begin{array}{c}{\textbf{A}}_1\\ {\textbf{B}}_1\end{array}\right] , {\widetilde{{\textbf{F}}}}_1 = \frac{1}{\sqrt{2}}\left[ \begin{array}{c}{\textbf{U}}_r\\ {\textbf{V}}_r\end{array}\right] ,{\textbf{F}}_{*}=\frac{1}{\sqrt{2}}\left[ \begin{array}{c}{\textbf{A}}_*\\ {\textbf{B}}_*\end{array}\right] , {\widetilde{{\textbf{F}}}}_* = \frac{1}{\sqrt{2}}\left[ \begin{array}{c}{\textbf{U}}_*\\ {\textbf{V}}_*\end{array}\right] ,\\&{\textbf{H}}_1 = \underset{{\textbf{R}}\in {\mathcal {O}}_r}{\arg \min }\Vert {\textbf{F}}_1{\textbf{R}}-{\textbf{F}}_*\Vert _{F}, {\widetilde{{\textbf{H}}}}_1 = \underset{{\textbf{R}}\in {\mathcal {O}}_r}{\arg \min }\Vert {\widetilde{{\textbf{F}}}}_1{\textbf{R}}-{\widetilde{{\textbf{F}}}}_*\Vert _{F}. \end{aligned} \end{aligned}$$

According to the Lemma B.2-Lemma B.4 in Chen et al. (2020), we have:

$$\begin{aligned} \begin{aligned}&\Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_*\Vert =\Vert {\widetilde{{\textbf{F}}}}_1(\varvec{\Sigma }_r)^{1/2}({\textbf{H}}_1-{\widetilde{{\textbf{H}}}}_1)+{\widetilde{{\textbf{F}}}}_1(\varvec{\Sigma }_r^{1/2}{\widetilde{{\textbf{H}}}}_1-{\widetilde{{\textbf{H}}}}_1\varvec{\Sigma }_{*}^{1/2})+({\widetilde{{\textbf{F}}}}_1{\widetilde{{\textbf{H}}}}_1-{\widetilde{{\textbf{F}}}}_*)\varvec{\Sigma }_{*}^{1/2}\Vert \\&\le \Vert {\widetilde{{\textbf{F}}}}_1\Vert \Vert \varvec{\Sigma }_r^{1/2}\Vert \Vert {\widetilde{{\textbf{H}}}}_1-{\widetilde{{\textbf{H}}}}_1\Vert + \Vert {\widetilde{{\textbf{F}}}}_1\Vert \Vert \varvec{\Sigma }_r^{1/2}{\widetilde{{\textbf{H}}}}_1-{\widetilde{{\textbf{H}}}}_1\varvec{\Sigma }_{*}^{1/2}\Vert +\Vert {\widetilde{{\textbf{F}}}}_1{\widetilde{{\textbf{H}}}}_1-{\widetilde{{\textbf{F}}}}_*\Vert \Vert \varvec{\Sigma }_{*}^{1/2}\Vert \\&\lesssim \sqrt{\sigma _{\max }({\textbf{C}}_*)}\left[ \frac{\sqrt{\kappa _c^3}}{\sigma _{\min }({\textbf{C}}_*)}+\frac{\kappa _c}{\sqrt{\sigma _{\min }({\textbf{C}}_*)}}+\frac{1}{\sqrt{\sigma _{\min }({\textbf{C}}_{*})}}\right] \Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert \\&\lesssim \frac{\kappa _c^{5/2}}{\sqrt{\sigma _{\max }({\textbf{C}}_{*})}}\Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert \le \frac{\kappa _c^{5/2}}{\sqrt{\sigma _{\max }({\textbf{C}}_{*})}}\Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert _{F}. \end{aligned} \end{aligned}$$

$\square$

In conclusion, the first inequality in (21) can be proved as:

$$\begin{aligned} \begin{aligned} \Vert {\textbf{F}}_1{\textbf{H}}_1-{\textbf{F}}_{*}\Vert&\le \frac{\kappa _c^{5/2}}{\sqrt{\sigma _{\max }({\textbf{C}}_{*})}}\times \sqrt{\frac{r}{\sigma _{\min }(\varvec{\Sigma }_x)}}\sigma \\&\le \sqrt{\frac{\kappa _c^{5}r}{\sigma _{\max }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_x)}}\sigma \lesssim \frac{1}{\kappa _c\kappa _x}\Vert {\textbf{F}}_{*}\Vert . \end{aligned} \end{aligned}$$

Here the final inequality holds because of the Assumption 4. The remainder of Proposition (3) is proved by the following statement.

Proposition 6

For intercept term $\varvec{\mu }_1$ generated in Algorithm 2, we have:

$$\begin{aligned} \Vert \varvec{\mu }_1-\varvec{\mu }_{*}\Vert _2\lesssim \frac{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})}{\sqrt{\sigma _{\max }(\varvec{\Sigma }_x)}}. \end{aligned}$$

under Assumptions 3 and 4, with probability $1-O(e^{-m})$.

Proof

$$\begin{aligned} \begin{aligned} \Vert \varvec{\mu }_1-\varvec{\mu }_*\Vert _2&=\Vert (\bar{{\textbf{y}}}-{\hat{{\textbf{C}}}}_r\bar{{\textbf{x}}})-(\bar{{\textbf{y}}}-{\textbf{C}}_{*}\bar{{\textbf{x}}}-\bar{\varvec{\epsilon }})\Vert _2\\&\le \Vert {\hat{{\textbf{C}}}}_r-{\textbf{C}}_{*}\Vert \Vert \bar{{\textbf{x}}}\Vert _2+\Vert \bar{\varvec{\epsilon }}\Vert _2\\&\lesssim \sqrt{\frac{r}{\sigma _{\min}(\varvec{\Sigma }_x)}}\sigma \times \sqrt{\frac{\sigma _{\max }(\varvec{\Sigma }_x)q}{m_0}}+\sqrt{\frac{p}{m_0}}\sigma \\&\lesssim \frac{\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})}{\sqrt{\sigma _{\max }(\varvec{\Sigma }_x)}}. \end{aligned} \end{aligned}$$

The penult inequality holds because of the properties of sub-Gaussian random vector. The last inequality holds under the Assumption 4 and the choice of $m_0\gtrsim q$. $\square$

Appendix 3: The Proof of Theorem 1

In Appendix 2, we illustrate the upper bounds of the parameter estimation error. More importantly, we find that the parameters always stay in the region defined by Lemma 1 as long as the step size of each time point is small enough. This facilitates our analysis of regret using online convex optimization techniques.

Proof

According to Lemma 1 and Lemma 2, if Assumption 5 is satisfied, we can conclude that for the t the step:

$$\begin{aligned} \begin{aligned} \mathbf {(}{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*})^{\top }\nabla _{{\textbf{F}}}^{2}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\mathbf {(}{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*})&\ge \frac{m}{5}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2,\\ (\varvec{\mu }_t-\varvec{\mu }_{*})^{\top }\nabla ^2_{\varvec{\mu }} f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)(\varvec{\mu }_t-\varvec{\mu }_{*})&=m\Vert \varvec{\mu }_t-\varvec{\mu }_{*}\Vert _2^2, \end{aligned} \end{aligned}$$

with probability $1-O(e^{-m})$. Recall that $\nabla _{{\textbf{F}}}^2 f^{(t)}$ is the Hessian of $f^{(t)}$ with respect to ${\textbf{F}}$ and $\nabla ^2_{\varvec{\mu }} f^{(t)}$ is the Hessian of $f^{(t)}$ with respect to $\varvec{\mu }$. For convenient, we abbreviate the gradient $\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t),\nabla _{\varvec{\mu }}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)$ as $\nabla _{{\textbf{F}}}f^{(t)}$ and $\nabla _{\varvec{\mu }}f^{(t)}$. Choose $\eta _c^{(t)}=1/[\alpha _m(t+\kappa _c^2\kappa _x^2)], \eta _{\mu }^{(t)}=1/m(t+1)$ then we have:

$$\begin{aligned} \begin{aligned} 2{\mathcal {R}}(T)&= 2\sum _{t=1}^{T}\left[ f^{(t)}(\varvec{\mu }_t,{\textbf{A}}_t{\textbf{H}}_t,{\textbf{B}}_t{\textbf{H}}_t)-f^{(t)}(\varvec{\mu }_{*},{\textbf{A}}_{*},{\textbf{B}}_{*})\right] \\ &\le 2\sum _{t=1}^{T}\left[ \langle \nabla _{{\textbf{F}}}f^{(t)},{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\rangle -\alpha _m\Vert {\textbf{F}}_{t}{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2\right. \\ &\quad \left. +\langle \nabla _{\varvec{\mu }}f^{(t)},\varvec{\mu }_t-\varvec{\mu }_{*}\rangle -m\Vert \varvec{\mu }_t-\varvec{\mu }_{*}\Vert _2^2\right] \\&\le \sum _{t=1}^{T}\left[ \Vert {\textbf{F}}_{t}{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2(\frac{1}{\eta _c^{(t)}}-\frac{1}{\eta _c^{(t-1)}}-\alpha _m)\right. \\ &\quad \left. +\Vert \varvec{\mu }_t-\varvec{\mu }_{*}\Vert _2^2(\frac{1}{\eta _\mu ^{(t)}}-\frac{1}{\eta _\mu ^{(t-1)}}-m)\right] \\&\quad +\sum _{t=1}^{T}\frac{1}{\min \{m,\alpha _m\}}\times \frac{1}{t}\times \left( \Vert \nabla _{{\textbf{F}}}f^{(t)}\Vert _{F}^2+\Vert \nabla _{\varvec{\mu }}f^{(t)}\Vert _2^2\right) . \end{aligned} \end{aligned}$$

Here the first inequality hold because of the local strong convexity and the second inequality holds because of the fact that:

$$\begin{aligned} \begin{aligned} \Vert {\textbf{F}}_{t+1}{\textbf{H}}_{t+1}-{\textbf{F}}_{*}\Vert _{F}^2&\le \Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}^2+\eta _{c}^{(t)2}\Vert \nabla _{{\textbf{F}}}f^{(t)}\Vert _{F}^2-2\eta _c^{(t)}\langle \nabla _{{\textbf{F}}}f^{(t)},{\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\rangle ,\\ \Vert \varvec{\mu }_{t+1}-\varvec{\mu }_{*}\Vert _{2}^2&= \Vert \varvec{\mu }_{t+1}-\varvec{\mu }_{*}\Vert _{2}^2+\eta _{\mu }^{(t)2}\Vert \nabla _{\varvec{\mu }}f^{(t)}\Vert _{2}^2-2\eta _c^{(t)}\langle \nabla _{\varvec{\mu }}f^{(t)},\varvec{\mu }_{t}-\varvec{\mu }_{*}\rangle . \end{aligned} \end{aligned}$$

We then need to control $\Vert \nabla _{{\textbf{F}}}f^{(t)}\Vert _{F}$ and $\Vert \nabla _{\varvec{\mu }}f^{(t)}\Vert _{F}$. Take $\Vert \nabla _{{\textbf{A}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\Vert _{F}$ as an example, according to Lemma 2 we have:

$$\begin{aligned} \begin{aligned}&\Vert \nabla _{{\textbf{A}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\Vert _{F}=\Vert ({\textbf{A}}_t{\textbf{B}}_t^{\top }{\textbf{X}}_t+\varvec{\mu }_t{\varvec{1}}_m^{\top }-{\textbf{Y}}_t){\textbf{X}}_t^{\top }{\textbf{B}}_t{\textbf{H}}_t+\frac{1}{2}{\textbf{A}}_t{\textbf{H}}_t({\textbf{A}}_t^{\top }{\textbf{A}}_t-{\textbf{B}}_t^{\top }{\textbf{B}}_t)\Vert _{F}\\&\lesssim \Vert {\textbf{A}}_t{\textbf{B}}_t^{\top }{\textbf{X}}_t+\varvec{\mu }{\varvec{1}}_m^{\top }-{\textbf{Y}}_t\Vert \Vert {\textbf{X}}_t\Vert \Vert {\textbf{B}}_t{\textbf{H}}_t\Vert _{F}+\Vert {\textbf{A}}_t^{\top }{\textbf{A}}_t-{\textbf{B}}_t^{\top }{\textbf{B}}_t\Vert \Vert {\textbf{A}}_t{\textbf{H}}_t\Vert _{F}\\&\lesssim \left[ \Vert \left( {\textbf{A}}_t{\textbf{B}}_t^{\top }-{\textbf{A}}_{*}{\textbf{B}}_{*}^{\top }\right) {\textbf{X}}_t+\left( \varvec{\mu }_t-\varvec{\mu }_{*}\right) {\varvec{1}}_m^{\top }\Vert +\Vert {\mathcal {E}}_t\Vert \right] \Vert {\textbf{X}}_t\Vert \Vert {\textbf{B}}_t{\textbf{H}}_t\Vert _{F}\\ &\quad +\Vert {\textbf{A}}_t^{\top }{\textbf{A}}_t-{\textbf{B}}_t^{\top }{\textbf{B}}_t\Vert \Vert {\textbf{A}}_t{\textbf{H}}_t\Vert _{F}\\&\lesssim \Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert \Vert {\textbf{F}}_{*}\Vert \Vert {\textbf{X}}_t\Vert ^2\times \sqrt{r}\Vert {\textbf{F}}_{*}\Vert +\sigma \sqrt{m}\Vert {\textbf{X}}_t\Vert \Vert {\textbf{F}}_{*}\Vert \\&\quad +\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert \Vert {\textbf{F}}_{*}\Vert ^2\\&\lesssim \sqrt{r}m\sigma _{\max }(\varvec{\Sigma }_x)\sigma _{\max }({\textbf{C}}_{*}). \end{aligned} \end{aligned}$$

The upper bounds for $\Vert \nabla _{{\textbf{B}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\Vert _{F}$ and $\Vert \nabla _{\varvec{\mu }}f^{(t)}\Vert _{F}$ are similar. Finally, using the fact that $\sum _{t=1}^{T}(1/t)\le (1+\log (T))$ to complete the proof. $\square$

Appendix 4: The Proof of Theorem 2

Proof

We denote ${\mathcal {E}}_t = [\varvec{\epsilon }_{t_1},\cdots ,\varvec{\epsilon }_{t_m}]$ then we have:

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T} \Vert {\textbf{A}}_t{\textbf{B}}_t^{\top }{\textbf{X}}_t+\varvec{\mu }_t{\varvec{1}}_m^{\top }-{\textbf{A}}_{*}{\textbf{B}}_{*}^{\top }{\textbf{X}}_t-\varvec{\mu }_{*}{\varvec{1}}_m^{\top }\Vert _{F}^2\\&= \sum _{t=1}^{T}f^{(t)}(\varvec{\mu }_t,{\textbf{A}}_t{\textbf{H}}_t,{\textbf{B}}_t{\textbf{H}}_t)-f^{(t)}(\varvec{\mu }_{*}{\textbf{A}}_{*},{\textbf{B}}_{*})\\ &\quad +2\langle {\textbf{A}}_t{\textbf{B}}_t^{\top }{\textbf{X}}_t+\varvec{\mu }_t{\varvec{1}}_m^{\top }-{\textbf{A}}_{*}{\textbf{B}}_{*}^{\top }{\textbf{X}}_t-\varvec{\mu }_{*}{\varvec{1}}_m^{\top },{\mathcal {E}}_t\rangle \\ \le&{\mathcal {R}}(T)+\underbrace{\sum _{t=1}^{T}\langle {\textbf{A}}_t{\textbf{B}}_t^{\top }{\textbf{X}}_t+\varvec{\mu }_t{\varvec{1}}_m^{\top }-{\textbf{A}}_{*}{\textbf{B}}_{*}^{\top }{\textbf{X}}_t-\varvec{\mu }_{*}{\varvec{1}}_m^{\top },{\mathcal {E}}_t\rangle }_{\beta _1}. \end{aligned} \end{aligned}$$

We next just need to control $\beta _1$. For convenience, we denote ${\widetilde{{\textbf{X}}}}_t =[{\varvec{1}}_{m},{\textbf{X}}_{t}^{\top }]^{\top }$ and ${\widetilde{{\textbf{C}}}}_t = [\varvec{\mu }_t,{\textbf{A}}_t{\textbf{B}}_t^{\top }]$. The we have:

$$\begin{aligned} \begin{aligned} \beta _1&=\left\langle \underbrace{\left( \begin{array}{ccc}{\widetilde{{\textbf{C}}}}_{1}-{\widetilde{{\textbf{C}}}}_{*} & & \\ & \ddots & \\ & & {\widetilde{{\textbf{C}}}}_{T}-{\widetilde{{\textbf{C}}}}_{*}\end{array}\right) }_{={\widetilde{{\textbf{C}}}}}\underbrace{\left( \begin{array}{c}{\widetilde{{\textbf{X}}}}_1 \\ \vdots \\ {\widetilde{{\textbf{X}}}}_T\end{array}\right) }_{={\widetilde{{\textbf{X}}}}},\underbrace{\left( \begin{array}{c}{\mathcal {E}}_1 \\ \vdots \\ {\mathcal {E}}_T\end{array}\right) }_{=\widetilde{{\mathcal {E}}}}\right\rangle \\&=\langle {\widetilde{{\textbf{C}}}}{\widetilde{{\textbf{X}}}},\widetilde{{\mathcal {E}}}\rangle =\langle {\widetilde{{\textbf{C}}}}{\widetilde{{\textbf{X}}}},\widetilde{{\mathcal {E}}}{\textbf{P}}\rangle \\&\le \Vert \widetilde{{\mathcal {E}}}{\textbf{P}}\Vert \times \sqrt{2r}\Vert {\widetilde{{\textbf{C}}}}{\widetilde{{\textbf{X}}}}\Vert _{F}\le \frac{1}{2}\Vert {\widetilde{{\textbf{C}}}}{\widetilde{{\textbf{X}}}}\Vert _{F}^2+r\Vert \widetilde{{\mathcal {E}}}{\textbf{P}}\Vert ^2\\&\lesssim \frac{1}{2}\sum _{t=1}^{T} \Vert {\textbf{A}}_t{\textbf{B}}_t^{\top }{\textbf{X}}_t+\varvec{\mu }_t{\varvec{1}}_m^{\top }-{\textbf{A}}_{*}{\textbf{B}}_{*}^{\top }{\textbf{X}}_t-\varvec{\mu }_{*}{\varvec{1}}_m^{\top }\Vert _{F}^2 + r(p+q)\sigma ^2. \end{aligned} \end{aligned}$$

Here ${\textbf{P}} = {\widetilde{{\textbf{X}}}}^{\top }({\widetilde{{\textbf{X}}}}{\widetilde{{\textbf{X}}}}^{\top })^{-1}{\widetilde{{\textbf{X}}}}$ is the projection to the row space of ${\widetilde{{\textbf{X}}}}$. The final inequality holds with probability $1-O(e^{-(p+q)/2})$ according to the Lemma 3 in Bunea et al. (2011). Acording to Assumption 2, we have $1-O(e^{-(p+q)/2}) = 1-O(e^{-m})$. Finally, plug in the upper bound of $\beta _1$ and combine the like items to complete the proof. $\square$

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Liu, X., Liu, W. & Mao, X. Efficient and provable online reduced rank regression via online gradient descent. Mach Learn 113, 8711–8748 (2024). https://doi.org/10.1007/s10994-024-06622-y

Download citation

Received: 18 November 2022
Revised: 02 July 2024
Accepted: 30 August 2024
Published: 23 September 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s10994-024-06622-y

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient and provable online reduced rank regression via online gradient descent

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Provable accelerated gradient method for nonconvex low rank optimization

Online optimization for max-norm regularization

Provably Accelerating Ill-Conditioned Low-Rank Estimation via Scaled Gradient Descent, Even with Overparameterization

Explore related subjects

Availability of data and materials

Code availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendices

Appendix 1: The Proof of Lemma 1

Proof

Appendix 2: The Proof of Lemma 2

1.1 Step 1: The error contraction of \({\textbf{A}}_{*}\) and \({\textbf{B}}_{*}\)

Proposition 1

Proof

1.2 Step 2: The error contraction of \(\varvec{\mu }_{*}\)

Proposition 2

Proof

1.3 Step 3: The spectral initialization

Proposition 3

Proposition 4

Proof

Proposition 5

Proof

Proposition 6

Proof

Appendix 3: The Proof of Theorem 1

Proof

Appendix 4: The Proof of Theorem 2

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation

1.1 Step 1: The error contraction of \({\textbf{A}}_{}\) and \({\textbf{B}}_{}\)