The Reduced Rank Regression (RRR) model is frequently employed in machine learning. By imposing a low-rank restriction on the coefficient matrix, it reduces the number of parameters and thereby improves efficiency and interpretability. In this paper, we study the RRR problem in an online setting, where data arrive in a stream and only a small batch can be used at each step. Previous analogous methods have relied on conventional least squares estimation, which is inefficient and comes with no theoretical guarantee on the convergence rate and no connection to offline strategies. We propose an efficient online RRR algorithm based on non-convex online gradient descent. More importantly, with a constant-order batch size and appropriate initialization, we theoretically prove a convergence result for the mean estimation error generated by our algorithm. Our result achieves an optimal rate up to a logarithmic factor. We also propose an accelerated version of our algorithm. In numerical simulations and real applications, our methods are competitive with the existing method in terms of accuracy and computation speed.
Only public datasets are used, available at https://stats.oecd.org/ and https://www.nytimes.com/article/coronavirus-county-data-us.html.
Code availability
The demo code for our simulations and real applications can be found at https://github.com/shawstat/ORRR_code_and_data.git.
Weidong Liu’s research is supported by NSFC Grant No. 11825104. Xiaojun Mao’s research is supported by NSFC Grant Nos. 12422111 and 12371273, the Shanghai Rising-Star Program 23QA1404600, and the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).
Xiao Liu developed the theory, performed the computations, wrote the original preparation draft, and edited the writing. Weidong Liu and Xiaojun Mao conceived the presented idea, verified the analytical methods, supervised the findings of this work, and reviewed and edited the writing. All authors discussed the results and contributed to the final manuscript.
Appendix 1: The Proof of Lemma 1
For convenience, we divide the direction matrix \({\textbf{V}}\) into two parts by rows. Let \({\textbf{V}}_{A}\in {\mathbb {R}}^{p\times r}\) contain the first p rows of \({\textbf{V}}\) and let \({\textbf{V}}_{B}\in {\mathbb {R}}^{q\times r}\) consist of the remaining q rows. By algebraic calculation, the Hessian of \(f^{(t)}\) with respect to \([{\textbf{A}}^{\top },{\textbf{B}}^{\top }]^{\top }\) can be expressed as:
Then we can prove Lemma 1.
Define \(\varvec{\Delta }_A = {\textbf{A}}-{\textbf{A}}_{*}, \varvec{\Delta }_B={\textbf{B}}-{\textbf{B}}_{*}\). The Hessian (15) can be reformulated as follows:
Here \(\xi _1\) is the term containing \((\varvec{\Delta }_A,\varvec{\Delta }_B)\):
We then replace \({\textbf{A}}_{*},{\textbf{B}}_{*}\) with \({\textbf{A}}_{*}-{\textbf{A}}_2+{\textbf{A}}_2,{\textbf{B}}_{*}-{\textbf{B}}_2+{\textbf{B}}_2\). The Hessian of \(f^{(t)}\) can be written as:
According to the form of \({\textbf{V}}\) in (8) and Lemma 35 in Ma et al. (2018), we can conclude that \({\textbf{A}}_2^{\top }{\textbf{V}}_A+{\textbf{B}}_2^{\top }{\textbf{V}}_{B}\) is symmetric, which is why the last equality of (16) holds. By the Cauchy-Schwarz inequality and basic inequalities for the spectral norm, we have:
According to Theorem 6.1 (Theorem 6.5 for the sub-Gaussian case) in Wainwright (2019) and the restrictions of our region, i.e., (9), with probability \(1-O(e^{-m})\) we have:
Under Assumption 4 we can conclude:
For the lower bound, notice that:
Under Assumptions 1 and 2, the inequalities in (18) hold with probability \(1-O(e^{-m})\) by Theorem 6.1 (Theorem 6.5 for the sub-Gaussian case) in Wainwright (2019) and the fact that \(m\gtrsim {\text {tr}}(\varvec{\Sigma }_x)/\sigma _{\min }(\varvec{\Sigma }_x)\). In the small-batch-size regime, we instead need to use Theorem 5.58 in Vershynin (2012) under Assumptions 1’ and 2’. Combining (16), (17) and (18), we then have:
In a similar way, we can prove \(\Vert \nabla ^2_{{\textbf{F}}} f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\Vert \le 5\,m\sigma _{\max }({\textbf{C}}_{*})\sigma _{\max }(\varvec{\Sigma }_{x})\) by the upper bound of \(\text {vec}({\textbf{V}})^{\top }\nabla ^2_{{\textbf{F}}}f^{(t)}(\varvec{\mu },{\textbf{A}},{\textbf{B}})\text {vec}({\textbf{V}})\) and the definition of the spectral norm. \(\square\)
Appendix 2: The Proof of Lemma 2
The proof of Lemma 2 proceeds by induction and can be decomposed into three steps: the error contraction of \({\textbf{F}}_{*}=[{\textbf{A}}_{*}^{\top },{\textbf{B}}_{*}^{\top }]^{\top }\), the error contraction of \(\varvec{\mu }_{*}\), and the properties of initial values.
1.1 Step 1: The error contraction of \({\textbf{A}}_{*}\) and \({\textbf{B}}_{*}\)
Proposition 1
Under Assumptions 1–5, there exists an event which is independent of t and has probability \(1-O(e^{-m})\), such that when
hold for the t-th iteration, we have:
provided \(\rho _1^{(t)}=\prod _{i=1}^{t}[1-(1/10)m\eta _c^{(i)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]<1\).
For convenience, we denote:
By definition of \({\textbf{H}}_t\) and the updating rule in Algorithm 1, we have:
The second equality holds because \({\textbf{H}}_t\) is orthogonal. For \(\alpha _1\), notice that
Here \({\textbf{I}}_{p+q}\) is the \((p+q)\)-dimensional identity matrix and \({\textbf{F}}(\theta )={\textbf{F}}_{*}+\theta ({\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}),\varvec{\mu }(\theta )=\varvec{\mu }_{*}+\theta (\varvec{\mu }_t-\varvec{\mu }_{*})\text { for }\theta \in [0,1]\). Then we have:
The final inequality holds when we let \({\textbf{V}}={\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*},{\textbf{F}}={\textbf{F}}(\theta ),\varvec{\mu }=\varvec{\mu }(\theta )\) in Lemma 1. Provided \(\eta _c^{(t)}\lesssim 1/(m\sigma _{\max }({\textbf{C}}_{*})\sigma _{\max }(\varvec{\Sigma }_{x})\kappa _x\kappa _c)\), we have \(\alpha _1\le [1-(1/10)m\eta _c^{(t)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]\Vert {\textbf{F}}_t{\textbf{H}}_t-{\textbf{F}}_{*}\Vert _{F}\).
For \(\alpha _2\), we denote \({\mathcal {E}}_t = [\varvec{\epsilon }_{t_1},\cdots ,\varvec{\epsilon }_{t_m}]\) and notice that:
The final inequality holds because of Assumption 3 and basic properties of sub-Gaussian random variables. In conclusion, there exist constants \(C_0,{\widetilde{C}}\) such that:
holds with probability \(1-O(e^{-m})\), provided \(C_0\gg 10{\widetilde{C}}\) and \(\rho _1^{(t)}=\prod _{i=1}^{t}[1-(1/10)m\eta _c^{(i)}\sigma _{\min }({\textbf{C}}_{*})\sigma _{\min }(\varvec{\Sigma }_{x})]<1\). \(\square\)
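The rotation \({\textbf{H}}_t\) used in Step 1 is not defined in this appendix. The following minimal sketch assumes it is the orthogonal matrix that best aligns \({\textbf{F}}_t\) with \({\textbf{F}}_{*}\) in Frobenius norm (the orthogonal Procrustes solution), which is how such rotations are commonly defined in non-convex factorization analyses. The function name `procrustes_rotation` and the toy dimensions are illustrative and not taken from the paper.

```python
import numpy as np

def procrustes_rotation(F_t, F_star):
    """Orthogonal r x r matrix H minimizing ||F_t @ H - F_star||_F (assumed definition of H_t)."""
    # The minimizer is U @ Vt, where U, Vt come from the SVD of the r x r matrix F_t^T F_star.
    U, _, Vt = np.linalg.svd(F_t.T @ F_star)
    return U @ Vt

rng = np.random.default_rng(0)
p, q, r = 6, 5, 2                      # toy sizes, chosen only for illustration
F_star = rng.standard_normal((p + q, r))

# Build an iterate equal to F_star up to a rotation plus a small perturbation.
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))
F_t = F_star @ Q + 0.01 * rng.standard_normal((p + q, r))

H_t = procrustes_rotation(F_t, F_star)
aligned_err = np.linalg.norm(F_t @ H_t - F_star)   # error after alignment
raw_err = np.linalg.norm(F_t - F_star)             # error without alignment
```

Because the identity is itself an orthogonal matrix, the aligned error can never exceed the raw error, and orthogonality of `H_t` leaves Frobenius norms unchanged, which is the property invoked for the second equality in the proof.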
1.2 Step 2: The error contraction of \(\varvec{\mu }_{*}\)
Because \(f^{(t)}\) is a strongly convex function with respect to \(\varvec{\mu }\), the error contraction of \(\varvec{\mu }_{*}\) is relatively easy to establish. We state the following proposition, which follows from the standard analysis of the gradient descent algorithm.
Proposition 2
Under Assumptions 3 to 5, there exists an event which is independent of t and has probability \(1-O(e^{-m})\), such that when
holds for the t-th iteration, we have:
Notice that \(f^{(t)}\) is strongly convex with respect to \(\varvec{\mu }\). For any \(\varvec{\mu }_1,\varvec{\mu }_2,\varvec{\mu }\in {\mathbb {R}}^{p}\) and \({\textbf{F}}\) we have
According to the update rule in Algorithm 1, we can derive that
The last inequality holds because of strong convexity and smoothness, which can be proved in the same way as in Step 1. Besides, \(\Vert {\mathcal {E}}_t{\varvec{1}}_{m}\Vert _2\lesssim \sigma m\) with probability \(1-O(e^{-m})\) under Assumption 3. The proof is completed by combining all the bounds and choosing \(C_1\) sufficiently large. \(\square\)
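To make the strong convexity in \(\varvec{\mu }\) concrete, the sketch below assumes the batch loss has the form \(\frac{1}{2}\Vert {\textbf{Y}}_t-\varvec{\mu }{\varvec{1}}_m^{\top }-{\textbf{C}}{\textbf{X}}_t\Vert _F^2\); the exact loss and update rule of Algorithm 1 are not reproduced in this appendix, so this form and the helper names `mu_gradient` and `mu_step` are assumptions. Under this loss the Hessian in \(\varvec{\mu }\) is \(m{\textbf{I}}_p\), so a gradient step with step size \(\eta\) shrinks the distance to the batch minimizer by the factor \(1-\eta m\).

```python
import numpy as np

def mu_gradient(mu, C, X, Y):
    """Gradient in mu of the assumed loss 0.5 * ||Y - mu 1^T - C X||_F^2."""
    residual = Y - mu[:, None] - C @ X
    return -residual.sum(axis=1)

def mu_step(mu, C, X, Y, eta):
    """One plain gradient descent step on mu with the coefficient matrix C held fixed."""
    return mu - eta * mu_gradient(mu, C, X, Y)

rng = np.random.default_rng(1)
p, q, m = 4, 3, 8                      # toy sizes, chosen only for illustration
C = rng.standard_normal((p, q))
X = rng.standard_normal((q, m))
mu_star = rng.standard_normal(p)
Y = mu_star[:, None] + C @ X + 0.1 * rng.standard_normal((p, m))

mu0 = np.zeros(p)
batch_minimizer = (Y - C @ X).mean(axis=1)         # minimizer in mu for this batch
mu_full = mu_step(mu0, C, X, Y, eta=1.0 / m)       # factor 1 - eta*m = 0
mu_half = mu_step(mu0, C, X, Y, eta=0.5 / m)       # factor 1 - eta*m = 1/2
```

With \(\eta =1/m\) a single step lands exactly on the batch minimizer, while \(\eta =1/(2m)\) halves the distance to it; a decaying step size of order \(1/(mt)\) sits between these regimes.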
1.3 Step 3: The spectral initialization
For the induction to proceed, the starting value must first fall into the temperate region defined in Lemma 1, which requires a suitable initialization method. In Algorithm 2, denote \({\widehat{{\textbf{C}}}}_r = {\textbf{A}}_1{\textbf{B}}_1^{\top }\). According to Theorem 2.2 in Velu and Reinsel (2013), for any \(\varvec{\mu }\in {\mathbb {R}}^{p}\), the matrix \({\widehat{{\textbf{C}}}}_r\) is the “best rank r least squares estimate” of \({\textbf{C}}_{*}\), i.e.
Then we can control the error of initial value by the next proposition.
Proposition 3
Let \(\varvec{\mu }_1,{\textbf{A}}_1, \text { and }{\textbf{B}}_1\) be the estimations generated by Algorithm 2. Denote \({\textbf{F}}_1 = [{\textbf{A}}_1^{\top },{\textbf{B}}_1^{\top }]^{\top }\). Under Assumptions 1 to 5, we have:
with probability \(1-O(e^{-m})\).
We can divide the proof of Proposition 3 into the proof of the following three sub-propositions.
Proposition 4
Denote \({\textbf{C}}_{*}={\textbf{A}}_{*}{\textbf{B}}_{*}^{\top },{\hat{{\textbf{C}}}}_r = {\textbf{A}}_1{\textbf{B}}_1^{\top }\). Under Assumptions 1, 3 and 4, with probability \(1-O(m^{-10})\), we have:
Recall that \({\textbf{Y}}_0={\textbf{C}}_{*}{\textbf{X}}_0+{\mathcal {E}}_0\). From (20) we know that, for every matrix \({\textbf{C}}\) with rank r, we have:
for every \(a,b>0\). The last line holds because of a basic algebraic inequality. Notice that \(\Vert {\mathcal {E}}\Vert \le 2\sigma \sqrt{m_0}\) with probability \(1-O(\exp \{-m_0\})\) (see Lemma 15 in Bunea et al. 2011). Choosing \(a=2,b=1\), we have \(\Vert {\hat{{\textbf{C}}}}_r{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2\le 6\{\Vert {\textbf{C}}{\textbf{X}}_0-{\textbf{C}}_{*}{\textbf{X}}_0\Vert _{F}^2+4rm\sigma ^2\}\). Notice that this inequality holds for every matrix \({\textbf{C}}\) with rank r. In particular, choosing \({\textbf{C}}={\textbf{C}}_{*}\), we obtain:
By Theorem 6.1 in Wainwright (2019), with probability \(1-O(e^{-m})\), we have:
Finally, the proof is completed by combining the two inequalities above. \(\square\)
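The “best rank r least squares” estimate \({\hat{{\textbf{C}}}}_r\) analysed in Proposition 4 can be illustrated with the classical reduced-rank construction: fit ordinary least squares, then project onto the top r left singular vectors of the fitted values. The sketch below is a simplified illustration that omits the intercept and assumes \({\textbf{X}}_0\) has full row rank; it is not a transcription of Algorithm 2, and the function names and toy data are assumptions.

```python
import numpy as np

def rank_r_least_squares(X, Y, r):
    """Minimize ||Y - C X||_F over matrices C of rank at most r (no intercept)."""
    # Ordinary least squares: C_ols = Y X^T (X X^T)^{-1}.
    C_ols = np.linalg.solve(X @ X.T, X @ Y.T).T
    # Top-r left singular vectors of the fitted values C_ols X.
    U, _, _ = np.linalg.svd(C_ols @ X, full_matrices=False)
    P_r = U[:, :r] @ U[:, :r].T
    return P_r @ C_ols

def squared_loss(C, X, Y):
    return np.linalg.norm(Y - C @ X) ** 2

rng = np.random.default_rng(2)
p, q, r, m0 = 5, 4, 2, 200             # toy sizes, chosen only for illustration
A_star = rng.standard_normal((p, r))
B_star = rng.standard_normal((q, r))
C_star = A_star @ B_star.T
X0 = rng.standard_normal((q, m0))
Y0 = C_star @ X0 + 0.5 * rng.standard_normal((p, m0))

C_hat = rank_r_least_squares(X0, Y0, r)
loss_hat = squared_loss(C_hat, X0, Y0)
loss_star = squared_loss(C_star, X0, Y0)
```

Since \({\textbf{C}}_{*}\) itself has rank r, the in-sample loss of `C_hat` cannot exceed that of `C_star`; this is the optimality property that the proof exploits when it sets \({\textbf{C}}={\textbf{C}}_{*}\).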
Proposition 5
Let \(\varvec{\mu }_1,{\textbf{A}}_1, \text { and }{\textbf{B}}_1\) be the estimations generated by Algorithm 2. Denote \({\textbf{F}}_1 = [{\textbf{A}}_1^{\top },{\textbf{B}}_1^{\top }]^{\top }\). Then we have:
Here \(\kappa _c\) is the condition number of \({\textbf{C}}_{*}\).
Denote \({\widetilde{{\textbf{C}}}}_{*} = \left[ \begin{array}{cc} {\textbf{O}} & {\textbf{C}}_{*}\\ {\textbf{C}}_{*}^{\top }& {\textbf{O}}\end{array}\right] ,{\widetilde{{\textbf{C}}}}_r= \left[ \begin{array}{cc} {\textbf{O}} & {\hat{{\textbf{C}}}}_r\\ {\hat{{\textbf{C}}}}_r^{\top }& {\textbf{O}}\end{array}\right] \in {\mathbb {R}}^{(p+q)\times (p+q)}\). We then reformulate
According to Lemmas B.2–B.4 in Chen et al. (2020), we have:
In conclusion, the first inequality in (21) can be proved as:
Here the final inequality holds because of Assumption 4. The remainder of Proposition 3 is proved by the following statement.
Proposition 6
For the intercept term \(\varvec{\mu }_1\) generated in Algorithm 2, we have:
under Assumptions 3 and 4, with probability \(1-O(e^{-m})\).
The penultimate inequality holds because of the properties of sub-Gaussian random vectors. The last inequality holds under Assumption 4 and the choice of \(m_0\gtrsim q\). \(\square\)
Appendix 3: The Proof of Theorem 1
In Appendix 2, we establish upper bounds on the parameter estimation error. More importantly, we find that the parameters always stay in the region defined by Lemma 1 as long as the step size at each time point is small enough. This facilitates our analysis of regret using online convex optimization techniques.
According to Lemma 1 and Lemma 2, if Assumption 5 is satisfied, we can conclude that for the t-th step:
with probability \(1-O(e^{-m})\). Recall that \(\nabla _{{\textbf{F}}}^2 f^{(t)}\) is the Hessian of \(f^{(t)}\) with respect to \({\textbf{F}}\) and \(\nabla ^2_{\varvec{\mu }} f^{(t)}\) is the Hessian of \(f^{(t)}\) with respect to \(\varvec{\mu }\). For convenience, we abbreviate the gradients \(\nabla _{{\textbf{F}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t),\nabla _{\varvec{\mu }}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\) as \(\nabla _{{\textbf{F}}}f^{(t)}\) and \(\nabla _{\varvec{\mu }}f^{(t)}\). Choosing \(\eta _c^{(t)}=1/[\alpha _m(t+\kappa _c^2\kappa _x^2)], \eta _{\mu }^{(t)}=1/[m(t+1)]\), we have:
Here the first inequality holds because of the local strong convexity and the second inequality holds because of the fact that:
We then need to control \(\Vert \nabla _{{\textbf{F}}}f^{(t)}\Vert _{F}\) and \(\Vert \nabla _{\varvec{\mu }}f^{(t)}\Vert _{F}\). Taking \(\Vert \nabla _{{\textbf{A}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\Vert _{F}\) as an example, by Lemma 2 we have:
The upper bounds for \(\Vert \nabla _{{\textbf{B}}}f^{(t)}(\varvec{\mu }_t,{\textbf{F}}_t{\textbf{H}}_t)\Vert _{F}\) and \(\Vert \nabla _{\varvec{\mu }}f^{(t)}\Vert _{F}\) are similar. Finally, the proof is completed using the fact that \(\sum _{t=1}^{T}(1/t)\le (1+\log (T))\). \(\square\)
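For completeness, the harmonic-sum bound used in the last step follows from a standard integral comparison, which is also where the logarithmic factor in the rate originates. Since \(1/t\le \int _{t-1}^{t}x^{-1}\,dx\) for every \(t\ge 2\):

```latex
\sum_{t=1}^{T}\frac{1}{t}
  \;=\; 1+\sum_{t=2}^{T}\frac{1}{t}
  \;\le\; 1+\sum_{t=2}^{T}\int_{t-1}^{t}\frac{dx}{x}
  \;=\; 1+\int_{1}^{T}\frac{dx}{x}
  \;=\; 1+\log (T).
```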
Appendix 4: The Proof of Theorem 2
Denoting \({\mathcal {E}}_t = [\varvec{\epsilon }_{t_1},\cdots ,\varvec{\epsilon }_{t_m}]\), we have:
It remains to control \(\beta _1\). For convenience, we denote \({\widetilde{{\textbf{X}}}}_t =[{\varvec{1}}_{m},{\textbf{X}}_{t}^{\top }]^{\top }\) and \({\widetilde{{\textbf{C}}}}_t = [\varvec{\mu }_t,{\textbf{A}}_t{\textbf{B}}_t^{\top }]\). Then we have:
Here \({\textbf{P}} = {\widetilde{{\textbf{X}}}}^{\top }({\widetilde{{\textbf{X}}}}{\widetilde{{\textbf{X}}}}^{\top })^{-1}{\widetilde{{\textbf{X}}}}\) is the projection onto the row space of \({\widetilde{{\textbf{X}}}}\). The final inequality holds with probability \(1-O(e^{-(p+q)/2})\) according to Lemma 3 in Bunea et al. (2011). According to Assumption 2, we have \(1-O(e^{-(p+q)/2}) = 1-O(e^{-m})\). Finally, plug in the upper bound of \(\beta _1\) and combine like terms to complete the proof. \(\square\)
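The projection \({\textbf{P}}\) can be checked numerically. The sketch below builds the augmented design \({\widetilde{{\textbf{X}}}}=[{\varvec{1}}_{m},{\textbf{X}}^{\top }]^{\top }\) from random data and forms \({\textbf{P}}\) as defined above; the helper name `row_space_projection` and the toy sizes are illustrative assumptions, and \({\widetilde{{\textbf{X}}}}\) is assumed to have full row rank.

```python
import numpy as np

def row_space_projection(X_tilde):
    """P = X~^T (X~ X~^T)^{-1} X~, the orthogonal projection onto the row space of X~."""
    # Solve the linear system instead of forming the inverse explicitly.
    return X_tilde.T @ np.linalg.solve(X_tilde @ X_tilde.T, X_tilde)

rng = np.random.default_rng(3)
q, m = 3, 10                           # toy sizes, chosen only for illustration
X = rng.standard_normal((q, m))
X_tilde = np.vstack([np.ones((1, m)), X])   # (q + 1) x m, first row is the intercept
P = row_space_projection(X_tilde)
```

`P` is symmetric and idempotent, leaves the rows of `X_tilde` unchanged, and has trace \(q+1\), the dimension of the row space. Projecting the noise therefore retains only a \((q+1)\)-dimensional component in each of its rows.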
Liu, X., Liu, W. & Mao, X. Efficient and provable online reduced rank regression via online gradient descent. Mach Learn 113, 8711–8748 (2024). https://doi.org/10.1007/s10994-024-06622-y