Random perturbation subsampling for rank regression with massive data


Abstract

Rank regression plays a fundamental role in statistical data analysis due to its robustness and high efficiency, and it has been widely used in various scientific fields. However, when the data size is massive but computing resources are limited, rank regression estimation can incur an unacceptably high computational cost. To handle this issue, this paper applies a random perturbation subsampling method to rank regression models. Specifically, we develop two repeated random perturbation subsampling algorithms for estimating the model parameters, examining two different weighting strategies in the objective function: product weights and additive weights. Unlike existing optimal and Poisson subsampling methods, our methods do not require the explicit calculation of subsampling probabilities for all data points, which typically involves estimating unknown parameters, and are therefore easier to implement. Theoretically, statistical justification is provided for the proposed estimators, including consistency and asymptotic normality. Extensive simulation studies and an empirical application illustrate the effectiveness of the proposed methods.
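To fix ideas, a minimal sketch of the product-weight variant is given below: each replicate draws independent mean-one stochastic weights, minimizes the weighted pairwise rank loss, and the replicate estimates are averaged. The zero-inflated weight law and the derivative-free solver used here are illustrative assumptions rather than the specific choices made in the paper.

```python
# Minimal sketch of repeated random-perturbation subsampling for rank regression
# with product weights.  The weight law W_i = (n/r) * Bernoulli(r/n) (mean one,
# zero-inflated so only about r points enter each replicate) and the Powell
# solver are illustrative assumptions, not the paper's exact specification.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def product_weight_loss(beta, X, Y, W):
    """Perturbed pairwise L1 (rank regression) loss with product weights W_i * W_j."""
    idx = np.flatnonzero(W > 0)                  # pairs with a zero weight contribute nothing
    res = X[idx] @ beta - Y[idx]                 # signed residuals on the perturbation subsample
    dif = np.abs(res[:, None] - res[None, :])    # |Y_ij - X_ij^T beta| for all subsample pairs
    ww = W[idx][:, None] * W[idx][None, :]       # product weights
    n = len(Y)
    return 2.0 * np.triu(ww * dif, k=1).sum() / (n * (n - 1))

def perturbation_subsample_fit(X, Y, r, m, beta_init):
    """Draw m independent sets of weights, fit each replicate, average the estimates."""
    n = len(Y)
    fits = []
    for _ in range(m):
        W = (n / r) * rng.binomial(1, r / n, size=n)   # hypothetical perturbation weights
        out = minimize(product_weight_loss, beta_init, args=(X, Y, W), method="Powell")
        fits.append(out.x)
    return np.mean(fits, axis=0)

# toy illustration with heavy-tailed errors
n, p = 2000, 3
beta0 = np.array([1.0, -0.5, 2.0])
X = rng.standard_normal((n, p))
Y = X @ beta0 + rng.standard_t(df=3, size=n)
print(perturbation_subsample_fit(X, Y, r=200, m=10, beta_init=np.zeros(p)))
```

Because the product weights vanish outside the subsample, each replicate evaluates only about r(r-1)/2 pairwise terms instead of n(n-1)/2.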


Data availability

Data is provided within the manuscript or supplementary information files.


Acknowledgements

The authors thank the Editor and anonymous reviewers for their constructive comments and suggestions, which greatly improved the quality of this work. Xia’s work was supported by the National Natural Science Foundation of China (Grant Number 11801202) and the Fundamental Research Funds for the Central Universities (Grant Number 2021CDJQY-047).

Author information


Contributions

Sijin He: Conceptualization, Methodology, Software, Validation, Writing - original draft. Xiaochao Xia: Supervision, Methodology, Writing - review & editing.

Corresponding author

Correspondence to Xiaochao Xia.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proofs

We denote the cumulative distribution function (CDF) and probability density function (PDF) of \(\varepsilon _{ij} = \varepsilon _i - \varepsilon _j\) by G and g, respectively. Simple algebra yields \(g\left( s\right) = \int f(t)f(t-s)dt\) and \(g(0) = \int f^2(t)dt = \omega \). To facilitate the proofs, we first introduce some notation. Let \(Z_i = (W_i,Y_i,X_i^T)^T \) and

$$\begin{aligned} h(Z_i,Z_j)&= W_iW_j X_{ij}[\frac{1}{2}-I(\varepsilon _{ij}<0)], \\ U_n&= \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}h(Z_i,Z_j), \\ h_1(Z_i)&= E\left[ h(Z_i,Z_j)\,|\,Z_i\right] \\&= W_i\left( X_i - EX \right) \left[ F(\varepsilon _i)-1/2\right] ,\\ A_{2ij}&= W_iW_j\int _{0}^{a_nX_{ij}^Tu}\left[ I(\varepsilon _{ij}\le t)-I(\varepsilon _{ij}\le 0)\right] dt. \end{aligned}$$

To prove the main results stated in Sect. 3, we need to prove the following lemmas.

Lemma A.1

Under Assumptions 1 and 4, as \(n \longrightarrow \infty \)

$$\begin{aligned} a_n^{-1}U_n {\mathop {\longrightarrow }\limits ^{d}} N\left( {\varvec{0}}, \frac{1}{3}\varvec{\Sigma }_{X}\right) . \end{aligned}$$
(A1)

Proof

First note that \(U_n\) is a U-statistic of degree 2 with kernel h and that \(E\left[ h\left( Z_i,Z_j\right) \right] = 0\). Recall the definition of \(h_1\left( Z_i\right) \) and denote \(\delta _1= Var\left( h_1\left( Z_i\right) \right) \) and \(\delta _2 = Var\left( h\left( Z_i,Z_j\right) \right) \). It can be shown that

$$\begin{aligned} Var\left( U_n\right) = \frac{4\left( n-2\right) }{n\left( n-1\right) }\delta _1 + \frac{2}{n\left( n-1\right) }\delta _2, \end{aligned}$$
(A2)

where

$$\begin{aligned} \delta _1 =&Var\left( h_1\left( Z_i\right) \right) =E\left[ h_1\left( Z_i\right) h_1^T\left( Z_i\right) \right] \\ =&E\left[ W_i^2\left( F\left( \varepsilon _i\right) -\frac{1}{2}\right) ^2 \right. \\&\left. \times \left( X_i-E\left[ X\right] \right) \left( X_i-E\left[ X\right] \right) ^T\right] \\ =&\frac{t_n}{12}\varvec{\Sigma }_{X},\\ \delta _2 =&Var\left( h\left( Z_i,Z_j\right) \right) =E\left[ h\left( Z_i,Z_j\right) h^T\left( Z_i,Z_j\right) \right] \\ =&E\left[ W_i^2W_j^2\left( \frac{1}{2}-I\left( \varepsilon _{ij} <0\right) \right) ^2X_{ij}X_{ij}^T\right] \\ =&\frac{t_n^2}{2}\varvec{\Sigma }_{X}. \end{aligned}$$
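For readability, we record the moment facts behind the constants 1/12 and 1/2 above; here we use \(E\left[ W_i^2\right] = t_n\), as the \(t_n\) factors indicate:

$$\begin{aligned} F\left( \varepsilon _i\right)&\sim U\left( 0,1\right) \ \Rightarrow \ E\left[ \left( F\left( \varepsilon _i\right) -\tfrac{1}{2}\right) ^2\right] = \tfrac{1}{12}, \qquad E\left[ \left( \tfrac{1}{2}-I\left( \varepsilon _{ij}<0\right) \right) ^2\right] = \tfrac{1}{4},\\ E\left[ X_{ij}X_{ij}^T\right]&= 2\varvec{\Sigma }_{X}, \qquad \text {so}\qquad \delta _1 = t_n\cdot \tfrac{1}{12}\cdot \varvec{\Sigma }_{X}, \qquad \delta _2 = t_n^2\cdot \tfrac{1}{4}\cdot 2\varvec{\Sigma }_{X} = \tfrac{t_n^2}{2}\varvec{\Sigma }_{X}. \end{aligned}$$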

Thus, we have

$$\begin{aligned} Var\left( U_n\right) =&\frac{t_n}{n}\left( \frac{n-2}{3\left( n-1\right) } + \frac{t_n}{n-1} \right) \varvec{\Sigma }_{X} \nonumber \\ =&\frac{t_n}{3n}\varvec{\Sigma }_{X} + o\left( \frac{t_n}{n}\right) . \end{aligned}$$
(A3)

Next, we have the following decomposition

$$\begin{aligned} U_n = \frac{2}{n}\sum _{i=1}^{n}h_1\left( Z_i\right) + R_n . \end{aligned}$$
(A4)

In what follows, we show that \( R_n = o_P\left( a_n\right) \). By a direct calculation, it can be seen that the decomposition (A4) is an orthogonal decomposition, namely,

$$\begin{aligned} Cov\left[ R_n, h_1\left( Z_i\right) \right] = 0, \, i = 1, \dots , n . \end{aligned}$$

Since \(Var\left( \frac{2}{n}\sum _{i=1}^{n}h_1\left( Z_i\right) \right) = \frac{t_n}{3n}\varvec{\Sigma }_{X}\) and \(E\left[ R_n\right] = 0\), we have

$$\begin{aligned} Var\left( R_n\right)&= Var\left( U_n\right) - Var\left( \frac{2}{n}\sum _{i=1}^{n}h_1\left( Z_i\right) \right) \\&= o\left( \frac{t_n}{n}\right) . \end{aligned}$$

Therefore, an application of the multivariate Lindeberg-Lévy central limit theorem (Hansen 2022) yields

$$\begin{aligned} a_n^{-1}U_n{\mathop {\longrightarrow }\limits ^{d}}N\left( {\varvec{0}},\frac{1}{3}\varvec{\Sigma }_{X}\right) . \end{aligned}$$
(A5)

\(\square \)

Lemma A.2

Under Assumptions 2 and 3, we have

$$\begin{aligned} E\left[ A_{2ij}\right] = a_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u + O\left( a_n^3\right) . \end{aligned}$$
(A6)

Proof

A direct calculation gives

$$\begin{aligned}&E\left[ A_{2ij}\right] \\&\quad =E\left[ \int _0^{a_nX_{ij}^{T}u} \left( G\left( t\right) -G\left( 0\right) \right) dt\right] \\&\quad =E\left[ \int _0^{a_nX_{ij}^{T}u} \left( g\left( 0\right) t + \frac{1}{2}g'\left( \zeta \right) t^2\right) dt\right] \\&\quad = a_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u + \frac{1}{2}E\left[ \int _0^{a_nX_{ij}^{T}u} g'\left( \zeta \right) t^2dt\right] , \end{aligned}$$

where \(\zeta \in \left( 0,t\right) \). Then it follows that

$$\begin{aligned}&\left| EA_{2ij} - a_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u\right| \\&\quad \le \frac{1}{2}E\left| \int _0^{a_nX_{ij}^{T}u} \left| g'\left( \zeta \right) \right| t^2dt\right| \\&\quad \le \frac{c_1a_n^3}{6}\Vert u \Vert ^3 E\Vert X_{ij} \Vert ^3 = O\left( a_n^3\right) , \end{aligned}$$

where \(c_1\) is a constant and the last equality is due to Assumption 2. Hence,

$$\begin{aligned} EA_{2ij} = a_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u + O\left( a_n^3\right) . \end{aligned}$$
(A7)

\(\square \)

Lemma A.3

Under Assumptions 2 and 3, we have

$$\begin{aligned} E\left[ A_{2ij}^2\right] = O\left( t_n^2a_n^3\right) . \end{aligned}$$
(A8)

Proof

First, using Hölder’s inequality, we have

$$\begin{aligned}&E\left[ A_{2ij}^2\right] \\&\quad \le t_n^2E\left[ \left( \int _{0}^{\left| a_nX_{ij}^{T}u\right| }\left| I\left( \varepsilon _{ij}\le t\right) -I\left( \varepsilon _{ij}\le 0\right) \right| dt\right) ^2\right] \\&\quad \le t_n^2E\left[ \int _{0}^{\left| a_nX_{ij}^{T}u\right| }\left| I\left( \varepsilon _{ij}\le t\right) -I\left( \varepsilon _{ij}\le 0\right) \right| ^2 dt \right. \\&\quad \quad \left. \times \int _{0}^{\left| a_nX_{ij}^{T}u\right| }1 dt \right] \\&\quad \le t_n^2E\left[ \left| a_nX_{ij}^{T}u\right| \int _{0}^{\left| a_nX_{ij}^{T}u\right| }\left( G\left( t\right) -G\left( 0\right) \right) dt \right] . \end{aligned}$$

Using the same arguments as in Lemma A.2 and under Assumptions 2 and 3, we have

$$\begin{aligned}&E\left[ \left| a_nX_{ij}^{T}u\right| \int _{0}^{\left| a_nX_{ij}^{T}u\right| }\left( G\left( t\right) -G\left( 0\right) \right) dt \right] \\&\quad = E\left[ \left| a_nX_{ij}^{T}u\right| \int _{0}^{\left| a_nX_{ij}^{T}u\right| }\left( g\left( 0\right) t + \frac{1}{2}g'\left( \zeta ^*\right) t^2\right) dt \right] \\&\quad \le \frac{g\left( 0\right) a_n^3}{2}E\left| X_{ij}^{T}u\right| ^3 + \frac{c_1a_n^4}{6}E\left| X_{ij}^{T}u\right| ^4 = O\left( a_n^3 \right) . \end{aligned}$$

Combining this with the preceding bound yields \(E\left[ A_{2ij}^2\right] = O\left( t_n^2a_n^3\right) \), which completes the proof.

\(\square \)

We are now ready to prove the consistency and the asymptotic normality of \(\widetilde{\varvec{\beta }}\) stated in Theorem 3.1.

Proof of Theorem 3.1

First note that the estimator \(\widetilde{\varvec{\beta }}\) minimizes the objective function

$$\begin{aligned} Q_n\left( \varvec{\beta }\right) = \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}W_iW_j\left| Y_{ij}-X_{ij}^{T}\varvec{\beta }\right| . \end{aligned}$$
(A9)

To show \(\widetilde{\varvec{\beta }} - \varvec{\beta }_0 = O_P\left( a_n\right) \), it is sufficient to show that, for any given \(\xi >0\), there exists a sufficiently large constant C such that

$$\begin{aligned} {\mathbb {P}}\left\{ \inf \limits _{\Vert u \Vert =C }Q_n\left( \varvec{\beta }_0+a_nu\right) >Q_n\left( \varvec{\beta }_0\right) \right\} \ge 1-\xi . \end{aligned}$$
(A10)

Let \(A\left( u\right) =Q_n\left( \varvec{\beta }_0+a_nu\right) -Q_n\left( \varvec{\beta }_0\right) \). Then we have

$$\begin{aligned}&A\left( u\right) \\&\quad =\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}W_iW_j\left\{ \left| \varepsilon _{ij}-a_nX_{ij}^{T}u\right| -\left| \varepsilon _{ij}\right| \right\} . \end{aligned}$$

Due to Knight’s identity (Knight 1998)

$$\begin{aligned} \left( \left| r-s\right| -\left| r\right| \right) /2&=-s\left( \frac{1}{2}-I\left( r<0\right) \right) \\&\quad +\int _0^s \left( I\left( r\le t\right) -I\left( r\le 0\right) \right) dt, \end{aligned}$$

we have

$$\begin{aligned} A\left( u\right) =2A_1\left( u\right) +2A_2\left( u\right) , \end{aligned}$$

where

$$\begin{aligned} A_1\left( u\right) =&-\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}W_iW_ja_nX_{ij}^{T}u \nonumber \\&\times \left( \frac{1}{2}-I\left( \varepsilon _{ij}<0\right) \right) \nonumber \\ =&-a_nu^TU_n, \end{aligned}$$
(A11)
$$\begin{aligned} A_2\left( u\right) =&\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}W_iW_j \nonumber \\&\times \int _0^{a_nX_{ij}^{T}u} \left( I\left( \varepsilon _{ij}\le t\right) -I\left( \varepsilon _{ij}\le 0\right) \right) dt \nonumber \\ =&\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}A_{2ij}. \end{aligned}$$
(A12)

The limiting behavior of the first part \(A_1\left( u\right) \) follows directly from Lemma A.1. Therefore, we only need to show that the second part satisfies \(A_2\left( u\right) = a_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( a_n^2\right) \). Note that

$$\begin{aligned}&A_2\left( u\right) \nonumber \\&\quad = \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}\left\{ E\left[ A_{2ij}\right] + \left( A_{2ij} - E\left[ A_{2ij}\right] \right) \right\} \nonumber \\&\quad \triangleq A_2^{\left( 1\right) } +A_2^{\left( 2\right) }, \end{aligned}$$
(A13)

where

$$\begin{aligned} A_2^{\left( 1\right) }&= \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n} E\left[ A_{2ij}\right] ,\\ A_2^{\left( 2\right) }&= \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}\left( A_{2ij} - E\left[ A_{2ij}\right] \right) \\&\triangleq \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n} {\widetilde{h}}\left( Z_i,Z_j\right) . \end{aligned}$$

According to Hoeffding (1948) and Lemma A.3,

$$\begin{aligned} Var\left( A_2^{\left( 2\right) }\right)&\le \frac{2}{n}E\left[ {\widetilde{h}}^2\left( Z_i,Z_j\right) \right] \\&\le \frac{2}{n}E\left[ A_{2ij}^2\right] = O\left( \frac{t_n^2a_n^3}{n} \right) . \end{aligned}$$

Thus, it follows from Chebyshev’s inequality and Assumption 5 that

$$\begin{aligned} A_2^{\left( 2\right) } = o_P\left( a_n^2\right) . \end{aligned}$$
(A14)

Combining Lemma A.2 and (A14), we have

$$\begin{aligned} A_2\left( u\right) = a_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( a_n^2\right) . \end{aligned}$$
(A15)

Furthermore, by Lemma A.1, (A11) and (A15), we can obtain

$$\begin{aligned} A\left( u\right)&= -2a_nu^TU_n + 2g\left( 0\right) a_n^2u^T\varvec{\Sigma }_{X}u + o_P\left( a_n^2\right) \\&= 2a_n^2\left[ -a_n^{-1}u^TU_n + g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( 1\right) \right] \\&= 2a_n^2\left[ -O_P\left( \Vert u \Vert \right) + O\left( \Vert u \Vert ^2\right) + o_P\left( 1\right) \right] . \end{aligned}$$

So, by choosing a sufficiently large C, the term \(A_1\left( u\right) \) is dominated by \(A_2\left( u\right) \) with probability \(1-\xi \). Hence, \( Q_n\left( \varvec{\beta }_0+a_nu\right) -Q_n\left( \varvec{\beta }_0\right) >0 \) with probability \(1-\xi \). This implies that, with probability approaching 1, there exists a local minimizer \(\widetilde{\varvec{\beta }}\) such that \(\widetilde{\varvec{\beta }} - \varvec{\beta }_0 = O_P\left( a_n\right) \). Note that \(\widetilde{\varvec{\beta }}\) is also the global minimizer due to the convexity of the objective function.

It remains to show the asymptotic normality. To this end, note that \(a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) \) is the minimizer of the convex function \(a_n^{-1}A\left( u\right) \). Thus, according to the corollary of Hjort and Pollard (2011, page 3), \(a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) \) satisfies

$$\begin{aligned} a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) = \left( 2g\left( 0\right) \right) ^{-1}\varvec{\Sigma }_{X}^{-1}\left( a_n^{-1}U_n\right) + o_P\left( 1\right) . \end{aligned}$$
(A16)

Invoking Lemma A.1 and Slutsky's theorem, we have

$$\begin{aligned} a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) {\mathop {\longrightarrow }\limits ^{d}}N\left( {\varvec{0}},\frac{1}{12g^2\left( 0\right) }\varvec{\Sigma }_{X}^{-1}\right) . \end{aligned}$$
(A17)
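For completeness, the covariance matrix in (A17) is the sandwich form obtained by combining (A16) with Lemma A.1:

$$\begin{aligned} \left( 2g\left( 0\right) \right) ^{-1}\varvec{\Sigma }_{X}^{-1}\cdot \frac{1}{3}\varvec{\Sigma }_{X}\cdot \left( 2g\left( 0\right) \right) ^{-1}\varvec{\Sigma }_{X}^{-1} = \frac{1}{12g^2\left( 0\right) }\varvec{\Sigma }_{X}^{-1}. \end{aligned}$$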

This completes the proof of the theorem. \(\square \)

Proof of Corollary 3.1

First, we consider the case \(m >t_n\). By the triangle inequality and the definition of \(\widetilde{\varvec{\beta }}^{\left( m\right) }\), we have

$$\begin{aligned} \Vert \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0 \Vert \le \frac{1}{m}\sum _{j=1}^{m} \Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert . \end{aligned}$$
(A18)

Note that \(\Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert = O_p\left( a_n\right) \) implies \(\frac{1}{m}\sum _{j=1}^{m} \Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert = O_p\left( a_nm^{-1/2}\right) \). Therefore, when \(a_n^{-2}m >n\), \(\Vert \widetilde{\varvec{\beta }}^{\left( m\right) }- \varvec{\beta }_0 \Vert = O_p\left( n^{-1/2}\right) \), that is, \(\widetilde{\varvec{\beta }}^{\left( m\right) }\) is \(\sqrt{n}\)-consistent. Next, we prove the asymptotic normality of \(\widetilde{\varvec{\beta }}^{\left( m\right) }\). Denote \(H^{-1} = \left( 2\,g\left( 0\right) \right) ^{-1}\varvec{\Sigma }_X^{-1}\). Then

$$\begin{aligned}&n^{1/2}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \\&\quad = \frac{n^{1/2}a_n}{m}\sum _{j=1}^{m}a_n^{-1}\left( \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0\right) \\&\quad = H^{-1}\sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \\&\quad \quad + o_P\left( \Vert n^{1/2}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \Vert \right. \\&\quad \quad +\left. \Vert \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \Vert \right) \\&\quad = H^{-1}\sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \\&\quad \quad + o_P\left( 1 + \Vert \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \Vert \right) , \end{aligned}$$

where \(h_{1,j}\left( Z_i\right) = W_{i,j}\left( X_i - E\left[ X\right] \right) \left[ F\left( \varepsilon _i\right) -1/2\right] \) and \(W_{i,j}\) represents the stochastic weight assigned to the ith observation in the jth perturbation.

Now we show that \( \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) = O_P\left( 1\right) \), which is the sum of n i.i.d. elements. Conditioning on the ith observation \(\left( Y_i,X_i\right) \), \(W_{i,j}, \hspace{5.0pt}j=1,\dots ,m\) are independent stochastic weights. Thus, it follows from \(E\left[ \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}\right. \) \(\left. h_{1,j}\left( Z_i\right) \right] = 0\) that

$$\begin{aligned}&\Vert \sum _{i=1}^{n}Var\left( \frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \right) \Vert \\&\quad = \Vert \sum _{i=1}^{n}\frac{4}{nm^2}\left\{ Var\left( E\left[ \sum _{j=1}^{m}h_{1,j}\left( Z_i\right) |Z_i\right] \right) \right. \\&\quad \quad \left. + E\left[ Var\left( \sum _{j=1}^{m}h_{1,j}\left( Z_i\right) |Z_i\right) \right] \right\} \Vert \\&\quad = \Vert \sum _{i=1}^{n}\frac{4}{nm^2} \left\{ \frac{m^2}{12}\varvec{\Sigma }_X + \frac{m\left( t_n-1\right) }{12}\varvec{\Sigma }_X\right\} \Vert \\&\quad = \frac{m+t_n-1}{3m}\Vert \varvec{\Sigma }_X\Vert . \end{aligned}$$

By Assumption 4 and \(m >t_n\), we have \(\lim \sup _{n\rightarrow \infty }\Vert \sum _{i=1}^{n}\) \(Var\left( \frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \right) \Vert < \infty \). An application of the multivariate Lindeberg-Lévy central limit theorem (Hansen 2022) gives

$$\begin{aligned}&\left( h_{n,m}\right) ^{-1/2}\sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \nonumber \\&\quad {\mathop {\longrightarrow }\limits ^{d}}N\left( 0,\frac{1}{3}\varvec{\Sigma }_X\right) , \end{aligned}$$
(A19)

as \(r,n \rightarrow \infty \), where \(h_{n,m} = \left( m+t_n-1\right) /m\). Furthermore, it is easy to verify that

$$\begin{aligned} \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) = O_P\left( 1\right) , \end{aligned}$$
(A20)

and \(o_P\left( 1 + \Vert \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \Vert \right) =o_P(1)\). Moreover, as \(m > t_n\), we have \(0< t_n/m <1\) and \(1<h_{n,m} < 2\). Then, by Slutsky's theorem, we obtain the asymptotic distribution of \(n^{1/2}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \), i.e.,

$$\begin{aligned}&\left( \frac{h_{n,m}}{12g^2\left( 0\right) }\varvec{\Sigma }_{X}^{-1}\right) ^{-1/2}\sqrt{n}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \nonumber \\&\quad {\mathop {\longrightarrow }\limits ^{d}} N\left( {\varvec{0}},{\varvec{I}}_p\right) . \end{aligned}$$
(A21)

When \(m <t_n\), the proposed estimator is \(\sqrt{nm/t_n}\)-consistent by the triangle inequality. The asymptotic distribution of \(\sqrt{a_n^{-2}m}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \) is

$$\begin{aligned}&\left( \frac{d_{n,m}}{12g^2\left( 0\right) }\varvec{\Sigma }_{X}^{-1}\right) ^{-1/2}\sqrt{\frac{nm}{t_n}}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \nonumber \\&\quad {\mathop {\longrightarrow }\limits ^{d}} N\left( {\varvec{0}},{\varvec{I}}_p\right) , \end{aligned}$$
(A22)

where \(d_{n,m} = \left( m+t_n-1\right) /t_n\) and \(1<d_{n,m}<2\). This completes the proof of the corollary. \(\square \)

Proof of Theorem 3.2

The proof of Theorem 3.2 is similar to that of Theorem 3.1. The only difference is in dealing with the second-order moments of the stochastic weights.

Consider the weighted loss function

$$\begin{aligned} Q_n^*\left( \varvec{\beta }\right) =&\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n} \nonumber \\&\frac{\left( W_i+W_j\right) }{2}\left| Y_{ij}-X_{ij}^{T}\varvec{\beta }\right| . \end{aligned}$$
(A23)
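For concreteness, a minimal sketch of the additive-weight loss in (A23) is given below, in the same illustrative style as the product-weight sketch after the abstract; only the pair weight changes.

```python
# Sketch of the additive-weight pairwise loss Q_n^*(beta) in (A23); weight
# generation and minimization can be done as in the product-weight sketch.
import numpy as np

def additive_weight_loss(beta, X, Y, W):
    n = len(Y)
    res = X @ beta - Y
    dif = np.abs(res[:, None] - res[None, :])   # |Y_ij - X_ij^T beta|
    ww = (W[:, None] + W[None, :]) / 2.0        # additive weights (W_i + W_j)/2
    return 2.0 * np.triu(ww * dif, k=1).sum() / (n * (n - 1))
```

Note that with additive weights a pair contributes whenever either observation receives a positive weight, so the double sum cannot be restricted to the subsample alone.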

Let \(B\left( u\right) = Q_n^*\left( \varvec{\beta }_0+b_nu\right) -Q_n^*\left( \varvec{\beta }_0\right) \). Then, by Knight's identity, \(B\left( u\right) = 2B_1\left( u\right) + 2B_2\left( u\right) \), where

$$\begin{aligned} B_1\left( u\right) =&-\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}\frac{\left( W_i+W_j\right) }{2}\\&\times b_nX_{ij}^{T}u\left( \frac{1}{2}-I\left( \varepsilon _{ij}<0\right) \right) \\ =&-b_nu^TU_n^*,\\ B_2\left( u\right) =&\frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}\frac{\left( W_i+W_j\right) }{2}\\&\times \int _0^{b_nX_{ij}^{T}u} \left( I\left( \varepsilon _{ij}\le t\right) -I\left( \varepsilon _{ij}\le 0\right) \right) dt. \end{aligned}$$

For the first part of \(B\left( u\right) \), we can show the asymptotic normality of \(b_n^{-1}U_n^*\) similarly. To this end, let \(U_n^* = \frac{2}{n}\sum _{i=1}^{n}h_1^*\left( Z_i\right) + R_n^*\), where \(h_1^*\left( Z_i\right) =\frac{\left( W_i+1\right) }{2}\left( X_i - E\left[ X\right] \right) \) \(\left( F\left( \varepsilon _i\right) -1/2\right) \). We can verify that

$$\begin{aligned}&Var\left( \frac{2}{n}\sum _{i=1}^{n}h_1^*\left( Z_i\right) \right) = \frac{4}{n}E\left[ h_1^{*2}\left( Z_i\right) \right] \\&\quad = \frac{4}{n}E\left[ \frac{\left( W_i+1\right) ^2}{4}\left( F\left( \varepsilon _i\right) -\frac{1}{2}\right) ^2 \right. \\&\quad \quad \left. \times \left( X_i-E\left[ X\right] \right) \left( X_i-E\left[ X\right] \right) ^T\right] \\&\quad = \frac{t_n+3}{12n}\varvec{\Sigma }_{X}, \end{aligned}$$

and \(Var\left( R_n^*\right) = o\left( \frac{t_n+3}{n}\right) \). Since \(\left\{ h_1^*\left( Z_i\right) \right\} _{i=1}^{n}\) are i.i.d. random vectors, by the multivariate Lindeberg-Lévy central limit theorem, we have

$$\begin{aligned} b_n^{-1}U_n^*{\mathop {\longrightarrow }\limits ^{d}}N\left( {\varvec{0}},\frac{1}{12}\varvec{\Sigma }_{X}\right) . \end{aligned}$$
(A24)
Fig. 7 Simulation results for the comparison of various methods (full, unif, Lopt, perL, and poisR) for different r over \(N = 1000\) simulations in the 10-dimensional setting

Next, we show that \(B_2\left( u\right) = b_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( b_n^2\right) \). Write \(B_{2ij} = \frac{W_i+W_j}{2}\int _0^{b_nX_{ij}^{T}u}\left[ I\left( \varepsilon _{ij}\le t\right) -I\left( \varepsilon _{ij}\le 0\right) \right] dt\). Then \(B_2\left( u\right) \) can be rewritten as

$$\begin{aligned}&B_2\left( u\right) \nonumber \\&\quad = \frac{2}{n\left( n-1\right) }\sum _{1\le i<j\le n}\left\{ E\left[ B_{2ij}\right] + \left( B_{2ij} - E\left[ B_{2ij}\right] \right) \right\} \nonumber \\&\quad \triangleq B_2^{\left( 1\right) } +B_2^{\left( 2\right) }. \end{aligned}$$
(A25)

Similar to the proof of Lemmas A.2 and A.3, we can obtain

$$\begin{aligned}&E\left[ B_{2ij}\right] = b_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u + O\left( b_n^3\right) ,\\&E\left[ B_{2ij}^2\right] = O\left( t_nb_n^3\right) , \end{aligned}$$

by Assumptions 2 and  3. Thus, \(B_2\left( u\right) = b_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( b_n^2\right) \) follows from the fact that

$$\begin{aligned}&B_2^{\left( 1\right) } = b_n^2g\left( 0\right) u^T\varvec{\Sigma }_{X}u + O\left( b_n^3\right) ,\\&B_2^{\left( 2\right) } = o_P\left( b_n^2\right) . \end{aligned}$$

By arguments similar to those used in the proof of Theorem 3.1, we can show that

$$\begin{aligned} \widetilde{\varvec{\beta }}^* - \varvec{\beta }_0 = O_P\left( b_n\right) , \end{aligned}$$
(A26)

and

$$\begin{aligned} b_n^{-1}\left( \widetilde{\varvec{\beta }}^*-\varvec{\beta }_0\right) {\mathop {\longrightarrow }\limits ^{d}}N\left( {\varvec{0}},\frac{1}{48g^2\left( 0\right) }\varvec{\Sigma }_{X}^{-1}\right) . \end{aligned}$$
(A27)

This completes the proof of Theorem 3.2.

Table 3 The MSEs for the global estimator (Full) and the two proposed estimators with product weights (Product) and additive weights (Additive), respectively, for the 10-dimensional setting with \(m=10\) and \(n=10,000\)

The proof of Corollary 3.2 is similar, so we omit the details. \(\square \)

Appendix B: Additional numerical results

To further evaluate the performance of our proposed methods in higher-dimensional settings, we extend our analysis to a scenario where the dimension of \(\varvec{\beta }_0\) is increased to 10, with \(\varvec{\beta }_0=(-4,0.11,-0.9,-0.5,1,1,1,1,1,1)^T\). The performance measures and other simulation settings are the same as in Sect. 4.3. The simulation results for the MSEs are reported in Fig. 7 and Table 3. From Fig. 7, we observe trends and performance for the various subsampling methods similar to those in Fig. 3: the MSEs decrease as r increases, and the L-optimal subsampling method outperforms the uniform subsampling method. Additionally, our method with additive weights performs best, outperforming both the product-weight method and the perL method. Table 3 shows that the results accord with the findings in the 5-dimensional setting. Furthermore, the ratio of the MSE of Algorithm 4 to that of Algorithm 2 also tends to increase as r increases, and the different error distributions have only a slight effect on this ratio.
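As a rough guide to this setting, the following sketch generates one synthetic data set of the stated dimension; the standard normal design and t(3) errors are illustrative assumptions, since the exact specification follows Sect. 4.3, which is not reproduced here.

```python
# Hypothetical data generation for the 10-dimensional experiment.  The standard
# normal design and t(3) errors are illustrative assumptions, not the exact
# specification of Sect. 4.3.
import numpy as np

rng = np.random.default_rng(2024)
n, p = 10_000, 10
beta0 = np.array([-4.0, 0.11, -0.9, -0.5, 1, 1, 1, 1, 1, 1])

X = rng.standard_normal((n, p))                 # assumed covariate distribution
Y = X @ beta0 + rng.standard_t(df=3, size=n)    # heavy-tailed errors favor rank regression
```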

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

He, S., Xia, X. Random perturbation subsampling for rank regression with massive data. Stat Comput 35, 14 (2025). https://doi.org/10.1007/s11222-024-10548-9
