Abstract
Rank regression plays a fundamental role in statistical data analysis owing to its robustness and high efficiency, and it has been widely used in various scientific fields. However, when the data size is massive but computing resources are limited, estimating a rank regression model can incur an unacceptably high computational cost. To handle this issue, this paper applies a random perturbation subsampling method to rank regression models. Specifically, we develop two repeated random perturbation subsampling algorithms to estimate the model parameters. Two weighting strategies, one with product weights and one with additive weights, are examined in the objective function. Unlike existing optimal and Poisson subsampling methods, our methods do not require the explicit calculation of subsampling probabilities for all data points, which often involves estimating unknown parameters, and are therefore easier to implement. Theoretically, statistical justifications, including consistency and asymptotic normality, are provided for the proposed estimators. Extensive simulation studies and an empirical application illustrate the effectiveness of the proposed methods.
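To make the idea concrete, the following minimal Python sketch illustrates a perturbation-weighted version of the pairwise (Jaeckel-type) rank regression loss with the two weighting strategies mentioned above. It is only an illustration under our own assumptions: the exponential(1) weight distribution, the exact forms of the product and additive weights, the function names, and the derivative-free optimizer are our choices and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def perturbed_rank_loss(beta, X, y, w, scheme="additive"):
    """Perturbation-weighted pairwise L1 (rank-type) dispersion.

    The pairwise form |e_i - e_j| is the classical rank-regression loss;
    the two weighting schemes below are one plausible reading of the
    'product' and 'additive' strategies described in the abstract.
    """
    e = y - X @ beta                            # residuals
    diff = np.abs(e[:, None] - e[None, :])      # |e_i - e_j| for all pairs
    if scheme == "product":
        wmat = w[:, None] * w[None, :]          # W_i * W_j
    else:
        wmat = (w[:, None] + w[None, :]) / 2.0  # (W_i + W_j) / 2
    n = len(y)
    return np.sum(wmat * diff) / (n * (n - 1))

rng = np.random.default_rng(1)
n, d = 500, 5
X = rng.standard_normal((n, d))
beta_true = np.array([-4.0, 0.11, -0.9, -0.5, 1.0])
y = X @ beta_true + rng.standard_normal(n)

# Mean-one random perturbation weights; exponential(1) is one common
# weighted-bootstrap choice and is only an assumption here.  In a
# subsampling variant many weights would instead be exactly zero.
w = rng.exponential(1.0, size=n)

fit = minimize(perturbed_rank_loss, x0=np.zeros(d),
               args=(X, y, w, "additive"), method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-8})
print("perturbed rank estimate:", np.round(fit.x, 3))
```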
Data availability
Data is provided within the manuscript or supplementary information files.
References
Ai, M., Wang, F., Yu, J., Zhang, H.: Optimal subsampling for large-scale quantile regression. J. Complex. 62, 101512 (2021). https://doi.org/10.1016/j.jco.2020.101512
Ai, M., Yu, J., Zhang, H., Wang, H.: Optimal subsampling algorithms for big data regressions. Stat. Sin. 31(2), 749–772 (2021). https://doi.org/10.5705/ss.202018.0439
Bose, A., Chatterjee, S.: U-Statistics, Mm-Estimators and Resampling. Springer, Singapore (2018)
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 46(3), 1352–1382 (2018). https://doi.org/10.1214/17-AOS1587
Chao, Y., Huang, L., Ma, X., Sun, J.: Optimal subsampling for modal regression in massive data. Metrika 87(4), 379–409 (2024). https://doi.org/10.1007/s00184-023-00916-2
Efron, B.: Bootstrap methods: Another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979). https://doi.org/10.1007/978-1-4612-4380-9_41
Hansen, B.: Econometrics. Princeton University Press, Princeton (2022)
Huang, B., Liu, Y., Peng, L.: Weighted bootstrap for two-sample U-statistics. J. Stat. Plan. Inference 226, 86–99 (2023). https://doi.org/10.1016/j.jspi.2023.02.004
Hoeffding, W.: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19(3), 293–325 (1948). https://doi.org/10.1214/aoms/1177730196
Jaeckel, L.A.: Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Stat. 43(5), 1449–1458 (1972). https://doi.org/10.1214/aoms/1177692377
Jordan, M.I., Lee, J.D., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 114(526), 668–681 (2019). https://doi.org/10.1080/01621459.2018.1429274
Ju, J., Wang, M., Zhao, S.: Subsampling for big data linear models with measurement errors. arXiv:2403.04361 (2024)
Knight, K.: Limiting distributions for \(l_1\) regression estimators under general conditions. Ann. Stat. 26(2), 755–770 (1998). https://doi.org/10.1214/aos/1028144858
Leng, C.: Variable selection and coefficient estimation via regularized rank regression. Stat. Sin. 20(1), 167–181 (2010). https://doi.org/10.1051/epjconf/20100402005
Lid Hjort, N., Pollard, D.: Asymptotics for minimisers of convex processes. arXiv:1107.3806 (2011)
Lee, J., Wang, H., Schifano, E.D.: Online updating method to correct for measurement error in big data streams. Comput. Stat. Data Anal. 149, 106976 (2020). https://doi.org/10.1016/j.csda.2020.106976
Luan, J., Wang, H., Wang, K., Zhang, B.: Robust distributed estimation and variable selection for massive datasets via rank regression. Ann. Inst. Stat. Math. 74, 435–450 (2021). https://doi.org/10.1007/s10463-021-00803-5
Li, X., Xia, X., Zhang, Z.: Distributed subsampling for multiplicative regression. Stat. Comput. 34(5), 1–20 (2024). https://doi.org/10.1007/s11222-024-10477-7
Li, X., Xia, X., Zhang, Z.: Poisson subsampling-based estimation for growing-dimensional expectile regression in massive data. Stat. Comput. 34, 133 (2024). https://doi.org/10.1007/s11222-024-10449-x
Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res. 16(27), 861–911 (2015). https://doi.org/10.48550/arXiv.1306.5362
Portnoy, S., Koenker, R.: The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Stat. Sci. 12, 279–296 (1997). https://doi.org/10.1214/ss/1030037960
Ren, M., Zhao, S., Wang, M.: Optimal subsampling for least absolute relative error estimators with massive data. J. Complex. 74, 101694 (2023). https://doi.org/10.1016/j.jco.2022.101694
Schifano, E.D., Wu, J., Wang, C., Yan, J., Chen, M.-H.: Online updating of statistical inference in the big data setting. Technometrics 58(3), 393–403 (2016). https://doi.org/10.1080/00401706.2016.1142900
Tüfekci, P.: Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 60, 126–140 (2014). https://doi.org/10.1016/j.ijepes.2014.02.027
Wang, H.: More efficient estimation for logistic regression with optimal subsamples. J. Mach. Learn. Res. 20(132), 1–59 (2018). https://doi.org/10.48550/arXiv.1802.02698
Wang, H., Ma, Y.: Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112 (2020). https://doi.org/10.1093/biomet/asaa043
Wang, L., Peng, B., Bradic, J., Li, R., Wu, Y.: A tuning-free robust and efficient approach to high-dimensional regression (with discussion). J. Am. Stat. Assoc. 115, 1700–1714 (2020). https://doi.org/10.1080/01621459.2020.1840989
Wang, H., Zhu, R., Ma, P.: Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 113(522), 829–844 (2018). https://doi.org/10.1080/01621459.2017.1292914
Yu, J., Ai, M., Ye, Z.: A review on design inspired subsampling for big data. Stat. Pap. 65(2), 467–510 (2024). https://doi.org/10.1007/s00362-022-01386-w
Yao, Y., Jin, Z.: A perturbation subsampling for large scale data. Stat. Sin. 34(20), 911–932 (2024). https://doi.org/10.5705/ss.202022.0020
Yao, Y., Wang, H.: A review on optimal subsampling methods for massive datasets. J. Data Sci. 19(1), 151–172 (2021). https://doi.org/10.6339/21-JDS999
Yu, J., Wang, H., Ai, M., Zhang, H.: Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 117, 265–276 (2020). https://doi.org/10.1080/01621459.2020.1773832
Zhou, L., Wang, B., Zou, H.: Sparse convoluted rank regression in high dimensions. J. Am. Stat. Assoc. 119(546), 1500–1512 (2024). https://doi.org/10.1080/01621459.2023.2202433
Acknowledgements
The authors thank the Editor and the anonymous reviewers for their constructive comments and suggestions, which greatly improved the quality of this work. Xia’s work was supported by the National Natural Science Foundation of China (Grant Number 11801202) and the Fundamental Research Funds for the Central Universities (Grant Number 2021CDJQY-047).
Author information
Authors and Affiliations
Contributions
Sijin He: Conceptualization, Methodology, Software, Validation, Writing - original draft. Xiaochao Xia: Supervision, Methodology, Writing - review & editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proofs
We denote the cumulative distribution function (CDF) and probability density function (PDF) of \(\varepsilon _{ij} = \varepsilon _i - \varepsilon _j\) by G and g, respectively. Simple algebra yields \(g\left( s\right) = \int f(t)f(t-s)dt\) and \(g(0) = \int f^2(t)dt = \omega \) (for instance, if the errors are standard normal, then \(\varepsilon _{ij} \sim N(0,2)\) and \(\omega = g(0) = 1/(2\sqrt{\pi }) \approx 0.28\)). To facilitate the proof, we first introduce some notation. Let \(Z_i = (W_i,Y_i,X_i^T)^T \) and
To prove the main results stated in Sect. 3, we need to prove the following lemmas.
Lemma A.1
Under Assumptions 1 and 4, as \(n \longrightarrow \infty \)
Proof
First note that \(U_n\) is a U-statistic of degree 2 with kernel h and \(E\left[ h\left( Z_i,Z_j\right) \right] = 0\). Recall the definition of \(h_1\left( Z_i\right) \) and denote \(\delta _1= Var\left( h_1\left( Z_i\right) \right) \) and \(\delta _2 = Var\left( h\left( Z_i,Z_j\right) \right) \). It can be shown that
where
Thus, we have
Next, we have the following decomposition
In what follows, we show that \( R_n = o_P\left( a_n\right) \). By a direct calculation, it can be seen that the decomposition (A4) is an orthogonal decomposition, namely,
Since \(Var\left( \frac{2}{n}\sum _{i=1}^{n}h_1\left( Z_i\right) \right) = \frac{t_n}{3n}\varvec{\Sigma }_{X}\) and \(E\left[ R_n\right] = 0\), it follows that
Therefore, an application of the multivariate Lindeberg-Lévy central limit theorem (Hansen 2022) yields
\(\square \)
Lemma A.2
Under Assumptions 2 and 3, we have
Proof
A direct calculation gives
where \(\zeta \in \left( 0,t\right) \). Then it follows that
where \(c_1\) is a constant and the last equality is due to Assumption 2. Hence,
\(\square \)
Lemma A.3
Under Assumptions 2 and 3, we have
Proof
First, using Hölder’s inequality, we have
Using the same arguments as in Lemma A.1 and under Assumptions 2 and 3, we have
Thus, the result follows from the fact
\(\square \)
We are now ready to prove the consistency and the asymptotic normality of \(\widetilde{\varvec{\beta }}\) stated in Theorem 3.1.
Proof of Theorem 3.1
First note that the estimator \(\widetilde{\varvec{\beta }}\) satisfies
To show \(\widetilde{\varvec{\beta }} - \varvec{\beta }_0 = O_P\left( a_n\right) \), it is sufficient to show that, for any given \(\xi >0\) and \(u \in {\mathbb {R}}^d\), there exists a large constant C such that
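(The following display is reconstructed in its standard form from the surrounding argument and the conclusion drawn at the end of this proof.) \(P\left\{ \inf _{\Vert u \Vert = C} Q_n\left( \varvec{\beta }_0+a_nu\right) > Q_n\left( \varvec{\beta }_0\right) \right\} \ge 1-\xi .\)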
Let \(A\left( u\right) =Q_n\left( \varvec{\beta }_0+a_nu\right) -Q_n\left( \varvec{\beta }_0\right) \). Then we have
Due to Knight’s identity (Knight 1998)
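which, in its standard form (restated here for completeness), reads, for any \(x \ne 0\),
\( |x-y| - |x| = -y\left[ I\left( x>0\right) - I\left( x<0\right) \right] + 2\int _{0}^{y}\left[ I\left( x\le s\right) - I\left( x\le 0\right) \right] ds, \)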
we have
where
The first part \(A_1\left( u\right) \) of \(A\left( u\right) \) is handled directly by Lemma A.1. Therefore, we only need to show that the second part satisfies \(A_2\left( u\right) = a_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( a_n^2\right) \). Note that
where
According to Hoeffding (1948) and Lemma A.3,
Thus, it follows from Chebyshev’s inequality and Assumption 5 that
Combining Lemma A.2 and (A14), we have
Furthermore, by Lemma A.1, (A11) and (A15), we can obtain
So, by choosing a sufficiently large C, the term \(A_1\left( u\right) \) is dominated by \(A_2\left( u\right) \) with probability \(1-\xi \). Hence, \( Q_n\left( \varvec{\beta }_0+a_nu\right) -Q_n\left( \varvec{\beta }_0\right) > 0 \) with probability \(1-\xi \). This implies that, with probability approaching 1, there exists a local minimizer \(\widetilde{\varvec{\beta }}\) such that \(\widetilde{\varvec{\beta }} - \varvec{\beta }_0 = O_P\left( a_n\right) \). Note that \(\widetilde{\varvec{\beta }}\) is also the global minimizer due to the convexity of the objective function.
It remains to show the asymptotic normality. To this end, note that \(a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) \) is the minimizer of the convex function \(a_n^{-1}A\left( u\right) \). Thus, according to the corollary of Lid Hjort and Pollard (2011, page 3), \(a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) \) satisfies that
Invoking Lemma A.1 and Slutsky's theorem, we have
This completes the proof of the theorem. \(\square \)
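For the reader's convenience, the convexity result invoked above is, in the form we recall it (which may differ slightly in notation from Lid Hjort and Pollard 2011), the following: if a convex random function \(A_n\left( u\right) \) admits the representation \(A_n\left( u\right) = \frac{1}{2}u^TVu + U_n^Tu + C_n + r_n\left( u\right) \) with \(V\) symmetric positive definite, \(U_n = O_P\left( 1\right) \), \(C_n\) not depending on \(u\), and \(r_n\left( u\right) = o_P\left( 1\right) \) for each fixed \(u\), then its minimizer satisfies \(\arg \min _u A_n\left( u\right) = -V^{-1}U_n + o_P\left( 1\right) \).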
Proof of Corollary 3.1
First, we consider the case \(m >t_n\). By the triangle inequality and the definition of \(\widetilde{\varvec{\beta }}^{\left( m\right) }\), we have
Note that \(\Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert = O_p\left( a_n\right) \) implies \(\frac{1}{m}\sum _{j=1}^{m} \Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert = O_p\left( a_nm^{-1/2}\right) \). Therefore, when \(a_n^{-2}m >n\), we have \(\Vert \widetilde{\varvec{\beta }}^{\left( m\right) }- \varvec{\beta }_0 \Vert = O_p\left( n^{-1/2}\right) \), which shows that \(\widetilde{\varvec{\beta }}^{\left( m\right) }\) is \(\sqrt{n}\)-consistent. Next, we derive the asymptotic distribution of \(\widetilde{\varvec{\beta }}^{\left( m\right) }\). Denote \(H^{-1} = \left( 2\,g\left( 0\right) \right) ^{-1}\varvec{\Sigma }_X^{-1} \). Then
where \(W_{i,j}\) represents the stochastic weight.
Now we show that \( \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) = O_P\left( 1\right) \), which is a sum of n i.i.d. terms. Conditional on the ith observation \(\left( Y_i,X_i\right) \), the stochastic weights \(W_{i,j}\), \(j=1,\dots ,m\), are independent. Thus, it follows from \(E\left[ \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m} h_{1,j}\left( Z_i\right) \right] = 0\) that
By Assumption 4 and \(m >t_n\), we have \(\limsup _{n\rightarrow \infty }\Vert \sum _{i=1}^{n} Var\left( \frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \right) \Vert < \infty \). An application of the multivariate Lindeberg-Lévy central limit theorem (Hansen 2022) gives
as \(r,n \rightarrow \infty \), where \(h_{n,m} = \left( m+t_n-1\right) /m\). Furthermore, it can be easily verified that
and \(o_P\left( 1 + \Vert \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \Vert \right) =o_P(1)\). Moreover, since \(m > t_n\), we have \(0< t_n/m <1\) and \(1<h_{n,m} < 2\). Then, by Slutsky's theorem, we obtain the asymptotic distribution of \(n^{1/2}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \), i.e.,
When \(m <t_n\), the proposed estimator is \(\sqrt{nm/t_n}\)-consistent by the triangle inequality. The asymptotic distribution of \(\sqrt{a_n^{-2}m}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \) is
where \(d_{n,m} = \left( m+t_n-1\right) /t_n\) and \(1<d_{n,m}<2\). This completes the proof of the corollary. \(\square \)
Proof of Theorem 3.2
The proof of Theorem 3.2 is similar to that of Theorem 3.1. The only difference is in dealing with the second-order moments of the stochastic weights.
Consider the weighted loss function
Let \(B\left( u\right) = Q_n^*\left( \varvec{\beta }_0+b_nu\right) -Q_n^*\left( \varvec{\beta }_0\right) \). Then, we can divide \(B\left( u\right) \) into two parts \(B_1\left( u\right) \) and \(B_2\left( u\right) \), where
For the first part of \(B\left( u\right) \), we can show the asymptotic normality of \(b_n^{-1}U_n^*\) similarly. To this end, let \(U_n^* = \frac{2}{n}\sum _{i=1}^{n}h_1^*\left( Z_i\right) + R_n^*\), where \(h_1^*\left( Z_i\right) =\frac{W_i+1}{2}\left( X_i - E\left[ X\right] \right) \left( F\left( \varepsilon _i\right) -1/2\right) \). We can verify that
and \(Var\left( R_n^*\right) = o\left( \frac{t_n+3}{n}\right) \). Since \(\left\{ h_1^*\left( Z_i\right) \right\} _{i=1}^{n}\) are i.i.d. random vectors, by the multivariate Lindeberg-Lévy central limit theorem, we have
Next, we show that \(B_2\left( u\right) = b_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( b_n^2\right) \). Note that \(B_2\left( u\right) \) can be rewritten as
Similar to the proof of Lemmas A.2 and A.3, we can obtain
by Assumptions 2 and 3. Thus, \(B_2\left( u\right) = b_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( b_n^2\right) \) follows from the fact that
By the arguments similar to those used in Theorem 3.1, we can prove that
and
This completes the proof of Theorem 3.2.
The proof of Corollary 3.2 is similar, so we omit the details. \(\square \)
Appendix B: Additional numerical results
To further evaluate the performance of the proposed methods in higher-dimensional settings, we extend our analysis to a scenario where the dimension of \(\varvec{\beta }_0\) is increased to 10, with \(\varvec{\beta }_0=(-4,0.11,-0.9,-0.5,1,1,1,1,1,1)^T\). The performance measures and other simulation settings are the same as in Sect. 4.3. The simulation results for the MSEs are reported in Fig. 7 and Table 3. Figure 7 shows trends similar to those in Fig. 3: the MSEs decrease as r increases, and the L-optimal subsampling method outperforms the uniform subsampling method. Moreover, our method with additive weights performs best, compared with the product-weights method and the perL method. Table 3 shows that the results accord with the findings in the 5-dimensional setting. Furthermore, the ratio of the MSE of Algorithm 4 to that of Algorithm 2 tends to increase as r increases, and the different error distributions have only a slight effect on this ratio.
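For completeness, a minimal sketch of how such a Monte Carlo MSE comparison might be organized is given below. The plain least-squares call is only a hypothetical placeholder for the paper's subsampling algorithms, and the error distribution, sample size, and number of replications are illustrative assumptions rather than the settings of Sect. 4.3.

```python
import numpy as np

rng = np.random.default_rng(2025)
beta0 = np.array([-4, 0.11, -0.9, -0.5, 1, 1, 1, 1, 1, 1], dtype=float)
n, d, n_rep = 10_000, beta0.size, 100

def placeholder_fit(X, y):
    """Stand-in estimator (ordinary least squares); in the actual study this
    would be one of the subsampling-based rank regression algorithms."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

sq_err = np.empty(n_rep)
for rep in range(n_rep):
    X = rng.standard_normal((n, d))
    y = X @ beta0 + rng.standard_normal(n)      # illustrative error distribution
    beta_hat = placeholder_fit(X, y)
    sq_err[rep] = np.sum((beta_hat - beta0) ** 2)

# One common definition of the empirical MSE: average squared L2 error
print("empirical MSE:", sq_err.mean())
```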
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, S., Xia, X. Random perturbation subsampling for rank regression with massive data. Stat Comput 35, 14 (2025). https://doi.org/10.1007/s11222-024-10548-9