Abstract
Rank regression plays a fundamental role in statistical data analysis owing to its robustness and high efficiency, and it has been widely used in various scientific fields. However, when the data size is massive but computing resources are limited, estimating a rank regression model can incur an unacceptably high computational cost. To handle this issue, this paper applies a random perturbation subsampling method to rank regression models. Specifically, we develop two repeated random perturbation subsampling algorithms to estimate the model parameters. Two weighting strategies, one with product weights and one with additive weights, are examined in the objective function. Unlike existing optimal and Poisson subsampling methods, our methods do not require the explicit calculation of subsampling probabilities for all data points, which often involves estimating unknown parameters, and are therefore easier to implement. Theoretically, statistical justifications, including consistency and asymptotic normality, are provided for the proposed estimators. Extensive simulation studies and an empirical application illustrate the effectiveness of the proposed methods.
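To make the idea concrete, the following minimal Python sketch illustrates a perturbation-weighted version of the pairwise (Jaeckel-type) rank regression loss with the two weighting strategies mentioned above. It is only an illustration under our own assumptions: the exponential(1) weight distribution, the exact forms of the product and additive weights, the function names, and the derivative-free optimizer are our choices and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def perturbed_rank_loss(beta, X, y, w, scheme="additive"):
    """Perturbation-weighted pairwise L1 (rank-type) dispersion.

    The pairwise form |e_i - e_j| is the classical rank-regression loss;
    the two weighting schemes below are one plausible reading of the
    'product' and 'additive' strategies described in the abstract.
    """
    e = y - X @ beta                            # residuals
    diff = np.abs(e[:, None] - e[None, :])      # |e_i - e_j| for all pairs
    if scheme == "product":
        wmat = w[:, None] * w[None, :]          # W_i * W_j
    else:
        wmat = (w[:, None] + w[None, :]) / 2.0  # (W_i + W_j) / 2
    n = len(y)
    return np.sum(wmat * diff) / (n * (n - 1))

rng = np.random.default_rng(1)
n, d = 500, 5
X = rng.standard_normal((n, d))
beta_true = np.array([-4.0, 0.11, -0.9, -0.5, 1.0])
y = X @ beta_true + rng.standard_normal(n)

# Mean-one random perturbation weights; exponential(1) is one common
# weighted-bootstrap choice and is only an assumption here.  In a
# subsampling variant many weights would instead be exactly zero.
w = rng.exponential(1.0, size=n)

fit = minimize(perturbed_rank_loss, x0=np.zeros(d),
               args=(X, y, w, "additive"), method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-8})
print("perturbed rank estimate:", np.round(fit.x, 3))
```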
Data availability
Data is provided within the manuscript or supplementary information files.
References
Ai, M., Wang, F., Yu, J., Zhang, H.: Optimal subsampling for large-scale quantile regression. J. Complex. 62, 101512 (2021). https://doi.org/10.1016/j.jco.2020.101512
Ai, M., Yu, J., Zhang, H., Wang, H.: Optimal subsampling algorithms for big data regressions. Stat. Sin. 31(2), 749–772 (2021). https://doi.org/10.5705/ss.202018.0439
Bose, A., Chatterjee, S.: U-Statistics, Mm-Estimators and Resampling. Springer, Singapore (2018)
Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Stat. 46(3), 1352–1382 (2018). https://doi.org/10.1214/17-AOS1587
Chao, Y., Huang, L., Ma, X., Sun, J.: Optimal subsampling for modal regression in massive data. Metrika 87(4), 379–409 (2024). https://doi.org/10.1007/s00184-023-00916-2
Efron, B.: Bootstrap methods: Another look at the jackknife. Ann. Stat. 7(1), 1–26 (1979). https://doi.org/10.1007/978-1-4612-4380-9_41
Hansen, B.: Econometrics. Princeton University Press, Princeton (2022)
Huang, B., Liu, Y., Peng, L.: Weighted bootstrap for two-sample U-statistics. J. Stat. Plan. Inference 226, 86–99 (2023). https://doi.org/10.1016/j.jspi.2023.02.004
Hoeffding, W.: A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19(3), 293–325 (1948). https://doi.org/10.1214/aoms/1177730196
Jaeckel, L.A.: Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Stat. 43(5), 1449–1458 (1972). https://doi.org/10.1214/aoms/1177692377
Jordan, M.I., Lee, J.D., Yang, Y.: Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 114(526), 668–681 (2019). https://doi.org/10.1080/01621459.2018.1429274
Ju, J., Wang, M., Zhao, S.: Subsampling for big data linear models with measurement errors. arXiv:2403.04361 (2024)
Knight, K.: Limiting distributions for \(l_1\) regression estimators under general conditions. Ann. Stat. 26(2), 755–770 (1998). https://doi.org/10.1214/aos/1028144858
Leng, C.: Variable selection and coefficient estimation via regularized rank regression. Stat. Sin. 20(1), 167–181 (2010). https://doi.org/10.1051/epjconf/20100402005
Lid Hjort, N., Pollard, D.: Asymptotics for minimisers of convex processes. arXiv:1107.3806 (2011)
Lee, J., Wang, H., Schifano, E.D.: Online updating method to correct for measurement error in big data streams. Comput. Stat. Data Anal. 149, 106976 (2020). https://doi.org/10.1016/j.csda.2020.106976
Luan, J., Wang, H., Wang, K., Zhang, B.: Robust distributed estimation and variable selection for massive datasets via rank regression. Ann. Inst. Stat. Math. 74, 435–450 (2021). https://doi.org/10.1007/s10463-021-00803-5
Li, X., Xia, X., Zhang, Z.: Distributed subsampling for multiplicative regression. Stat. Comput. 34(5), 1–20 (2024). https://doi.org/10.1007/s11222-024-10477-7
Li, X., Xia, X., Zhang, Z.: Poisson subsampling-based estimation for growing-dimensional expectile regression in massive data. Stat. Comput. 34, 133 (2024). https://doi.org/10.1007/s11222-024-10449-x
Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res. 16(27), 861–911 (2015). https://doi.org/10.48550/arXiv.1306.5362
Portnoy, S., Koenker, R.: The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Stat. Sci. 12, 279–296 (1997). https://doi.org/10.1214/ss/1030037960
Ren, M., Zhao, S., Wang, M.: Optimal subsampling for least absolute relative error estimators with massive data. J. Complex. 74, 101694 (2023). https://doi.org/10.1016/j.jco.2022.101694
Schifano, E.D., Wu, J., Wang, C., Yan, J., Chen, M.-H.: Online updating of statistical inference in the big data setting. Technometrics 58(3), 393–403 (2016). https://doi.org/10.1080/00401706.2016.1142900
Tüfekci, P.: Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 60, 126–140 (2014). https://doi.org/10.1016/j.ijepes.2014.02.027
Wang, H.: More efficient estimation for logistic regression with optimal subsamples. J. Mach. Learn. Res. 20(132), 1–59 (2018). https://doi.org/10.48550/arXiv.1802.02698
Wang, H., Ma, Y.: Optimal subsampling for quantile regression in big data. Biometrika 108(1), 99–112 (2020). https://doi.org/10.1093/biomet/asaa043
Wang, L., Peng, B., Bradic, J., Li, R., Wu, Y.: A tuning-free robust and efficient approach to high-dimensional regression (with discussion). J. Am. Stat. Assoc. 115, 1700–1714 (2020). https://doi.org/10.1080/01621459.2020.1840989
Wang, H., Zhu, R., Ma, P.: Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 113(522), 829–844 (2018). https://doi.org/10.1080/01621459.2017.1292914
Yu, J., Ai, M., Ye, Z.: A review on design inspired subsampling for big data. Stat. Pap. 65(2), 467–510 (2024). https://doi.org/10.1007/s00362-022-01386-w
Yao, Y., Jin, Z.: A perturbation subsampling for large scale data. Stat. Sin. 34(20), 911–932 (2024). https://doi.org/10.5705/ss.202022.0020
Yao, Y., Wang, H.: A review on optimal subsampling methods for massive datasets. J. Data Sci. 19(1), 151–172 (2021). https://doi.org/10.6339/21-JDS999
Yu, J., Wang, H., Ai, M., Zhang, H.: Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 117, 265–276 (2020). https://doi.org/10.1080/01621459.2020.1773832
Zhou, L., Wang, B., Zou, H.: Sparse convoluted rank regression in high dimensions. J. Am. Stat. Assoc. 119(546), 1500–1512 (2024). https://doi.org/10.1080/01621459.2023.2202433
Acknowledgements
The authors thank the Editor and the anonymous reviewers for their constructive comments and suggestions, which greatly improved the quality of this work. Xia’s work was supported by the National Natural Science Foundation of China (Grant Number 11801202) and the Fundamental Research Funds for the Central Universities (Grant Number 2021CDJQY-047).
Author information
Authors and Affiliations
Contributions
Sijin He: Conceptualization, Methodology, Software, Validation, Writing - original draft. Xiaochao Xia: Supervision, Methodology, Writing - review & editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proofs
We denote the cumulative distribution function (CDF) and probability density function (PDF) of \(\varepsilon _{ij} = \varepsilon _i - \varepsilon _j\) by G and g, respectively. Simple algebra yields \(g\left( s\right) = \int f(t)f(t-s)dt\) and \(g(0) = \int f^2(t)dt = \omega \) (for instance, if the errors are standard normal, then \(\varepsilon _{ij} \sim N(0,2)\) and \(\omega = g(0) = 1/(2\sqrt{\pi }) \approx 0.28\)). To facilitate the proof, we first introduce some notation. Let \(Z_i = (W_i,Y_i,X_i^T)^T \) and
To prove the main results stated in Sect. 3, we need to prove the following lemmas.
Lemma A.1
Under Assumptions 1 and 4, as \(n \longrightarrow \infty \)
Proof
First note that \(U_n\) is a U-statistic of degree 2 with kernel h and \(E\left[ h\left( Z_i,Z_j\right) \right] = 0\). Recall the definition of \(h_1\left( Z_i\right) \) and denote \(\delta _1= Var\left( h_1\left( Z_i\right) \right) \) and \(\delta _2 = Var\left( h\left( Z_i,Z_j\right) \right) \). It can be shown that
where
Thus, we have
Next, we have the following decomposition
In what follows, we show that \( R_n = o_P\left( a_n\right) \). By a direct calculation, it can be seen that the decomposition (A4) is an orthogonal decomposition, namely,
Since \(Var\left( \frac{2}{n}\sum _{i=1}^{n}h_1\left( Z_i\right) \right) = \frac{t_n}{3n}\varvec{\Sigma }_{X}\) and \(E\left[ R_n\right] = 0\), it follows that
Therefore, an application of the multivariate Lindeberg-Lévy central limit theorem (Hansen 2022) yields
\(\square \)
Lemma A.2
Under Assumptions 2 and 3, we have
Proof
A direct calculation gives
where \(\zeta \in \left( 0,t\right) \). Then it follows that
where \(c_1\) is a constant and the last equality is due to Assumption 2. Hence,
\(\square \)
Lemma A.3
Under Assumptions 2 and 3, we have
Proof
First, using Hölder’s inequality, we have
Using the same arguments as in Lemma A.1 and under Assumptions 2 and 3, we have
Thus, the result follows from the fact
\(\square \)
We are now ready to prove the consistency and the asymptotic normality of \(\widetilde{\varvec{\beta }}\) stated in Theorem 3.1.
Proof of Theorem 3.1
First note that the estimator \(\widetilde{\varvec{\beta }}\) satisfies
To show \(\widetilde{\varvec{\beta }} - \varvec{\beta }_0 = O_P\left( a_n\right) \), it is sufficient to show that, for any given \(\xi >0\) and \(u \in {\mathbb {R}}^d\), there exists a large constant C such that
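(The following display is reconstructed in its standard form from the surrounding argument and the conclusion drawn at the end of this proof.) \(P\left\{ \inf _{\Vert u \Vert = C} Q_n\left( \varvec{\beta }_0+a_nu\right) > Q_n\left( \varvec{\beta }_0\right) \right\} \ge 1-\xi .\)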
Let \(A\left( u\right) =Q_n\left( \varvec{\beta }_0+a_nu\right) -Q_n\left( \varvec{\beta }_0\right) \). Then we have
Due to Knight’s identity (Knight 1998)
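which, in its standard form (restated here for completeness), reads, for any \(x \ne 0\),
\( |x-y| - |x| = -y\left[ I\left( x>0\right) - I\left( x<0\right) \right] + 2\int _{0}^{y}\left[ I\left( x\le s\right) - I\left( x\le 0\right) \right] ds, \)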
we have
where
The first part \(A_1\left( u\right) \) of \(A\left( u\right) \) is handled directly by Lemma A.1. Therefore, we only need to show that the second part satisfies \(A_2\left( u\right) = a_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( a_n^2\right) \). Note that
where
According to Hoeffding (1948) and Lemma A.3,
Thus, it follows from Chebyshev’s inequality and Assumption 5 that
Combining Lemma A.2 and (A14), we have
Furthermore, by Lemma A.1, (A11) and (A15), we can obtain
So, by choosing a sufficiently large C, the term \(A_1\left( u\right) \) is dominated by \(A_2\left( u\right) \) with probability \(1-\xi \). Hence, \( Q_n\left( \varvec{\beta }_0+a_nu\right) -Q_n\left( \varvec{\beta }_0\right) > 0 \) with probability \(1-\xi \). This implies that, with probability approaching 1, there exists a local minimizer \(\widetilde{\varvec{\beta }}\) such that \(\widetilde{\varvec{\beta }} - \varvec{\beta }_0 = O_P\left( a_n\right) \). Note that \(\widetilde{\varvec{\beta }}\) is also the global minimizer due to the convexity of the objective function.
It remains to show the asymptotic normality. To this end, note that \(a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) \) is the minimizer of the convex function \(a_n^{-1}A\left( u\right) \). Thus, according to the corollary of Lid Hjort and Pollard (2011, page 3), \(a_n^{-1}\left( \widetilde{\varvec{\beta }}-\varvec{\beta }_0\right) \) satisfies that
Invoking Lemma A.1 and Slutsky's theorem, we have
This completes the proof of the theorem. \(\square \)
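For the reader's convenience, the convexity result invoked above is, in the form we recall it (which may differ slightly in notation from Lid Hjort and Pollard 2011), the following: if a convex random function \(A_n\left( u\right) \) admits the representation \(A_n\left( u\right) = \frac{1}{2}u^TVu + U_n^Tu + C_n + r_n\left( u\right) \) with \(V\) symmetric positive definite, \(U_n = O_P\left( 1\right) \), \(C_n\) not depending on \(u\), and \(r_n\left( u\right) = o_P\left( 1\right) \) for each fixed \(u\), then its minimizer satisfies \(\arg \min _u A_n\left( u\right) = -V^{-1}U_n + o_P\left( 1\right) \).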
Proof of Corollary 3.1
First, we consider the case \(m >t_n\). By the triangle inequality and the definition of \(\widetilde{\varvec{\beta }}^{\left( m\right) }\), we have
Note that \(\Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert = O_p\left( a_n\right) \) implies \(\frac{1}{m}\sum _{j=1}^{m} \Vert \widetilde{\varvec{\beta }}_j - \varvec{\beta }_0 \Vert = O_p\left( a_nm^{-1/2}\right) \). Therefore, when \(a_n^{-2}m >n\), we have \(\Vert \widetilde{\varvec{\beta }}^{\left( m\right) }- \varvec{\beta }_0 \Vert = O_p\left( n^{-1/2}\right) \), which shows that \(\widetilde{\varvec{\beta }}^{\left( m\right) }\) is \(\sqrt{n}\)-consistent. Next, we derive the asymptotic distribution of \(\widetilde{\varvec{\beta }}^{\left( m\right) }\). Denote \(H^{-1} = \left( 2\,g\left( 0\right) \right) ^{-1}\varvec{\Sigma }_X^{-1} \). Then
where \(W_{i,j}\) represents the stochastic weight.
Now we show that \( \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) = O_P\left( 1\right) \), which is a sum of n i.i.d. terms. Conditional on the ith observation \(\left( Y_i,X_i\right) \), the stochastic weights \(W_{i,j}\), \(j=1,\dots ,m\), are independent. Thus, it follows from \(E\left[ \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m} h_{1,j}\left( Z_i\right) \right] = 0\) that
By Assumption 4 and \(m >t_n\), we have \(\limsup _{n\rightarrow \infty }\Vert \sum _{i=1}^{n} Var\left( \frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \right) \Vert < \infty \). An application of the multivariate Lindeberg-Lévy central limit theorem (Hansen 2022) gives
as \(r,n \rightarrow \infty \), where \(h_{n,m} = \left( m+t_n-1\right) /m\). Furthermore, it can be easily verified that
and \(o_P\left( 1 + \Vert \sum _{i=1}^{n}\frac{2}{\sqrt{n}m}\sum _{j=1}^{m}h_{1,j}\left( Z_i\right) \Vert \right) =o_P(1)\). Moreover, since \(m > t_n\), we have \(0< t_n/m <1\) and \(1<h_{n,m} < 2\). Then, by Slutsky's theorem, we obtain the asymptotic distribution of \(n^{1/2}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \), i.e.,
When \(m <t_n\), the proposed estimator is \(\sqrt{nm/t_n}\)-consistent by the triangle inequality. The asymptotic distribution of \(\sqrt{a_n^{-2}m}\left( \widetilde{\varvec{\beta }}^{\left( m\right) } - \varvec{\beta }_0\right) \) is
where \(d_{n,m} = \left( m+t_n-1\right) /t_n\) and \(1<d_{n,m}<2\). This completes the proof of the corollary. \(\square \)
Proof of Theorem 3.2
The proof of Theorem 3.2 is similar to that of Theorem 3.1. The only difference is in dealing with the second-order moments of the stochastic weights.
Consider the weighted loss function
Let \(B\left( u\right) = Q_n^*\left( \varvec{\beta }_0+b_nu\right) -Q_n^*\left( \varvec{\beta }_0\right) \). Then, we can divide \(B\left( u\right) \) into two parts \(B_1\left( u\right) \) and \(B_2\left( u\right) \), where
For the first part of \(B\left( u\right) \), we can show the asymptotic normality of \(b_n^{-1}U_n^*\) similarly. To this end, let \(U_n^* = \frac{2}{n}\sum _{i=1}^{n}h_1^*\left( Z_i\right) + R_n^*\), where \(h_1^*\left( Z_i\right) =\frac{W_i+1}{2}\left( X_i - E\left[ X\right] \right) \left( F\left( \varepsilon _i\right) -1/2\right) \). We can verify that
and \(Var\left( R_n^*\right) = o\left( \frac{t_n+3}{n}\right) \). Since \(\left\{ h_1^*\left( Z_i\right) \right\} _{i=1}^{n}\) are i.i.d. random vectors, by the multivariate Lindeberg-Lévy central limit theorem, we have
Next, we show that \(B_2\left( u\right) = b_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( b_n^2\right) \). Note that \(B_2\left( u\right) \) can be rewritten as
Similar to the proof of Lemmas A.2 and A.3, we can obtain
by Assumptions 2 and 3. Thus, \(B_2\left( u\right) = b_n^2\,g\left( 0\right) u^T\varvec{\Sigma }_{X}u + o_P\left( b_n^2\right) \) follows from the fact that
By the arguments similar to those used in Theorem 3.1, we can prove that
and
This completes the proof of Theorem 3.2.
The proof of Corollary 3.2 is similar, so we omit the details. \(\square \)
Appendix B: Additional numerical results
To further evaluate the performance of the proposed methods in higher-dimensional settings, we extend our analysis to a scenario where the dimension of \(\varvec{\beta }_0\) is increased to 10, with \(\varvec{\beta }_0=(-4,0.11,-0.9,-0.5,1,1,1,1,1,1)^T\). The performance measures and other simulation settings are the same as in Sect. 4.3. The simulation results for the MSEs are reported in Fig. 7 and Table 3. Figure 7 shows trends similar to those in Fig. 3: the MSEs decrease as r increases, and the L-optimal subsampling method outperforms the uniform subsampling method. Moreover, our method with additive weights performs best, compared with the product-weights method and the perL method. Table 3 shows that the results accord with the findings in the 5-dimensional setting. Furthermore, the ratio of the MSE of Algorithm 4 to that of Algorithm 2 tends to increase as r increases, and the different error distributions have only a slight effect on this ratio.
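For completeness, a minimal sketch of how such a Monte Carlo MSE comparison might be organized is given below. The plain least-squares call is only a hypothetical placeholder for the paper's subsampling algorithms, and the error distribution, sample size, and number of replications are illustrative assumptions rather than the settings of Sect. 4.3.

```python
import numpy as np

rng = np.random.default_rng(2025)
beta0 = np.array([-4, 0.11, -0.9, -0.5, 1, 1, 1, 1, 1, 1], dtype=float)
n, d, n_rep = 10_000, beta0.size, 100

def placeholder_fit(X, y):
    """Stand-in estimator (ordinary least squares); in the actual study this
    would be one of the subsampling-based rank regression algorithms."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

sq_err = np.empty(n_rep)
for rep in range(n_rep):
    X = rng.standard_normal((n, d))
    y = X @ beta0 + rng.standard_normal(n)      # illustrative error distribution
    beta_hat = placeholder_fit(X, y)
    sq_err[rep] = np.sum((beta_hat - beta0) ** 2)

# One common definition of the empirical MSE: average squared L2 error
print("empirical MSE:", sq_err.mean())
```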
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, S., Xia, X. Random perturbation subsampling for rank regression with massive data. Stat Comput 35, 14 (2025). https://doi.org/10.1007/s11222-024-10548-9