Abstract
The advent of the big data era has brought massive datasets to the forefront of academic and industrial discussions. Because of high communication costs and long computation times, traditional statistical methods may struggle to process such data centrally on a single server. A robust distributed system can effectively mitigate communication costs and enhance computational efficiency. However, the classical two-sample hypothesis testing problem has not yet been fully developed within a distributed framework. This paper explores the challenges of performing two-sample mean tests in a distributed framework, especially in the presence of unequal covariance matrices. By distributing samples across nodes, we introduce two distributed test statistics: the blockwise linear two-sample test and the distributed two-sample test. Even when the sample size on each node is smaller than the dimension, the proposed test statistics retain robust statistical properties. Both statistics are designed to reduce communication costs relative to the full-sample statistic. Simulation experiments and empirical analyses further confirm the favorable statistical properties of the proposed test statistics.
Data Availability
No datasets were generated or analysed during the current study.
References
Afek, Y., Giladi, G., Patt-Shamir, B.: Distributed computing with the cloud. Distrib. Comput. 37(1), 1–18 (2024). https://doi.org/10.1007/s00446-024-00460-w
Bai, Z., Saranadasa, H.: Effect of high dimension: by an example of a two sample problem. Stat. Sin. 6, 311–329 (1996)
Bayle, P., Fan, J., Lou, Z.: Communication-efficient distributed estimation and inference for Cox’s model (2023). arXiv preprint arXiv:2302.12111
Bolón-Canedo, V., Sechidis, K., Sánchez-Maroño, N., Alonso-Betanzos, A., Brown, G.: Exploring the consequences of distributed feature selection in DNA microarray data. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1665–1672. IEEE (2017)
Chen, S.X., Peng, L.: Distributed statistical inference for massive data. Ann. Stat. 49(5), 2851–2869 (2021). https://doi.org/10.1214/21-AOS2062
Chen, S., Qin, Y.: A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Stat. 38(2), 808–835 (2010). https://doi.org/10.1214/09-AOS716
Fan, J., Guo, Y., Wang, K.: Communication-efficient accurate statistical estimation. J. Am. Stat. Assoc. 118(542), 1000–1010 (2023). https://doi.org/10.1080/01621459.2021.1969238
Gregory, K.B., Carroll, R.J., Baladandayuthapani, V., Lahiri, S.N.: A two-sample test for equality of means in high dimension. J. Am. Stat. Assoc. 110(510), 837–849 (2015). https://doi.org/10.1080/01621459.2014.934826
Guestrin, C., Bodik, P., Thibaux, R., Paskin, M., Madden, S.: Distributed regression: an efficient framework for modeling sensor network data. In: Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks (IPSN), pp. 1–10. IEEE (2004)
Hotelling, H.: The generalization of Student’s ratio. Ann. Math. Stat. 2(3), 360–378 (1931). https://doi.org/10.1007/978-1-4612-0919-5_4
Hu, J., Bai, Z., Wang, C., Wang, W.: On testing the equality of high dimensional mean vectors with unequal covariance matrices. Ann. Inst. Stat. Math. 69, 365–387 (2017). https://doi.org/10.1007/s10463-015-0543-8
Huang, B., Liu, Y., Peng, L.: Distributed inference for two-sample U-statistics in massive data analysis. Scand. J. Stat. 50(3), 1090–1115 (2023). https://doi.org/10.1111/sjos.12620
Jiang, Y., Wang, X., Wen, C., Jiang, Y., Zhang, H.: Nonparametric two-sample tests of high dimensional mean vectors via random integration. J. Am. Stat. Assoc. 119(545), 701–714 (2024). https://doi.org/10.1080/01621459.2022.2141636
Kong, X., Harrar, S.W.: High-dimensional MANOVA under weak conditions. Statistics 55(2), 321–349 (2021). https://doi.org/10.1080/02331888.2021.1918693
Kumar, N., Sonowal, S.: Email spam detection using machine learning algorithms. In: 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 108–113. IEEE (2020). https://doi.org/10.1109/ICIRCA48905.2020.9183098
Ledoit, O., Wolf, M.: Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Ann. Stat. 30(4), 1081–1102 (2002). https://doi.org/10.1214/aos/1031689018
Li, J., Chen, S.: Two sample tests for high-dimensional covariance matrices. Ann. Stat. 40(2), 908–940 (2012). https://doi.org/10.1214/12-AOS993
Lopes, M., Jacob, L., Wainwright, M.J.: A more powerful two-sample test in high dimensions using random projection. Adv. Neural Inf. Process. Syst. 1(2), 1206–1214 (2011)
Mondal, P.K., Biswas, M., Ghosh, A.K.: On high dimensional two-sample tests based on nearest neighbors. J. Multivar. Anal. 141, 168–178 (2015). https://doi.org/10.1016/j.jmva.2015.07.002
Pan, R., Ren, T., Guo, B., Li, F., Li, G., Wang, H.: A note on distributed quantile regression by pilot sampling and one-step updating. J. Bus. Econ. Stat. 40(4), 1691–1700 (2022). https://doi.org/10.1080/07350015.2021.1961789
Santos, B.D.I., Hortaçsu, A., Wildenbeest, M.R.: Testing models of consumer search using data on web browsing and purchasing behavior. Am. Econ. Rev. 102(6), 2955–2980 (2012). https://doi.org/10.1257/aer.102.6.2955
Scherhag, U., Rathgeb, C., Busch, C.: Performance variation of morphed face image detection algorithms across different datasets. In: 2018 International Workshop on Biometrics and Forensics (IWBF), pp. 1–6. IEEE (2018)
Sharath, R., Nirupam, K., Sowmya, B., Srinivasa, K.: Data analytics to predict the income and economic hierarchy on census data. In: 2016 International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), pp. 249–254. IEEE (2016)
Szabó, B., Vuursteen, L., Van Zanten, H.: Optimal high-dimensional and nonparametric distributed testing under communication constraints. Ann. Stat. 51(3), 909–934 (2023). https://doi.org/10.1214/23-AOS2269
Thulin, M.: A high-dimensional two-sample test for the mean using random subspaces. Comput. Stat. Data Anal. 74, 26–38 (2014). https://doi.org/10.1016/j.csda.2013.12.003
Wang, F., Zhu, Y., Huang, D., Qi, H., Wang, H.: Distributed one-step upgraded estimation for non-uniformly and non-randomly distributed data. Comput. Stat. Data Anal. 162, 107265 (2021). https://doi.org/10.1016/j.csda.2021.107265
Xiaoyue, X., Shi, J., Song, K.: A distributed multiple sample testing for massive data. J. Appl. Stat. 50(3), 555–573 (2023). https://doi.org/10.1080/02664763.2021.1911967
Xu, G., Lin, L., Wei, P., Pan, W.: An adaptive two-sample test for high-dimensional means. Biometrika 103(3), 609–624 (2016). https://doi.org/10.1093/biomet/asw029
Xue, K., Yao, F.: Distribution and correlation-free two-sample test of high-dimensional means. Ann. Stat. 48(3), 1304–1328 (2020). https://doi.org/10.1214/19-AOS1848
Yu, J., Wang, H., Ai, M., Zhang, H.: Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 117(537), 265–276 (2022). https://doi.org/10.1080/01621459.2020.1773832
Zhang, J., Pan, M.: A high-dimension two-sample test for the mean using cluster subspaces. Comput. Stat. Data Anal. 97, 87–97 (2016). https://doi.org/10.1016/j.csda.2015.12.004
Zhang, X., Liu, J., Zhu, Z.: Learning coefficient heterogeneity over networks: a distributed spanning-tree-based fused-lasso regression. J. Am. Stat. Assoc. 119(545), 485–497 (2024). https://doi.org/10.1080/01621459.2022.2126363
Acknowledgements
The authors would like to thank the Editor and three referees for their constructive comments, which have significantly improved the paper. Jiang Hu was partially supported by NSFC Grants No. 12292980, No. 12292982, No. 12171078, and No. 12326606, National Key R&D Program of China No. 2020YFA0714102, and Fundamental Research Funds for the Central Universities, China No. 2412023YQ003.
Author information
Contributions
All authors discussed the results and contributed to the final manuscript. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendices
Appendix A Technical proofs
A.1 Proof of Theorem 1
Proof
On each computing node, we compute the local statistic.
We now verify that the above equation holds.
Similarly,
Substituting \(\dot{\varvec{I}}\), \({\text {tr}}\varvec{ S}_{x}^{(k)}\), and \({\text {tr}}\varvec{S}_{y}^{(k)}\) into Eq. (A2) shows that Eq. (A2) coincides with Eq. (A1). By Chen and Qin (2010), under \(H_1\) and Assumptions 1–4,
where under Assumption 2,
and the o(1) term disappears under \(H_0\).
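The equivalence between the two forms of the local statistic can be checked numerically. The sketch below assumes the local statistic has the Chen and Qin (2010) form: a U-statistic over pairs of distinct observations, rewritten in terms of sample means and traces of the unbiased sample covariance matrices. The function names `cq_ustat` and `cq_trace_form`, and the simulated data, are illustrative choices and do not come from the paper.

```python
import numpy as np


def cq_ustat(x, y):
    """Chen-Qin (2010) statistic in U-statistic form (terms with i == j excluded)."""
    n, m = x.shape[0], y.shape[0]
    gxx, gyy = x @ x.T, y @ y.T              # Gram matrices of inner products
    sum_xx = gxx.sum() - np.trace(gxx)       # sum over i != j of x_i' x_j
    sum_yy = gyy.sum() - np.trace(gyy)       # sum over i != j of y_i' y_j
    sum_xy = (x @ y.T).sum()                 # sum over all i, j of x_i' y_j
    return sum_xx / (n * (n - 1)) + sum_yy / (m * (m - 1)) - 2.0 * sum_xy / (n * m)


def cq_trace_form(x, y):
    """The same quantity computed from means and covariance traces only."""
    n, m = x.shape[0], y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    tr_sx = x.var(axis=0, ddof=1).sum()      # trace of unbiased sample covariance of x
    tr_sy = y.var(axis=0, ddof=1).sum()      # trace of unbiased sample covariance of y
    return diff @ diff - tr_sx / n - tr_sy / m


# Assumed toy setting: node sample sizes (8 and 6) smaller than the dimension (30),
# with unequal covariance matrices for the two samples.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 30))
y = 1.5 * rng.normal(size=(6, 30))
t_u, t_tr = cq_ustat(x, y), cq_trace_form(x, y)
```

The trace form needs only a mean vector and one scalar per sample, which is what makes it attractive when node summaries, rather than raw data, are communicated.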
On each node, we have
We need to obtain:
Using the method of Lagrange multipliers under the constraint \(\sum _{{k}=1}^{K}\omega _{k}=1\), we have
Taking the partial derivatives of \(L_n\left( \omega _1,\dots ,\omega _K;\lambda \right) \) with respect to \(\omega _{k}\), \(k=1,\dots ,K\), and \(\lambda \), respectively, gives:
Under \(H_0\), it follows that
\(\square \)
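As a hedged illustration of the Lagrange-multiplier step, suppose the objective is to minimize the variance of a weighted sum of independent node statistics subject to \(\sum _{k=1}^{K}\omega _{k}=1\). The displayed equations of the proof are not reproduced in this section, so the objective below is an assumption, and \(\sigma _k^2\) is shorthand introduced here for the variance of the k-th node statistic \(T^{(k)}\), not the paper's notation.

```latex
\begin{aligned}
L_n(\omega_1,\dots,\omega_K;\lambda)
  &= \sum_{k=1}^{K}\omega_k^{2}\sigma_k^{2}
     + \lambda\Bigl(\sum_{k=1}^{K}\omega_k - 1\Bigr),\\
\frac{\partial L_n}{\partial \omega_k}
  &= 2\omega_k\sigma_k^{2} + \lambda = 0
  \;\Longrightarrow\; \omega_k = -\frac{\lambda}{2\sigma_k^{2}},\\
\frac{\partial L_n}{\partial \lambda}
  &= \sum_{k=1}^{K}\omega_k - 1 = 0
  \;\Longrightarrow\; \lambda = -\frac{2}{\sum_{j=1}^{K}\sigma_j^{-2}},\\
\omega_k^{*}
  &= \frac{\sigma_k^{-2}}{\sum_{j=1}^{K}\sigma_j^{-2}},\qquad
\operatorname{Var}\Bigl(\sum_{k=1}^{K}\omega_k^{*}T^{(k)}\Bigr)
  = \frac{1}{\sum_{j=1}^{K}\sigma_j^{-2}}.
\end{aligned}
```

Under this assumed objective the optimal weights are proportional to the inverse variances, and they reduce to \(1/K\) when all nodes have the same variance.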
A.2 Proof of Theorem 2
Proof
Because it contains unknown parameters, we estimate it:
By Lemma 2 and the continuous mapping theorem:
\(\square \)
A.3 Proof of Theorem 3
Proof
On every node, we have:
The nodes are mutually independent. Then, by Theorem 2, as \( p\rightarrow \infty \) and \(M_k\rightarrow \infty \),
Since \(\sum _{k=1}^{K}\omega _{k}^*=1\), it follows that, as \( p\rightarrow \infty \) and \(M_k\rightarrow \infty \) for \(k\in \left\{ 1,\dots ,K\right\} \),
i.e.
where
Under \(H_1\) and Assumptions 1–4, \({\text {Var}}\left( T_{\textrm{dist1}}^{(k)}\right) =\sigma _\textrm{dist1}^{(k)2}\left\{ 1+o(1)\right\} ,\) and the o(1) term disappears under \(H_0\). \(\square \)
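The aggregation argument in this proof (independent node statistics, each asymptotically normal, combined with weights that sum to one) can be sketched in code. The helper `combine_node_statistics` is hypothetical, the inverse-variance weights are an assumption standing in for \(\omega _k^*\), and the input numbers are made up for illustration. The upper-tail p-value assumes the test rejects for large values of the statistic.

```python
import math


def combine_node_statistics(stats, variances):
    """Inverse-variance weighted combination of independent node statistics.

    Hypothetical helper: `stats` holds one statistic per node and `variances`
    the (estimated) variance of each. Returns the weights, the combined
    statistic, its variance, and a z-score against a null mean of zero.
    """
    if not stats or len(stats) != len(variances):
        raise ValueError("need exactly one variance per node statistic")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]            # weights sum to one
    combined = sum(w * t for w, t in zip(weights, stats))
    # Independence across nodes: the variance of the sum is the sum of variances.
    combined_var = sum(w * w * v for w, v in zip(weights, variances))
    z = combined / math.sqrt(combined_var)
    return weights, combined, combined_var, z


# Made-up node statistics and variances for three nodes.
weights, t_comb, var_comb, z = combine_node_statistics(
    stats=[0.9, 1.3, 0.4], variances=[1.0, 2.0, 4.0]
)
# One-sided upper-tail p-value under the standard normal approximation.
p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
```

Only one scalar statistic and one variance estimate per node reach the central machine, so the communication cost does not grow with the dimension.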
A.4 Proof of Theorem 4
Proof
We first calculate \(\mathbb {E}\left( T_{\textrm{dist2}}\right) \) and \({\text {Var}}\left( T_{\textrm{dist2}}\right) \).
where
Here
Thus
Let
Then
Since the samples \(\varvec{\mathcal {X}}_{n}\) and \(\varvec{\mathcal {Y}}_{m}\) are independent, \({\text {Cov}}\left( P_1,P_4\right) =0\), \({\text {Cov}}\left( P_1,P_5\right) =0\), \({\text {Cov}}\left( P_2,P_4\right) =0\), and \({\text {Cov}}\left( P_2,P_5\right) =0\). Moreover, because the samples on different nodes are independent, we have the following covariance results.
In summary,
Thus, under \(H_0\),
Under \(H_1\),
where under Assumption 2,
Asymptotic normality of \(T_{\textrm{dist2}}\). Let
We know that, as \(n\rightarrow \infty \), \(\varvec{ S}_{x}{\mathop {\rightarrow }\limits ^{p}}\varvec{\Sigma }_{x}\), and, as \(n_k\rightarrow \infty \), \(\varvec{S}_{x}^{(k)}{\mathop {\rightarrow }\limits ^{p}}\varvec{\Sigma }_{x}\). Then, as \(n_k\rightarrow \infty \) with \(n=\sum _{k=1}^{K}n_k\),
Similarly, as \(m_\ell \rightarrow \infty \), \(m=\sum _{\ell =1}^{L}m_\ell \),
So
We know that \(\dfrac{T_{\textrm{cq}}-\Vert \varvec{\mu }_{x}-\varvec{\mu }_{y}\Vert ^{2}}{\sqrt{{\text {Var}}\left( T_{\textrm{cq}}\right) }} {\mathop {\rightarrow }\limits ^{d}} \mathcal {N}(0,1).\) Finally, by Slutsky's theorem, we have
\(\square \)
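The argument above compares a statistic built from node-level quantities with the full-sample statistic \(T_{\textrm{cq}}\), using the fact that node-level and full-sample covariance estimates converge to the same limit. The sketch below illustrates that idea with a hypothetical stand-in, not the paper's definition of \(T_{\textrm{dist2}}\) (which is not reproduced in this section): each node sends its sample size, mean vector, and the trace of its sample covariance, and the center uses the exact pooled mean together with a size-weighted average of the node traces. All function names and the simulated data are assumptions.

```python
import numpy as np


def node_summary(block):
    """What one node sends: sample size, mean vector, trace of sample covariance."""
    return block.shape[0], block.mean(axis=0), block.var(axis=0, ddof=1).sum()


def aggregate(summaries):
    """Central step: exact pooled mean and size-weighted average of node traces."""
    n = sum(size for size, _, _ in summaries)
    mean = sum(size * mu for size, mu, _ in summaries) / n
    trace = sum(size * tr for size, _, tr in summaries) / n
    return n, mean, trace


def summary_based_stat(x_blocks, y_blocks):
    """Hypothetical distributed statistic computed from node summaries only."""
    n, mean_x, tr_x = aggregate([node_summary(b) for b in x_blocks])
    m, mean_y, tr_y = aggregate([node_summary(b) for b in y_blocks])
    diff = mean_x - mean_y
    return float(diff @ diff - tr_x / n - tr_y / m)


def full_sample_stat(x, y):
    """Full-sample statistic in trace form, assuming the Chen and Qin (2010) form."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    tr_x = x.var(axis=0, ddof=1).sum()
    tr_y = y.var(axis=0, ddof=1).sum()
    return float(diff @ diff - tr_x / x.shape[0] - tr_y / y.shape[0])


# Simulated data under equal means and unequal covariances, split over 4 nodes each.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 50))
y = 2.0 * rng.normal(size=(160, 50))
x_blocks = np.array_split(x, 4)
y_blocks = np.array_split(y, 4)
t_summary = summary_based_stat(x_blocks, y_blocks)
t_full = full_sample_stat(x, y)
pooled_mean_x = aggregate([node_summary(b) for b in x_blocks])[1]
```

In this sketch the pooled mean is recovered exactly from the node means, so the two statistics differ only through the trace terms, and that difference is small once each node holds a moderate number of observations.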
A.5 Proof of Theorem 5
Proof
By Lemma 2 in Hu et al. (2017), under Model II and Assumptions 1, 2, 5, 6, as \(p\rightarrow \infty \), \(n_k\rightarrow \infty \) and \(m_\ell \rightarrow \infty \),
and
Then, as \(p\rightarrow \infty \), \(n_k\rightarrow \infty \) and \(m_\ell \rightarrow \infty \),
\(\square \)
Appendix B Supplementary figures
B.1 Supplementary figures on the impact of dimension
See Figs. 14, 15, 16, 17, 18, 19, 20 and 21.
B.2 Supplementary figures on the impact of the number of nodes
About this article
Cite this article
Yan, L., Hu, J. & Wu, L. Distributed hypothesis testing for large dimensional two-sample mean vectors. Stat Comput 34, 187 (2024). https://doi.org/10.1007/s11222-024-10489-3
DOI: https://doi.org/10.1007/s11222-024-10489-3