Two-sample test in high dimensions through random selection
Introduction
Suppose $X_{i1}, \dots, X_{in_i}$, $i = 1, 2$, are two random samples drawn independently from a $p$-dimensional distribution $F_i$, for $i = 1, 2$ respectively. Let $\mu_i$ and $\Sigma_i$ be the mean and the covariance of $F_i$. In this article, we consider the problem of testing the hypothesis $$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \neq \mu_2. \qquad (1)$$ This is a fundamental problem in statistical inference (Lehmann and Romano, 2005). It is often encountered in modern genetic research, geological imaging, signal processing, astrometry and finance. One such example is testing whether two gene sets, or pathways, have equal expression levels under two different experimental conditions. In classic multivariate statistical analysis, Hotelling's $T^2$ test is perhaps the most popular omnibus test. It is defined as $$T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{X}_1 - \bar{X}_2)^\top S_n^{-1} (\bar{X}_1 - \bar{X}_2), \qquad (2)$$ where $\bar{X}_1$ and $\bar{X}_2$ are the sample means, and $S_n = \frac{1}{n_1 + n_2 - 2} \sum_{i=1}^{2} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)^\top$ is the pooled sample covariance matrix (Anderson, 2003). However, the classic Hotelling's test suffers from a singularity problem: the sample covariance matrix is not invertible when the dimension exceeds the sample size. Moreover, the power of Hotelling's test can be adversely affected, even when the dimension is smaller than the sample size, if the sample covariance matrix is nearly singular (Bai and Saranadasa, 1996).
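The statistic in (2) can be sketched directly from the definitions above. This is an illustrative numpy implementation (the function name `hotelling_t2` is ours); note how the linear solve fails in exactly the singular regime discussed above:

```python
import numpy as np

def hotelling_t2(X1, X2):
    """Two-sample Hotelling's T^2 with the pooled sample covariance S_n.

    X1, X2: (n1, p) and (n2, p) data matrices, one observation per row.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled sample covariance with divisor n1 + n2 - 2, as in (2).
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    # np.linalg.solve fails when S is singular -- the p > n1 + n2 - 2 case
    # that motivates the alternatives surveyed below.
    return (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
```

When $p > n_1 + n_2 - 2$, the pooled covariance is rank deficient and the solve step breaks down, which is the singularity problem the subsequent literature addresses.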
To overcome the singularity problem in Hotelling's test, many studies have investigated the multivariate ("fixed $p$") and high-dimensional ("divergent $p$") cases. Roughly speaking, these tests fall into two classes. In the first class, nonparametric tests are proposed in which $S_n^{-1}$ in (2) is replaced by a known quantity or another estimate. Bai and Saranadasa (1996) and Zhang and Xu (2009) construct test statistics by removing the sample covariance matrix $S_n$ from $T^2$, yielding statistics based on the squared Euclidean norm $\|\bar{X}_1 - \bar{X}_2\|^2$. Srivastava and Du (2008), Park and Ayyala (2013), and Feng et al. (2015) propose test statistics that replace $S_n^{-1}$ with the inverse of the diagonal of $S_n$. In these tests, the asymptotic null distributions are derived under conditions that restrict the growth of the dimension relative to the sample size; thus, these works all require that the dimension not be too large relative to the sample size. To allow simultaneous testing of ultra-high-dimensional data, Chen and Qin (2010) propose a U-statistic constructed by removing the self cross-product terms $\sum_{j} X_{1j}^\top X_{1j}$ and $\sum_{j} X_{2j}^\top X_{2j}$ from $\|\bar{X}_1 - \bar{X}_2\|^2$. This ensures that the asymptotic null distribution is standard normal, regardless of the relationship between $p$ and the sample sizes, when both the sample sizes and dimensions of the two random samples diverge to infinity. As stated above, all the aforementioned tests are essentially versions of Hotelling's test with diagonal estimators of $\Sigma$. Because they ignore the information in the covariance structure, these tests sacrifice power when the variables are correlated. In addition, all these tests have unsatisfactory power when the multivariate or high-dimensional distribution is heavy-tailed, and they are very sensitive to outlying observations (Wang et al., 2015).
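The term-removal idea of Chen and Qin (2010) can be made concrete. The sketch below (the function name `cq_numerator` is ours, and the variance normalization that makes their full statistic asymptotically standard normal is omitted) computes the unbiased U-statistic estimate of $\|\mu_1 - \mu_2\|^2$:

```python
import numpy as np

def cq_numerator(X1, X2):
    """U-statistic for ||mu1 - mu2||^2 in the spirit of Chen and Qin (2010):
    within-sample sums with the j = j' self cross-product terms removed,
    minus twice the between-sample average inner product."""
    n1, n2 = X1.shape[0], X2.shape[0]
    G1, G2 = X1 @ X1.T, X2 @ X2.T
    # Off-diagonal sums only: the diagonal (self cross-product) terms,
    # which bias the plain ||X1bar - X2bar||^2, are subtracted out.
    term1 = (G1.sum() - np.trace(G1)) / (n1 * (n1 - 1))
    term2 = (G2.sum() - np.trace(G2)) / (n2 * (n2 - 1))
    term3 = 2.0 * (X1 @ X2.T).sum() / (n1 * n2)
    return term1 + term2 - term3
```

Because the self terms are excluded, the expectation is exactly $\|\mu_1 - \mu_2\|^2$ for any $p$, which is why no constraint between $p$ and $n$ is needed for centering.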
The second class of two-sample tests operates through random projections or random subspaces; examples include Lopes et al. (2011), Srivastava et al. (2016), Thulin (2014), and Zhang and Pan (2016). These methods apply Hotelling-type statistics to low-dimensional random linear combinations of the original variables, or to low-dimensional "clusters" of the original variables. In this way, the covariance structure can be used more effectively when testing with projected samples in a lower-dimensional space. However, the asymptotic null distributions of Lopes et al. (2011) and Srivastava et al. (2016) are derived under the assumption of data normality. In addition, Thulin (2014) observed that the $p$-values of Lopes et al. (2011) obtained from the asymptotic null distribution are highly conservative for finite sample sizes. Furthermore, the asymptotic null distributions of Thulin (2014) and Zhang and Pan (2016) are not tractable: random permutation resampling must be used to determine critical values, which increases the computational cost significantly. Consequently, these tests are typically regarded as computationally prohibitive, especially for high-dimensional or large-scale problems. Additionally, the power of Zhang and Pan (2016) relies heavily on the clustering method selected, and choosing the optimal clustering method to enhance power, especially in high dimensions, is not straightforward. Finally, Jurečková and Kalina (2012), Marozzi (2015, 2016), and Marozzi et al. (2020) propose two-sample tests based on interpoint distances. These tests require neither normality nor special structure on the covariance matrix, and they involve no tuning parameters. The authors show that interpoint-distance-based tests are very effective in practice for analyzing high-dimensional data.
Notably, most of these tests are combined tests, and a permutation approach is used to determine critical values. Pesarin and Salmaso (2010) summarize a very general framework for such combined tests.
We propose a new two-sample test of (1) in a high-dimensional setting. The test involves randomly selecting a low-dimensional subspace of the $p$-dimensional samples, projecting the subspace onto one-dimensional spaces, and constructing the test statistic from the adjusted squared Euclidean distance. The proposed method has several distinct advantages. First, compared with the first class of studies discussed above, the random selection and projection take the multivariate dependence structure into account, which makes the test more efficient when the variables are dependent. Second, unlike the second class of tests, which rely on a normality assumption or on resampling for critical values, the proposed test does not require normality, and its asymptotic null distribution is standard normal regardless of the relationship between the dimension and the sample sizes, under mild conditions. It is therefore much more appealing in the many real applications where the data deviate from the normal distribution. Further, no resampling procedure is required to approximate the asymptotic null distribution, so the proposed test remains computationally feasible for high- or ultra-high-dimensional data sets. Last but not least, the tests of Chen and Qin (2010) from the first class and Lopes et al. (2011) from the second class are commonly used benchmarks in the literature. Theoretically, we prove in Section 2.3 that the proposed method outperforms these two tests in terms of asymptotic relative efficiency under mild conditions.
The rest of this paper is organized as follows. We give the explicit form of the proposed test statistic and thoroughly investigate its asymptotic behavior in Section 2. Extensive simulation studies are conducted in Section 3 to demonstrate the power performance of the proposed test and to compare it with existing tests. The empirical studies indicate that our test outperforms competing tests in the parameter regimes anticipated by our theoretical results. Finally, we conclude with a brief discussion in Section 4. All technical details are relegated to the Appendix.
The test procedure
In this section, we develop a testing procedure for problem (1) in a high-dimensional setting. First, a single random vector is generated whose components are drawn independently from a Bernoulli distribution governed by a positive integer parameter; this vector is used to randomly select a low-dimensional subspace of the high-dimensional samples. Then, we project the subspace onto one-dimensional spaces. Second, we
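Although the snippet above is truncated, the selection-and-projection step it describes can be sketched. In this sketch the function name `random_select_project`, the Bernoulli success probability $k/p$, and the equal-weight projection direction are all our assumptions, since the paper's exact parameterization is not recoverable from the snippet:

```python
import numpy as np

def random_select_project(X1, X2, k, rng):
    """One random-selection-and-projection step (illustrative sketch).

    Draws i.i.d. Bernoulli(k/p) indicators to pick a coordinate subset,
    then projects the selected coordinates onto one dimension.
    """
    p = X1.shape[1]
    mask = rng.random(p) < k / p          # Bernoulli selection indicators
    if not mask.any():
        mask[rng.integers(p)] = True      # guard against an empty selection
    # Placeholder projection direction: equal weights on the selected
    # coordinates; the paper's actual direction is not in this snippet.
    w = np.full(mask.sum(), 1.0 / np.sqrt(mask.sum()))
    return X1[:, mask] @ w, X2[:, mask] @ w
```

The output is a pair of one-dimensional projected samples, to which a low-dimensional two-sample comparison can then be applied.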
Numerical studies
Throughout, we let $N_p(\mu, \Sigma)$ denote the multivariate normal distribution and $t_p(\mu, \Sigma, \nu)$ the multivariate $t$ distribution with $\nu$ degrees of freedom, where $\mu$ is a location vector and $\Sigma$ is a shape matrix. Let $U(a, b)$ be the uniform distribution on the interval $(a, b)$, $\mathbf{0}_p$ a $p$-vector of zeros, and $I_p$ the $p \times p$ identity matrix. We further set the significance level to 0.05 and use Monte Carlo simulations to estimate the empirical sizes and powers. We consider here the number of Monte Carlo simulations to be
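The size calibration described above follows a generic Monte Carlo recipe: simulate under the null, record the rejection fraction at level 0.05. A minimal sketch, in which both function names are ours and the two-sided $z$-test on one coordinate is only a stand-in for the paper's statistic:

```python
import numpy as np
from math import erfc, sqrt

def z_pvalue(X1, X2):
    # Stand-in two-sided z-test on the first coordinate (illustrative only).
    a, b = X1[:, 0], X2[:, 0]
    se = sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    # erfc(|z| / sqrt(2)) is the two-sided standard normal p-value.
    return erfc(abs(a.mean() - b.mean()) / se / sqrt(2))

def empirical_size(test, n_rep, alpha, rng):
    """Fraction of null replications (equal means) rejected at level alpha;
    `test` maps a pair of samples to a p-value."""
    rejections = 0
    for _ in range(n_rep):
        X1 = rng.standard_normal((30, 10))   # null: both samples N(0, I)
        X2 = rng.standard_normal((30, 10))
        rejections += test(X1, X2) < alpha
    return rejections / n_rep
```

An empirical size close to the nominal 0.05 indicates that the asymptotic null approximation is adequate at the given sample sizes; empirical power is estimated the same way with unequal means.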
Discussion
In this paper, we propose an efficient nonparametric test that projects large-scale data onto low-dimensional spaces to cope with the issue of high dimensionality. We advocate random selection because it achieves dimension reduction while simultaneously preserving as much information as possible from the two random samples. Random selection and projection make fuller use of the dependence structure, which allows the test to handle dependent variables. In addition, the asymptotic null
Acknowledgments
This work was supported by the Beijing Natural Science Foundation (Z20001, Z19J00009), the Research Funds for the Major Innovation Platform of Public Health, Disease Control and Prevention, Renmin University of China, and National Natural Science Foundation of China (11971478, 11731011, 11931014).
References (28)
- Park and Ayyala (2013). A test for the mean vector in large dimension and small samples. J. Statist. Plann. Inference.
- Srivastava and Du (2008). A test for the mean vector with fewer observations than the dimension. J. Multivariate Anal.
- Thulin (2014). A high-dimensional two-sample test for the mean using random subspaces. Comput. Statist. Data Anal.
- Zhang and Pan (2016). A high-dimension two-sample test for the mean using cluster subspaces. Comput. Statist. Data Anal.
- Anderson (2003). An Introduction to Multivariate Statistical Analysis.
- et al. (2019). Global trend of breast cancer mortality rate: A 25-year study. Asian Pac. J. Cancer Prev.
- Bai and Saranadasa (1996). Effect of high dimension: By an example of a two sample problem. Statist. Sinica.
- Chen and Qin (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist.
- et al. (2010). Tests for high-dimensional covariance matrices. J. Amer. Statist. Assoc.
- Feng et al. (2015). Two-sample Behrens-Fisher problem for high-dimensional data. Statist. Sinica.