Two-sample test in high dimensions through random selection

https://doi.org/10.1016/j.csda.2021.107218

Abstract

Testing the equality of two-sample means for high-dimensional distributions is a fundamental problem in statistics. In the past two decades, many efforts have been devoted to comparing the mean vectors of two populations. Many existing tests rely on naive diagonal or trace estimators of the covariance matrix, ignoring the dependence structure between variables. To make more use of the dependence structure, a new nonparametric test based on random selections is proposed for testing the population mean vectors of nonnormal high-dimensional multivariate data. The test makes more efficient use of the covariance structure when dealing with dependent variables. The asymptotic null distribution of the proposed test is standard normal, regardless of the parent distributions of the random samples and the relations between data dimensions and sample sizes. Extensive simulations show that the power performance of the proposed test is encouraging compared with some existing methods.

Introduction

Suppose $\{x_{ij}, j = 1, \ldots, n_i\}$ are two random samples drawn independently from a $p$-dimensional distribution $F_i$, for $i = 1, 2$ respectively. Let $\mu_i$ and $\Sigma_i$ be the mean and the covariance of $F_i$. In this article, we consider the problem of testing the hypothesis

$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \neq \mu_2. \tag{1}$$

This is a fundamental problem in statistical inference (Lehmann and Romano, 2005). It is often encountered in modern genetic research, geological imaging, signal processing, astrometry and finance. One such example is to test whether two gene sets, or pathways, have equal expression levels under two different experimental conditions. In classic multivariate statistical analysis, Hotelling’s $T^2$ test is perhaps one of the most popular omnibus tests. It is defined as

$$T^2 \stackrel{\mathrm{def}}{=} \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^{T} \hat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_2), \tag{2}$$

where $\bar{x}_1 = n_1^{-1} \sum_{j=1}^{n_1} x_{1j}$ and $\bar{x}_2 = n_2^{-1} \sum_{j=1}^{n_2} x_{2j}$ are the sample means, and $\hat{\Sigma} = (n-2)^{-1} \sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)(x_{1j} - \bar{x}_1)^{T} + (n-2)^{-1} \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)(x_{2j} - \bar{x}_2)^{T}$ is the sample covariance matrix with $n = n_1 + n_2$ (Anderson, 2003). However, the classic Hotelling’s $T^2$ test suffers from a singularity problem for the inverse of the sample covariance matrix when the sample dimension is larger than the sample size. Moreover, the power of Hotelling’s $T^2$ test can be adversely affected, even when the sample dimension is smaller than the sample size, if the sample covariance matrix is nearly singular (Bai and Saranadasa, 1996).
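To make the definition of Hotelling’s statistic and its singularity problem concrete, the following is a minimal NumPy sketch. The function name `hotelling_t2`, the simulated data, and the seed are illustrative assumptions and do not come from the paper; samples are assumed to be stored with one observation per row.

```python
import numpy as np


def hotelling_t2(x1, x2):
    """Hotelling's T^2 for samples x1 (n1 x p) and x2 (n2 x p); rows are observations."""
    n1, n2 = x1.shape[0], x2.shape[0]
    n = n1 + n2
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    c1 = x1 - x1.mean(axis=0)
    c2 = x2 - x2.mean(axis=0)
    # Pooled sample covariance with divisor n - 2.
    sigma_hat = (c1.T @ c1 + c2.T @ c2) / (n - 2)
    # solve() avoids forming the inverse explicitly; it requires sigma_hat to be nonsingular.
    return n1 * n2 / n * diff @ np.linalg.solve(sigma_hat, diff)


rng = np.random.default_rng(0)  # arbitrary seed for the illustration

# Low-dimensional case (p = 5, n = 70): the statistic is well defined.
x1 = rng.normal(size=(30, 5))
x2 = rng.normal(size=(40, 5))
t2 = hotelling_t2(x1, x2)

# High-dimensional case (p = 50, n = 22): the pooled covariance has rank
# at most n - 2 = 20 < p, so it cannot be inverted.
y1 = rng.normal(size=(10, 50))
y2 = rng.normal(size=(12, 50))
centered = np.vstack([y1 - y1.mean(axis=0), y2 - y2.mean(axis=0)])
rank = np.linalg.matrix_rank(centered.T @ centered / (22 - 2))
```

In the second case the rank deficiency is what prevents the statistic from being computed, which is the singularity problem the text describes.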

To overcome the singularity problem in Hotelling’s $T^2$ test, many studies have investigated the multivariate (“fixed $p$”) and high-dimensional (“divergent $p$”) cases. Roughly speaking, these tests can be categorized into two classes. In the first class, most scholars propose nonparametric tests in which $\hat{\Sigma}$ in (2) is replaced by a known quantity or another estimate. Bai and Saranadasa (1996) and Zhang and Xu (2009) construct a test statistic by removing the inverse sample covariance matrix, $\hat{\Sigma}^{-1}$, from $T^2$, so that the statistic is based on the squared Euclidean norm $\|\bar{x}_1 - \bar{x}_2\|^2$. Srivastava and Du (2008), Park and Ayyala (2013), and Feng et al. (2015) propose test statistics that replace $\hat{\Sigma}^{-1}$ with the inverse of the diagonal of $\hat{\Sigma}$. In these tests, asymptotic null distributions are derived when $p/n \to c \in (0,1)$, $n = O(p^{\gamma})$ for some $1/2 < \gamma \le 1$, or $p = o(n^3)$. Thus, these works all require that the dimension not be too large relative to the sample size. To allow for simultaneous testing of ultra-high-dimensional data, Chen and Qin (2010) propose a U-statistic that is constructed by removing the cross-product terms $\sum_{j=1}^{n_1} x_{1j}^{T} x_{1j}$ and $\sum_{j=1}^{n_2} x_{2j}^{T} x_{2j}$ in $\|\bar{x}_1 - \bar{x}_2\|^2$. This ensures the asymptotic null distribution is standard normal, regardless of the relationship between $p$ and $n$, when both the sample sizes and dimensions of the two random samples diverge to infinity. As stated above, all the aforementioned tests are essentially based on versions of Hotelling’s $T^2$ test with diagonal estimators of $\Sigma$. These tests sacrifice power when the variables are correlated because they ignore the information in the covariance structure. In addition, all these tests have unsatisfactory power performance when the multivariate or high-dimensional distribution is heavy-tailed, and they are very sensitive to outlying observations (Wang et al., 2015).
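The two covariance-free quantities mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and `chen_qin_numerator` computes only the unstandardized U-statistic obtained by dropping the terms $x_{ij}^{T} x_{ij}$; the variance estimator needed to form the standardized test of Chen and Qin (2010) is not reproduced here.

```python
import numpy as np


def squared_mean_distance(x1, x2):
    """Squared Euclidean norm of the difference of sample means."""
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    return float(diff @ diff)


def chen_qin_numerator(x1, x2):
    """Unstandardized U-statistic: the squared mean distance without the x_ij^T x_ij terms."""
    n1, n2 = x1.shape[0], x2.shape[0]
    s1, s2 = x1.sum(axis=0), x2.sum(axis=0)
    # Identity used: sum_{i != j} x_i^T x_j = ||sum_i x_i||^2 - sum_i ||x_i||^2.
    within1 = (s1 @ s1 - np.sum(x1 * x1)) / (n1 * (n1 - 1))
    within2 = (s2 @ s2 - np.sum(x2 * x2)) / (n2 * (n2 - 1))
    between = 2.0 * (s1 @ s2) / (n1 * n2)
    return float(within1 + within2 - between)


rng = np.random.default_rng(1)  # arbitrary seed for the illustration
a = rng.normal(size=(8, 100))   # p = 100 exceeds both sample sizes
b = rng.normal(size=(9, 100))
bs_value = squared_mean_distance(a, b)
cq_value = chen_qin_numerator(a, b)
```

Neither quantity requires inverting a covariance matrix, so both remain computable when $p$ exceeds $n$; the price, as noted in the text, is that the dependence between variables is ignored.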

The second class of two-sample tests is based on random projections or random subspaces. Examples include Lopes et al. (2011), Srivastava et al. (2016), Thulin (2014), and Zhang and Pan (2016). These methods apply Hotelling’s $T^2$ statistics, computed on low-dimensional random linear combinations of the original random variables or low-dimensional “clusters” of the original random variables. In this regard, the covariance structure may be used more effectively when testing with projected samples in a lower dimensional space. However, the asymptotic null distributions of Lopes et al. (2011) and Srivastava et al. (2016) are derived under the assumption of data normality. In addition, it has been observed in Thulin (2014) that the p-values of Lopes et al. (2011) obtained from the asymptotic null distribution are highly conservative for finite sample sizes. Furthermore, the asymptotic null distributions of Thulin (2014) and Zhang and Pan (2016) are not tractable. To decide critical values, random permutation resampling must be used, which increases the computational complexity significantly. Consequently, these tests are typically regarded as computationally prohibitive, especially for high-dimensional or large-scale problems. Additionally, the power performance of Zhang and Pan (2016) relies heavily on the clustering method selected, and selecting the optimal clustering method to enhance power performance, especially in high dimensions, is not straightforward. In addition, Jurečková and Kalina (2012), Marozzi (2015, 2016), and Marozzi et al. (2020) propose two-sample tests based on interpoint distance. These tests require neither the normality assumption nor special structures of the covariance matrix, and they do not need any test parameters to be set. They show that interpoint distance based tests are very effective in practice when analyzing high-dimensional data. Notably, most of these tests are combined tests, and a permutation approach is used to decide critical values. Furthermore, Pesarin and Salmaso (2010) summarize a very general framework for these combined tests.
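To illustrate why permutation-based critical values are costly, here is a generic sketch of a permutation p-value for an arbitrary two-sample statistic. It is not the procedure of any of the cited papers; the function names, the default `n_perm=500`, and the simulated data are assumptions made for the example.

```python
import numpy as np


def permutation_pvalue(x1, x2, statistic, n_perm=500, seed=0):
    """Permutation p-value for a two-sample statistic where large values indicate a difference."""
    rng = np.random.default_rng(seed)
    n1 = x1.shape[0]
    pooled = np.vstack([x1, x2])
    observed = statistic(x1, x2)
    count = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the statistic.
        idx = rng.permutation(pooled.shape[0])
        perm_stat = statistic(pooled[idx[:n1]], pooled[idx[n1:]])
        count += int(perm_stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)


def mean_distance(a, b):
    """Squared Euclidean distance between sample means (illustrative statistic)."""
    d = a.mean(axis=0) - b.mean(axis=0)
    return float(d @ d)


rng = np.random.default_rng(2)  # arbitrary seed for the illustration
x1 = rng.normal(size=(20, 10)) + 3.0  # first sample shifted by 3 in every coordinate
x2 = rng.normal(size=(20, 10))
pval = permutation_pvalue(x1, x2, mean_distance)
```

The statistic is recomputed `n_perm` times, so the total cost is roughly `n_perm` times that of a single evaluation, which becomes burdensome when each evaluation itself involves high-dimensional computations.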

We propose a new two-sample test of $\mu_1 = \mu_2$ in a high-dimensional setting. This test involves randomly selecting a low-dimensional subspace of $p$-dimensional samples, projecting the subspace onto one-dimensional spaces, and constructing the test statistic with the adjusted squared Euclidean distance. The proposed method has several distinct advantages. First, compared with the first class of studies discussed above, the random selection and projection tests take the multivariate dependence structure into account, which can make the test more efficient in dealing with dependent variables. Second, unlike the tests in the second class, which rely on a normality assumption or on resampling to obtain critical values, the proposed test does not require the normality assumption, and its asymptotic null distribution is standard normal, regardless of the relationship between $p$ and $n$ when $(p, n) \to \infty$. Therefore, it is much more appealing in the many real applications where the data deviate from the normal distribution. Further, no resampling procedure is required to approximate the asymptotic null distribution, which makes the proposed test computationally feasible for high- or ultra-high-dimensional data sets. Last but not least, the tests of Chen and Qin (2010) from the first class and Lopes et al. (2011) from the second class are commonly compared methods in the literature. Theoretically, we prove in Section 2.3 that the proposed method outperforms these two tests in terms of asymptotic relative efficiency under mild conditions.
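The selection-and-projection step can be sketched as follows. This covers only the random 0/1 selection vectors and the resulting one-dimensional projections; it does not implement the adjusted test statistic, whose form is not given in this excerpt. The function name, the selection probability `prob`, and the number of draws `k` are hypothetical inputs chosen for illustration.

```python
import numpy as np


def random_selection_projections(x1, x2, k, prob, seed=0):
    """Draw k random 0/1 selection vectors and project both samples onto each of them.

    x1, x2 : arrays of shape (n1, p) and (n2, p), rows are observations.
    prob   : assumed Bernoulli selection probability for each coordinate.
    """
    rng = np.random.default_rng(seed)
    p = x1.shape[1]
    # alphas[r, t] = 1 means coordinate t is selected in the r-th draw.
    alphas = rng.binomial(1, prob, size=(k, p))
    # Column r of z1 holds alpha_r^T x_{1j}, the sum of the selected coordinates.
    z1 = x1 @ alphas.T
    z2 = x2 @ alphas.T
    return alphas, z1, z2


rng = np.random.default_rng(3)  # arbitrary seed for the illustration
x1 = rng.normal(size=(15, 200))
x2 = rng.normal(size=(18, 200))
alphas, z1, z2 = random_selection_projections(x1, x2, k=5, prob=0.1)
# Difference of projected sample means for each of the k draws.
projected_mean_diff = z1.mean(axis=0) - z2.mean(axis=0)
```

Because each projection sums the selected coordinates, its variance depends on the covariances among those coordinates, which is one way to see how selection and projection bring the dependence structure into play.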

The rest of this paper is organized as follows. In Section 2, we give the explicit form of the proposed test statistic and thoroughly investigate its asymptotic behavior. Extensive simulation studies are conducted in Section 3 to demonstrate the power performance of our proposed test and to compare it with many existing tests. The empirical studies indicate that our test outperforms competing tests in the parameter regimes anticipated by our theoretical results. Finally, we conclude the paper with a brief discussion in Section 4. All technical details are relegated to the Appendix.

Section snippets

The test procedure

In this section, we develop a testing procedure for problem (1) in a high-dimensional setting. First, a single random vector, $\alpha_r = (\alpha_{r1}, \ldots, \alpha_{rp})^{T}$, is generated, where the $\alpha_{rt}$'s are drawn independently from the Bernoulli distribution $B(1, k_1/p)$ with positive integer $k_1 \le p$ for $t = 1, \ldots, p$ and $r = 1, \ldots, k$. This is used to randomly select a low-dimensional subspace of the high-dimensional samples: $x_1 = (x_{11}, \ldots, x_{1n_1})$ and $x_2 = (x_{21}, \ldots, x_{2n_2})$. Then, we project the subspace onto one-dimensional spaces $\alpha_r^{T} x_1$ and $\alpha_r^{T} x_2$. Second, we

Numerical studies

Throughout, we let $N(\mu, \Sigma)$ be the multivariate normal distribution, and $t_d(\mu, \Sigma)$ be the multivariate $t$ distribution with $d$ degrees of freedom, where $\mu$ is a location vector and $\Sigma$ is a shape matrix. Let $U(a, b)$ be the uniform distribution defined on the interval $(a, b)$, $0_{p \times 1}$ be a vector of zeros, and $I_{p \times p}$ be an identity matrix. We further set the significance level as 0.05, and use Monte Carlo simulations to estimate the empirical sizes and powers. We consider here the number of Monte Carlo simulations to be

Discussion

In this paper, we propose an efficient nonparametric test by projecting large-scale data onto low-dimensional spaces to cope with the issue of high dimensionality. We advocate using random selection because it achieves the goal of dimension reduction while simultaneously preserving as much information as possible from the two random samples. Random selections and projections make more use of the dependence structure, which helps the test deal with dependent variables. In addition, the asymptotic null

Acknowledgments

This work was supported by the Beijing Natural Science Foundation (Z20001, Z19J00009), the Research Funds for the Major Innovation Platform of Public Health, Disease Control and Prevention, Renmin University of China, and National Natural Science Foundation of China (11971478, 11731011, 11931014).

References (28)

  • Gravier, E., et al. A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes Chromosom. Cancer (2010).
  • Hall, P., et al. Martingale Limit Theory and Applications. Academic Press (1980).
  • Harbeck, N., et al. Breast cancer. Nat. Rev. Dis. Primers (2019).
  • Hendrick, R.E., et al. Breast cancer deaths averted over 3 decades. Cancer (2019).