Abstract
For the general linear hypothesis testing problem for high-dimensional data, several interesting tests have been proposed in the literature. Most of them impose strong assumptions on the underlying covariance matrix so that their test statistics are asymptotically normally distributed under the null hypothesis. In practice, however, these strong assumptions may not be satisfied, or can hardly be checked, so these tests are often applied blindly in real data analysis. Their empirical sizes may then be much larger or smaller than the nominal size. For these tests, this is a size control problem that cannot be overcome simply by letting the sample size tend to infinity. To overcome this difficulty, this paper proposes and studies a new normal-reference test using the centralized \(L^2\)-norm based test statistic with a three-cumulant matched chi-square approximation. Some theoretical discussion and two simulation studies demonstrate that, in terms of size control, the new normal-reference test performs very well regardless of whether the high-dimensional data are nearly uncorrelated, moderately correlated, or highly correlated, and that it substantially outperforms two existing competitors. Two real high-dimensional data examples motivate and illustrate the new normal-reference test.
References
Ahdesmäki M, Strimmer K et al (2010) Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. Ann Appl Stat 4(1):503–519
Bai ZD, Saranadasa H (1996) Effect of high dimension: by an example of a two sample problem. Stat Sin 6(2):311–329
Burczynski ME, Peterson RL, Twine NC, Zuberek KA, Brodeur BJ, Casciotti L, Maganti V, Reddy PS, Strahs A, Immermann F, Spinelli W, Schwertschlag U, Slager AM, Cotreau MM, Dorner AJ (2006) Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. J Mol Diagn 8(1):51–61
Cai TT, Xia Y (2014) High-dimensional sparse MANOVA. J Multivar Anal 131:174–196. https://doi.org/10.1016/j.jmva.2014.07.002
Chen SX, Qin YL (2010) A two-sample test for high-dimensional data with applications to gene-set testing. Ann Stat 38(2):808–835
Fujikoshi Y, Himeno T, Wakaki H (2004) Asymptotic results of a high dimensional MANOVA test and power comparison when the dimension is large compared to the sample size. J Jpn Stat Soc 34(1):19–26. https://doi.org/10.14490/jjss.34.19
Jiang B, Ye C, Liu JS (2015) Nonparametric k-sample tests via dynamic slicing. J Am Stat Assoc 110(510):642–653. https://doi.org/10.1080/01621459.2014.920257
Mukhopadhyay S, Wang K (2020) A nonparametric approach to high-dimensional k-sample comparison problems. Biometrika 107(3):555–572
Nishiyama T, Hyodo M, Seo T, Pavlenko T (2013) Testing linear hypotheses of mean vectors for high-dimension data with unequal covariance matrices. J Stat Plan Inference 143(11):1898–1911. https://doi.org/10.1016/j.jspi.2013.07.008
Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C et al (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870):436–442
Schott JR (2007) Some high-dimensional tests for a one-way MANOVA. J Multivar Anal 98(9):1825–1839
Srivastava MS, Fujikoshi Y (2006) Multivariate analysis of variance with fewer observations than the dimension. J Multivar Anal 97(9):1927–1940. https://doi.org/10.1016/j.jmva.2005.08.010 (special Issue dedicated to Prof. Fujikoshi)
Srivastava MS, Kubokawa T (2013) Tests for multivariate analysis of variance in high dimension under non-normality. J Multivar Anal 115:204–216. https://doi.org/10.1016/j.jmva.2012.10.011
van der Vaart AW (1998) Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press. https://doi.org/10.1017/CBO9780511802256
Yamada T, Himeno T (2015) Testing homogeneity of mean vectors under heteroscedasticity in high-dimension. J Multivar Anal 139:7–27. https://doi.org/10.1016/j.jmva.2015.02.005
Zhang JT (2005) Approximate and asymptotic distributions of chi-squared-type mixtures with applications. J Am Stat Assoc 100(469):273–285
Zhang JT (2011) Two-way MANOVA with unequal cell sizes and unequal cell covariance matrices. Technometrics 53(4):426–439
Zhang JT, Xu J (2009) On the k-sample Behrens-Fisher problem for high-dimensional data. Sci China Ser A Math 52(6):1285–1304
Zhang JT, Guo J, Zhou B (2017) Linear hypothesis testing in high-dimensional one way MANOVA. J Multivar Anal 155:200–216
Zhang JT, Guo J, Zhou B, Cheng MY (2020a) A simple two-sample test in high dimensions based on \(L^2\)-norm. J Am Stat Assoc 115(530):1011–1027
Zhang JT, Zhou B, Guo J (2020b) Testing high-dimensional mean vector with applications: a normal reference approach. Manuscript
Zhou B, Guo J, Zhang JT (2017) High-dimensional general linear hypothesis testing under heteroscedasticity. J Stat Plan Inference 188:36–54
Acknowledgements
The work was supported by the National University of Singapore academic research grant R-155-000-187-114. The authors thank the Editor, the Associate Editor and an anonymous reviewer for their insightful comments and suggestions.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: Technical proofs
Proof of Theorem 1
We first prove the first expression in (a). Note that as \(n\rightarrow \infty \), we have \(\text{ tr }({\hat{{{{\varvec{\varSigma }}}}}})/\text{tr }({{{\varvec{\varSigma }}}})\rightarrow 1\) in probability for all p. We can write \(T_{n,p,0}=S_{n,p,0}[1+o_p(1)]\) and \({\tilde{T}}_{n,p,0}={\tilde{S}}_{n,p,0}[1+o_p(1)]\) where
Therefore, it is sufficient to show
\[{\tilde{S}}_{n,p,0}{\mathop {\longrightarrow }\limits ^{L}}\zeta , \qquad \text{(A.1)}\]
where \(\zeta \) is defined in Theorem 1(a). To this end, set \(\mathbf{w}_{i}=\sqrt{n_{i}}({\bar{{\mathbf{y}}}}_{i}-{{\varvec{\mu }}}_{i}),\ i=1,\dots ,k\) and let \(\mathbf{w}=(\mathbf{w}_{1}^\top ,\ldots ,\mathbf{w}_{k}^\top )^\top \). We have
Then we can write
where
It is easy to see that we have \(\mathbf{B}\mathbf{B}^\top =\mathbf{I}_{q}\) and \(\mathbf{B}^\top \mathbf{B}\) is idempotent with \(\text{ tr }(\mathbf{B}^\top \mathbf{B})=q\).
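The two properties of \(\mathbf{B}\) stated above can be checked numerically. The sketch below is only an illustration: the particular matrix `B` (for k = 3 groups and q = 2 orthonormal contrasts) is a hypothetical choice, not the matrix defined in the proof, and the helper names are introduced here for convenience.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two equal-shape matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical q x k matrix with orthonormal rows (q = 2, k = 3); an assumption
# made for illustration, not the B constructed in the paper.
B = [
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],
]
q = len(B)

BBt = matmul(B, transpose(B))      # q x q, expected to be the identity
P = matmul(transpose(B), B)        # k x k, expected to be idempotent
identity_q = [[1.0 if i == j else 0.0 for j in range(q)] for i in range(q)]

err_identity = max_abs_diff(BBt, identity_q)    # distance of B B^T from I_q
err_idempotent = max_abs_diff(matmul(P, P), P)  # distance of P^2 from P
trace_P = sum(P[i][i] for i in range(len(P)))   # expected to equal q

print(err_identity, err_idempotent, trace_P)
```

Any matrix with orthonormal rows behaves the same way, since \(\mathbf{B}^\top \mathbf{B}\) is then the orthogonal projection onto the row space of \(\mathbf{B}\), which has dimension q.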
To prove (A.1), set \(\mathbf{w}_{n,p}=(\mathbf{B}\otimes \mathbf{I}_{p})\mathbf{w}\). We have \(\text{ E }(\mathbf{w}_{n,p})=\mathbf{0}\) and \(\text{ Cov }(\mathbf{w}_{n,p})=\mathbf{I}_q\otimes {{{\varvec{\varSigma }}}}\). Let \(\mathbf{b}_l^\top , l=1,\ldots , q\) be the rows of \(\mathbf{B}\) so that \(\mathbf{B}=[\mathbf{b}_1,\ldots ,\mathbf{b}_q]^\top \). Set \(\mathbf{w}_{n,p,l}=(\mathbf{b}_l^\top \otimes \mathbf{I}_p)\mathbf{w}, l=1,\ldots , q\). Then \(\mathbf{w}_{n,p,l}, l=1,\ldots , q\) are uncorrelated with \(\text{ E }(\mathbf{w}_{n,p,l})=0\) and \(\text{ Cov }(\mathbf{w}_{n,p,l})={{{\varvec{\varSigma }}}}\). It follows that \(S_{n,p,0}=\sum _{l=1}^q\Vert \mathbf{w}_{n,p,l}\Vert ^2-q\text{ tr }({{{\varvec{\varSigma }}}})\).
Let \(\mathbf{u}_{p,r},\ r=1,\dots ,p\) denote the eigenvectors associated with the decreasing-ordered eigenvalues \(\lambda _{p,r},\ r=1,\dots ,p\) of \({{{\varvec{\varSigma }}}}\). We have \(\mathbf{w}_{n,p,l}=\sum _{r=1}^{p}\xi _{l,r}^{(n,p)}\mathbf{u}_{p,r}\) where \(\xi _{l,r}^{(n,p)}=\mathbf{w}_{n,p,l}^\top \mathbf{u}_{p,r}.\) It is known that \(\xi _{l,r}^{(n,p)},\ r=1,\ldots ,p\) are uncorrelated and \(\text{ E }(\xi _{l,r}^{(n,p)})=0\) and \(\text{ Var }(\xi _{l,r}^{(n,p)})=\lambda _{p,r},\ r=1,\ldots ,p\). Set \(\mathbf{h}_l=(\mathbf{b}_l\otimes \mathbf{I}_{p})\mathbf{u}_{p,r}=(\mathbf{h}_{l,1}^\top ,\ldots ,\mathbf{h}_{l,k}^\top )^\top \) where \(\mathbf{h}_{l,i},\ i=1,\dots ,k\) are \(p\times 1\) vectors. Then we have \(\xi _{l,r}^{(n,p)}=\mathbf{w}^\top (\mathbf{b}_l\otimes \mathbf{I}_{p})\mathbf{u}_{p,r}=\sum _{i=1}^{k}\mathbf{h}_{l,i}^\top \mathbf{w}_{i}\). It follows that
Since \(\text{ E }(\mathbf{w}_{i})=\mathbf{0}\), \(\text{ Cov }(\mathbf{w}_{i})={{{\varvec{\varSigma }}}},\;i=1,\dots ,k\) and \(\mathbf{w}_{i},\ i=1,\dots ,k\) are independent, we have
By some algebra, we have
and
In addition, under Conditions C1 and C2, we have
It follows that we have
where \(n_{\min }=\min _{1\le i\le k}n_{i}\) and we use the fact that
Write \({\tilde{S}}_{n,p,0}=\sum _{r=1}^p \left[ \sum _{l=1}^{q}(\xi _{l,r}^{(n,p)})^{2}-q\lambda _{p,r}\right] /\sqrt{2q\text{ tr }({{{\varvec{\varSigma }}}}^{2})}\) and set \({\tilde{S}}_{n,p,0}^{(m)}=\sum _{r=1}^{m}[\sum _{l=1}^q (\xi _{l,r}^{(n,p)})^{2} -q\lambda _{p,r}]/\sqrt{2q\text{ tr }({{{\varvec{\varSigma }}}}^{2})}\) for some \(m<p\). We have \(|\psi _{{\tilde{S}}_{n,p,0}}(t)-\psi _{{\tilde{S}}_{n,p,0}^{(m)}}(t)|\le |t|\big [\text{ E }({\tilde{S}}_{n,p,0}-{\tilde{S}}_{n,p,0}^{(m)})^{2}\big ]^{1/2}\), where \(\psi _{{\tilde{S}}_{n,p,0}}(t)\) and \(\psi _{{\tilde{S}}_{n,p,0}^{(m)}}(t)\) are the characteristic functions of \({\tilde{S}}_{n,p,0}\) and \({\tilde{S}}_{n,p,0}^{(m)}\), respectively. Note that
By (A.3), we have
It follows that
Let t be fixed. By Condition C4, for any fixed q, as \(p\rightarrow \infty \), we have \(\sum _{r=1}^{\infty } \rho _{r}<\infty \) and
By letting \(m\rightarrow \infty \), we further have \( \sum _{r=m+1}^{\infty }\rho _{r}\rightarrow 0.\) Thus, for any given \(\epsilon >0\), there exist \(P_1\), \(M_1\) and \(N_{1}\), depending on t and \(\epsilon \), such that for any \(p\ge P_1\), \(m\ge M_1\) and \(n\ge N_{1}\), we have
For any fixed \(p\ge P_1, m\ge M_1\), by the central limit theorem, it is easy to show that as \(n\rightarrow \infty \), we have \({\tilde{S}}_{n,p,0}^{(m)}{\mathop {\longrightarrow }\limits ^{L}}{\tilde{S}}_{p,0}^{(m)}\) where \({\tilde{S}}_{p,0}^{(m)}= \sum _{r=1}^{m} \rho _{p,r} (A_r-q)/\sqrt{2q}\), since as \(n\rightarrow \infty \), \(\xi _{l,r}^{(n,p)}{\mathop {\longrightarrow }\limits ^{L}}N(0,\lambda _{p,r})\) and the \(\xi _{l,r}^{(n,p)}\) are asymptotically independent for \(r=1,\ldots , m;\; l=1,\ldots , q\). That is, under Condition C3, there exists \(N_{2}\), depending on p, m, t and \(\epsilon \), such that for any \(n\ge N_{2}\) we have
Recall that \(\zeta =\sum _{r=1}^{\infty }\rho _{r}(A_{r}-q)/\sqrt{2q}\). Set \(\zeta ^{(m)}=\sum _{r=1}^{m}\rho _{r}(A_{r}-q)/\sqrt{2q}\). Then, under Condition C4, for any fixed m, as \(p \rightarrow \infty \), we have \({\tilde{S}}_{p,0}^{(m)}{\mathop {\longrightarrow }\limits ^{L}}\zeta ^{(m)}\). That is, there exists a \(P_2\), depending on m, t and \(\epsilon \), such that for any \(p\ge P_2\) we have
Furthermore, we have
which, under Condition C4, tends to 0 as \(m\rightarrow \infty \). Thus, there exists \(M_2\), depending on t and \(\epsilon \), such that for any \(m\ge M_2\) we have
It follows from (A.4)–(A.7) that for any \(n\ge \max (N_{1},N_{2})\), \(p\ge \max (P_1,P_2)\) and \(m\ge \max (M_1,M_2)\) we have
The convergence in distribution of \({\tilde{S}}_{n,p,0}\) to \(\zeta \) given in (A.1) follows as we can let \(\epsilon \rightarrow 0\). The first expression of Theorem 1(a) is then proved.
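The truncation step in the proof above rests on a bound between two characteristic functions, and the final step combines four approximation errors. As a brief sketch, let \(X\) and \(Y\) be generic random variables with finite second moments (symbols introduced here for illustration only). The first display follows from \(|e^{ia}-1|\le |a|\) for real \(a\) and the Cauchy–Schwarz inequality. The second display is a plain triangle inequality; reading its four terms as the quantities controlled in (A.4)–(A.7) is an assumption about how those bounds are combined.

```latex
\begin{align*}
|\psi_X(t)-\psi_Y(t)|
  &= \bigl|\mathrm{E}\bigl(e^{itX}-e^{itY}\bigr)\bigr|
   \le \mathrm{E}\bigl|e^{it(X-Y)}-1\bigr| \\
  &\le |t|\,\mathrm{E}|X-Y|
   \le |t|\,\bigl[\mathrm{E}(X-Y)^{2}\bigr]^{1/2}, \\[6pt]
\bigl|\psi_{\tilde S_{n,p,0}}(t)-\psi_{\zeta}(t)\bigr|
  &\le \bigl|\psi_{\tilde S_{n,p,0}}(t)-\psi_{\tilde S_{n,p,0}^{(m)}}(t)\bigr|
   + \bigl|\psi_{\tilde S_{n,p,0}^{(m)}}(t)-\psi_{\tilde S_{p,0}^{(m)}}(t)\bigr| \\
  &\quad + \bigl|\psi_{\tilde S_{p,0}^{(m)}}(t)-\psi_{\zeta^{(m)}}(t)\bigr|
   + \bigl|\psi_{\zeta^{(m)}}(t)-\psi_{\zeta}(t)\bigr| .
\end{align*}
```

Once each of the four terms is smaller than a fixed fraction of \(\epsilon \) for large enough n, p and m, the left-hand side is smaller than \(\epsilon \), and pointwise convergence of characteristic functions gives convergence in distribution by Lévy's continuity theorem.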
Notice that when the k samples (1) are normally distributed, Conditions C1 and C2 are automatically satisfied so that under Conditions C3 and C4, the second expression of (15) follows immediately since under the Gaussian assumption, we have \(T_{n,p,0}{\mathop {=}\limits ^{d}}T_{n,p,0}^{*}\).
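The idea of a three-cumulant matched chi-square approximation can be illustrated on the limiting variable \(\zeta \). The sketch below assumes that the \(A_r\) are independent \(\chi ^2_q\) variables (Theorem 1(a) is not reproduced here, so this is an assumption), computes the first three cumulants of a finite mixture \(\sum _r\rho _r(A_r-q)/\sqrt{2q}\), and solves for \((\beta _0,\beta _1,d)\) such that \(\beta _0+\beta _1\chi ^2_d\) has the same three cumulants. It is a generic moment-matching sketch, not the paper's estimators; the function names are hypothetical.

```python
import math

def mixture_cumulants(rho, q):
    """First three cumulants of sum_r rho_r * (A_r - q) / sqrt(2q).

    Assumes the A_r are independent chi-square variables with q degrees of
    freedom, whose second and third cumulants are 2q and 8q.
    """
    scale = 1.0 / math.sqrt(2.0 * q)
    k1 = 0.0                                               # each term is centred
    k2 = 2.0 * q * scale ** 2 * sum(r ** 2 for r in rho)   # equals sum of rho_r^2
    k3 = 8.0 * q * scale ** 3 * sum(r ** 3 for r in rho)
    return k1, k2, k3

def match_three_cumulants(k1, k2, k3):
    """Return (beta0, beta1, d) so that beta0 + beta1 * chi2_d matches k1, k2, k3.

    Solves k1 = beta0 + beta1*d, k2 = 2*beta1^2*d, k3 = 8*beta1^3*d; needs k3 > 0.
    """
    beta1 = k3 / (4.0 * k2)
    d = 8.0 * k2 ** 3 / k3 ** 2
    beta0 = k1 - 2.0 * k2 ** 2 / k3
    return beta0, beta1, d

# Illustrative weights (hypothetical): p equal weights with sum of squares 1.
q, p = 3, 50
rho = [1.0 / math.sqrt(p)] * p
beta0, beta1, d = match_three_cumulants(*mixture_cumulants(rho, q))
print(beta0, beta1, d)
```

With equal weights the mixture is an affine function of a single \(\chi ^2_{pq}\) variable, so the matched degrees of freedom should come out as \(pq\) and the approximation is exact in that case. With unequal weights the match is only approximate, and \(d=8K_2^3/K_3^2\) grows as the skewness \(K_3/K_2^{3/2}\) shrinks.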
We now prove (b). Under Conditions C1, C2, C3, C5, and C6, the first expression of (16) can be proved using the martingale difference central limit theorem as in Bai and Saranadasa (1996), and the second expression of (16) can be proved directly by the Lindeberg–Feller central limit theorem.
Finally, (17) can also be shown by Pólya’s theorem (see, e.g., Lemma 2.11 of van der Vaart 1998) using the same argument as in the proof of Eq. (7) in Zhang et al. (2020a). \(\square \)
Proof of Theorem 2
Under the local alternative (23), we have \(T_{n,p}=\left[ T_{n,p,0}+\text{ tr }(\mathbf{M}^\top \mathbf{H}\mathbf{M})\right] [1+o_p(1)]\). In addition, under the given conditions, we have \({\hat{\beta }}_0/\beta _0{\mathop {\longrightarrow }\limits ^{P}}1, {\hat{\beta }}_1/\beta _1{\mathop {\longrightarrow }\limits ^{P}}1\) and \({\hat{d}}/d{\mathop {\longrightarrow }\limits ^{P}}1\) as \(n, p\rightarrow \infty \). We first prove (a). Under Conditions C1, C2, C3, and C4, Theorem 1(a) indicates that as \(n,p\rightarrow \infty \), we have \({\tilde{T}}_{n,p,0}=T_{n,p,0}/\sqrt{\frac{2q(v_k+q)}{v_k}\text{ tr }({{{\varvec{\varSigma }}}}^2)}{\mathop {\longrightarrow }\limits ^{L}}\zeta \) where \(\zeta \) is defined in Theorem 1(a). It follows that as \(n, p\rightarrow \infty \), we have
where \(\mathbf{H}^*\) is defined in (24).
We now prove (b). Under the given conditions, Theorem 1(b) indicates that as \(n\rightarrow \infty \), we have \({\tilde{T}}_{n,p,0} {\mathop {\longrightarrow }\limits ^{L}}{\mathcal {N}}(0,1)\) and \({\tilde{T}}_{n,p,0}^* {\mathop {\longrightarrow }\limits ^{L}}{\mathcal {N}}(0,1)\). Therefore, the skewness of \(T_{n,p,0}^*\) will also tend to 0. This, together with (19), shows that as \(n,p\rightarrow \infty \), we have \(d\rightarrow \infty \) and \((\chi _d^2(\alpha )-d)/\sqrt{2d}\rightarrow z_{\alpha }\) where \(z_{\alpha }\) denotes the upper \(100\alpha \)-percentile of \({\mathcal {N}}(0,1)\). Then by (A.8), as \(n,p\rightarrow \infty \), we have
where \(\varPhi (\cdot )\) denotes the cumulative distribution function of \({\mathcal {N}}(0,1)\). The proof is complete. \(\square \)
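The limit \((\chi _d^2(\alpha )-d)/\sqrt{2d}\rightarrow z_{\alpha }\) used in part (b) can be illustrated numerically. The following sketch uses only the Python standard library; to stay exact without extra dependencies it is restricted to even d, where the chi-square survival function equals a finite Poisson sum, and it finds the upper quantile by bisection. The function names and the chosen values of d are illustrative assumptions.

```python
import math
from statistics import NormalDist

def chi2_sf_even(x, d):
    """P(chi2_d > x) for even d, using the exact finite Poisson-sum identity."""
    if d % 2 != 0:
        raise ValueError("this sketch only handles even degrees of freedom")
    if x <= 0:
        return 1.0
    h = x / 2.0
    log_h = math.log(h)
    # Each term is a Poisson(h) probability, computed on the log scale.
    return sum(math.exp(j * log_h - h - math.lgamma(j + 1)) for j in range(d // 2))

def chi2_upper_quantile(alpha, d, iters=80):
    """Upper 100*alpha percentile of chi2_d (even d) by bisection."""
    lo, hi = 0.0, d + 20.0 * math.sqrt(2.0 * d) + 50.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if chi2_sf_even(mid, d) > alpha:
            lo = mid   # tail probability still too large: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def standardized_quantile(alpha, d):
    """(chi2_d(alpha) - d) / sqrt(2d), the quantity appearing in the proof."""
    return (chi2_upper_quantile(alpha, d) - d) / math.sqrt(2.0 * d)

alpha = 0.05
z_alpha = NormalDist().inv_cdf(1.0 - alpha)   # upper 100*alpha percentile of N(0,1)
gaps = {d: standardized_quantile(alpha, d) - z_alpha for d in (2, 20, 200, 2000)}
for d, gap in gaps.items():
    print(d, round(gap, 4))
```

The printed gaps should shrink towards zero as d grows, which is the sense in which the chi-square critical value becomes interchangeable with the normal one when the skewness of the reference distribution vanishes.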
Cite this article
Zhu, T., Zhang, JT. Linear hypothesis testing in high-dimensional one-way MANOVA: a new normal reference approach. Comput Stat 37, 1–27 (2022). https://doi.org/10.1007/s00180-021-01110-6
DOI: https://doi.org/10.1007/s00180-021-01110-6