Abstract
Large-scale multiple testing is common in the statistical analysis of high-dimensional data. Conventional multiple testing procedures usually implicitly assumed that the tests are independent. However, this assumption is rarely established in many practical applications, particularly in “high-throughput” data analysis. Incorporating dependence structure information among tests can improve statistical power and interpretability of discoveries. In this paper, we propose a new large-scale dependent multiple testing procedure based on the hidden semi-Markov model (HSMM), which characterizes local correlations among tests using a semi-Markov process instead of a first-order Markov chain. Our novel approach allows for the number of consecutive null hypotheses to follow any reasonable distribution, enabling a more accurate description of complex local correlations. We show that the proposed procedure minimizes the marginal false non-discovery rate (mFNR) at the same marginal false discovery rate (mFDR) level. To reduce the computational complexity of the HSMM, we make use of the hidden Markov model (HMM) with an expanded state space to approximate it. We provide a forward-backward algorithm and an expectation-maximization (EM) algorithm for implementing the proposed procedure. Finally, we demonstrate the superior performance of the SMLIS procedure through extensive simulations and a real data analysis.
Similar content being viewed by others
References
Barras L, Scaillet O, Wermers R (2010) False discoveries in mutual fund performance: measuring luck in estimated alphas. J Financ 65(1):179–216
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Series B Stat Methodol 57(1):289–300
Benjamini Y, Hochberg Y (2000) On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 25(1):60–83
Cai TT, Sun W, Wang W (2019) Covariate-assisted ranking and screening for large-scale two-sample inference. J Royal Stat Soc Series B Stat Methodol 81(2):187–234
Clarke S, Hall P (2009) Robustness of multiple testing procedures against dependence. Ann Stat 37(1):332–358
Cui T, Wang P, Zhu W (2021) Covariate-adjusted multiple testing in genome-wide association studies via factorial hidden Markov models. TEST 30(3):737–757
Efron B (2004) Large-scale simultaneous hypothesis testing. J Am Stat Assoc 99(465):96–104
Efron B (2008) Microarrays, empirical Bayes, and the two-groups model. Stat Sci 23(1):1–22
Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96(456):1151–1160
Felix EFO, Buhat CAH, Mamplata JB (2022) Poisson hidden Markov model on earthquake occurrences in metro manila, philippines. Earth Sci Inf 15(3):1635–1645
Finner H, Dickhaus T, Roters M (2007) Dependency and false discovery rate: Asymptotics. Ann Stat 35(4):1432–1455
Fu L, Gang B, James GM, Sun W (2022) Heteroscedasticity-adjusted ranking and thresholding for large-scale multiple testing. J Am Stat Assoc 117(538):1028–1040
Gang B, Sun W, Wang W (2023) Structure-adaptive sequential testing for online false discovery rate control. J Am Stat Assoc 118(541):732–745
Genovese C, Wasserman L (2002) Operating characteristics and extensions of the false discovery rate procedure. J Royal Stat Soc Series B Stat Methodol 64(3):499–517
Guédon Y (2005) Hidden hybrid Markov/semi-Markov chains. Comput Stat Data Anal 49(3):663–688
Kuan PF, Chiang DY (2012) Integrating prior knowledge in multiple testing under dependence with applications to detecting differential DNA methylation. Biometrics 68(3):774–783
Langrock R, Zucchini W (2011) Hidden Markov models with arbitrary state dwell-time distributions. Comput Stat Data Anal 55(1):715–724
Lei L, Fithian W (2018) AdaPT: an interactive procedure for multiple testing with side information. J Royal Stat Soc Series B Stat Methodol 80(4):649–679
Liu J, Zhang C, Page D (2016) Multiple testing under dependence via graphical models. Ann Appl Stat 10(3):1699–1724
Owen AB (2005) Variance of the number of false discoveries. J Royal Stat Soc Series B Stat Methodol 67(3):411–426
Roeder K, Wasserman L (2009) Genome-wide significance levels and weighted hypothesis testing. Stat Sci 24(4):398–413
Russell M, Cook A (1987) Experimental evaluation of duration modelling techniques for automatic speech recognition. In: ICASSP ’87. IEEE international conference on acoustics, speech, and signal processing: 2376–2379
Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510):421–427
Schwartzman A, Lin X (2011) The effect of correlation in false discovery rate estimation. Biometrika 98(1):199–214
Schwartzman A, Dougherty RF, Taylor JE (2008) False discovery rate analysis of brain diffusion direction maps. Ann Appl Stat 2(1):153–175
Shu H, Nan B, Koeppe R (2015) Multiple testing for neuroimaging via hidden Markov random field. Biometrics 71(3):741–750
Storey JD (2002) A direct approach to false discovery rates. J Royal Stat Soc Series B Stat Methodol 64(3):479–498
Sun W, Cai TT (2009) Large-scale multiple testing under dependence. J Royal Stat Soc Series B Stat Methodol 71(2):393–424
Sun W, Reich BJ, Cai TT, Guindani M, Schwartzman A (2015) False discovery control in large-scale spatial multiple testing. J Royal Stat Soc Series B Stat Methodol 77(1):59–83
The International HapMap Consortium (2003) The international hapmap project. Nature 426:789–796
Wang J, Cui T, Zhu W, Wang P (2023) Covariate-modulated large-scale multiple testing under dependence. Comput Stat Data Anal 180:107664
Wang P, Zhu W (2019) Replicability analysis in genome-wide association studies via Cartesian hidden Markov models. BMC Bioinf 20(1):146
Wang X, Shojaie A, Zou J (2019) Bayesian hidden Markov models for dependent large-scale multiple testing. Comput Stat Data Anal 136:123–136
Wei Z, Sun W, Wang K, Hakonarson H (2009) Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics 25(21):2802–2808
Funding
This work was supported by Department of Education of Liaoning Province Grant (Grant No. LN2020Q23)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Appendix
1.1 A.1 Proof of Theorem 1
Proof
Note that \(\Lambda _k(\varvec{Z})=\textrm{SMLIS}_k(\varvec{Z})/ (1-\textrm{SMLIS}_k(\varvec{Z}))\) satisfies the monotone ratio condition (MRC) defined in Sun and Cai (2009) and \(\Lambda _k(\varvec{Z})\) is strictly increasing in \(\textrm{SMLIS}_k(\varvec{Z})\). By Theorem 1 of Sun and Cai (2009), it follows that \(\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right)\) is strictly increasing in c. Since the PDF of \(\{Z_{i}\}^m_{i=1}\) is continuous, the PDF and the cumulative distribution function (CDF) of \(\textrm{SMLIS}_k(\varvec{Z})\) are continuous. Note that
Hence \(\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right)\) is continuous with respect to c. Also note that
and
Hence, for any \(0<\alpha <1\), the set \(\{t:\textrm{mFDR} ({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),t))\le \alpha \}\) is nonempty. Let
then we have \(\textrm{mFDR}({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha }))=\alpha\). \(\square\)
1.2 A.2 Proof of Theorem 2
Proof
Let \(\Lambda _k(\varvec{Z})={\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})/{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\). Since \(\Lambda _k(\varvec{Z})=\textrm{SMLIS}_k(\varvec{Z})/ (1-\textrm{SMLIS}_k(\varvec{Z}))\), \(\Lambda _k(\varvec{Z})\) is strictly increasing in \(\textrm{SMLIS}_k(\varvec{Z})\) and \({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha })\) can be re-expressed as
where \(c^{*}_{\alpha }=c_{\alpha }/(1-c_{\alpha })\). In view of the definition of \(\Lambda _k(\varvec{Z})\), it is easy to get
This implies that
By the assumption that \(\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha })\right) =\alpha\), we have
It follows that
Similarly, the condition \(\textrm{mFDR}\left( \widetilde{\varvec{\delta }}(T(\varvec{Z}), c)\right) \le \alpha\) yields
Combining (A.3) with (A.4), we obtain that
Then (A.1) and (A.5) imply that
Note also that
Combining this with (A.3), we have
Thus (A.6) implies that
It follows that
Combining (A.2) with (A.7), we have
It follows that
Note that \(\dfrac{1-(1+c^{*}_{\alpha })x}{x}\) is strictly decreasing in x and hence
\(\square\)
1.3 A.3 The expression of the intermediate variables
The intermediate variables in the E-step of the tth iteration can be expressed as:
where \(\alpha _{i}^{(t)}(p) ={\mathbb {P}}_{\varvec{\widehat{\varphi }}^{(t)}}(\theta ^{\star }_{i} =p, \{z_{i}\}_{j=1}^{i})\), \(\beta _{i}^{(t)}(p) ={\mathbb {P}}_{\varvec{\widehat{\varphi }}^{(t)}} (\{z_{j}\}_{j=i+1}^{m}\mid \theta ^{\star }_{i}=p)\), and \(a^{(t)}(p,q)\) is the (p, q)th element of \({\mathcal {A}}^{(t)}\).
1.4 A.4 The additional simulation results of Scenario 1
In this section, we start by conducting simulation studies corresponding to Scenario 1 in the cases where \(L=2\) is known and unknown, respectively. For Scenario 1, consider the following model settings:
Setting S1: fix \(n_1=4\), \(\pi _b=0.4\), \(\mu _1=1.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\mu _2\) from 2 to 3 with an increment 0.5;
Setting S2: fix \(\mu _1=1.5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(\pi _b=0.4\) and vary \(n_1\) from 4 to 8 with an increment 2;
Setting S3: fix \(\mu _1=1.5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(n_1=6\) and vary \(\pi _b\) from 0.35 to 0.45 with an increment 0.05.
The simulation results where \(L=2\) is known and unknown are displayed in Figs. 6 and 7, respectively. The results are consistent with those in Sect. 3 and hence we will not repeat.
1.5 A.5 The additional simulation results of Scenario 2
Then we perform simulations corresponding to Scenario 2 in the cases where \(L=2\) is known and unknown, respectively. For Scenario 2, consider the following model settings:
Setting S4: fix \(\lambda =5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\mu _1\) from 1 to 2 with an increment 0.5;
Setting S5: fix \(\mu _1=1.5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\lambda\) from 3 to 7 with an increment 2.
The results where \(L=2\) is known and unknown are displayed in Figs. 8 and 9, respectively.
1.6 A.6 The simulation results for Scenarios 3 and 4
Fix \(L=2\) and consider the following scenarios.
-
Scenario 3:
the PMF of shifted negative binomial distribution
$$\begin{aligned} d_0(r) = \frac{\Gamma (r+n_2-1)}{(r-1)!\Gamma (n_2)} \pi _{nb}^{n_2} (1-\pi _{nb})^{r-1}, r=1, 2, \cdots ; \end{aligned}$$ -
Scenario 4:
the PMF of the distribution with unstructured start and geometric tail
$$\begin{aligned} d_0(r) = \left\{ \begin{array}{lr} \delta _0(r), &{} \text {for } r\le q, \\ \delta _0(q)\left( \frac{1-\sum _{k=1}^q\delta _0(k)}{1-\sum _{k=1}^{q-1}\delta _0(k)}\right) ^{r-q}, &{} \text {for } r> q. \end{array}\right. \end{aligned}$$
For Scenario 3, consider the following model settings:
Setting S6: fix \(n_2=4\), \(\pi _{nb}=0.4\), \(\mu _1=1\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\mu _2\) from 1.5 to 2.5 with an increment 0.5;
Setting S7: fix \(\mu _1=1\), \(\mu _2=2\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(\pi _{nb}=0.4\) and vary \(n_2\) from 3 to 5 with an increment 1;
Setting S8: fix \(\mu _1=1\), \(\mu _2=2\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(n_2=4\) and vary \(\pi _{nb}\) from 0.35 to 0.45 with an increment 0.05.
The corresponding simulation results where \(L=2\) is known and unknown are displayed in Figs. 10 and 11, respectively.
For Scenario 4, consider the following model settings:
Setting S9: fix \(\left( \delta _0(1), \delta _0(2), \delta _0(3), \delta _0(4)\right) =(0.1, 0.1, 0.1, 0.3)\), \(q=4\), \(\mu _1=1\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), and vary \(\mu _2\) from 1.5 to 2.5 with an increment 0.5;
Setting S10: fix \(\left( \delta _0(1), \delta _0(2), \delta _0(3)\right) =(0.1, 0.1, 0.1)\), \(q=4\), \(\mu _1=1\), \(\mu _2=2\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), and vary \(\delta _0(4)\) from 0.2 to 0.4 with an increment 0.1.
The corresponding simulation results where \(L=2\) is known and unknown are displayed in Figs. 12 and 13, respectively. Similarly, we can conclude that the oracle and data-driven SMLIS procedures perform very similar overall and both have higher power than their competing methods. Although the FDR yielded by the data-driven SMLIS procedure exceeds 0.1 slightly, it is still acceptable.
1.7 A.7 The robustness of the SMLIS procedure
In this section, we carry out simulations to investigate the robustness of the SMLIS procedure when observations are generated from the HMM. Specifically, the underlying states of hypotheses \(\{\theta _i\}^m_{i=1}\) are generated from a Markov chain with the initial probabilities \({\mathbb {P}}(\theta _1=0)=1\), \({\mathbb {P}}(\theta _1=1)=0\), and transition probability matrix \(A=(0.8, 0.2; a_{10}, 1-a_{10})\). The observations \(\{z_i\}^m_{i=1}\) are generated from the two-component mixture model (4) with \(F_0\sim N(0, 1)\). Without loss of generality, consider the following settings with \(L=1\) and \(m=5000\).
Setting S11: fix \(\mu _1=2\), \(\sigma _1=1\), and vary \(a_{10}\) from 0.1 to 0.4 with an increment 0.025;
Setting S12: fix \(a_{10}=0.25\), \(\sigma _1=1\), and vary \(\mu _1\) from 1.5 to 2.5 with an increment 0.1.
The simulation results are listed in Fig. 14. We can conclude that the data-driven SMLIS procedure performs almost the same as the LIS procedure even if observations are generated from the HMM.
1.8 A.8 The empirical PDF of the non-null
In this section, we plot the empirical probability density function (PDF) of the non-null, that is, the empirical PDF of the z-values of the associated SNPs in Simulation II. It is clear to see from Fig. 15 that the empirical PDF for the non-null is bimodal and hence we select \(L=2\) for both the SMLIS and LIS procedures.
1.9 A.9 The analysis of the running time of SMLIS
In this section, to test the scalability and efficiency of the SMLIS procedure, we further examine its running time with the number of tests m varied from 5000 to 10000. Without loss of generality, consider the Settings 1–3. The corresponding simulation results are displayed in Fig. 16. We can see from Fig. 16 that the running time of the SMLIS procedure is approximated as a linear function of m for the same settings of the other parameters. This indicates that even with tens of thousands of tests, the running time of the SMLIS procedure is still acceptable.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, J., Wang, P. Large-scale dependent multiple testing via hidden semi-Markov models. Comput Stat 39, 1093–1126 (2024). https://doi.org/10.1007/s00180-023-01367-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-023-01367-z