Large-scale dependent multiple testing via hidden semi-Markov models

  • Original Paper
  • Computational Statistics

Abstract

Large-scale multiple testing is common in the statistical analysis of high-dimensional data. Conventional multiple testing procedures usually implicitly assume that the tests are independent. However, this assumption rarely holds in practical applications, particularly in “high-throughput” data analysis. Incorporating information on the dependence structure among tests can improve statistical power and the interpretability of discoveries. In this paper, we propose a new large-scale dependent multiple testing procedure based on the hidden semi-Markov model (HSMM), which characterizes local correlations among tests using a semi-Markov process instead of a first-order Markov chain. Our approach allows the number of consecutive null hypotheses to follow any reasonable distribution, enabling a more accurate description of complex local correlations. We show that the proposed procedure, referred to as SMLIS, minimizes the marginal false non-discovery rate (mFNR) at the same marginal false discovery rate (mFDR) level. To reduce the computational complexity of the HSMM, we approximate it by a hidden Markov model (HMM) with an expanded state space. We provide a forward-backward algorithm and an expectation-maximization (EM) algorithm for implementing the proposed procedure. Finally, we demonstrate the superior performance of the SMLIS procedure through extensive simulations and a real data analysis.

References

  • Barras L, Scaillet O, Wermers R (2010) False discoveries in mutual fund performance: measuring luck in estimated alphas. J Financ 65(1):179–216
  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Series B Stat Methodol 57(1):289–300
  • Benjamini Y, Hochberg Y (2000) On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 25(1):60–83
  • Cai TT, Sun W, Wang W (2019) Covariate-assisted ranking and screening for large-scale two-sample inference. J Royal Stat Soc Series B Stat Methodol 81(2):187–234
  • Clarke S, Hall P (2009) Robustness of multiple testing procedures against dependence. Ann Stat 37(1):332–358
  • Cui T, Wang P, Zhu W (2021) Covariate-adjusted multiple testing in genome-wide association studies via factorial hidden Markov models. TEST 30(3):737–757
  • Efron B (2004) Large-scale simultaneous hypothesis testing. J Am Stat Assoc 99(465):96–104
  • Efron B (2008) Microarrays, empirical Bayes, and the two-groups model. Stat Sci 23(1):1–22
  • Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96(456):1151–1160
  • Felix EFO, Buhat CAH, Mamplata JB (2022) Poisson hidden Markov model on earthquake occurrences in Metro Manila, Philippines. Earth Sci Inf 15(3):1635–1645
  • Finner H, Dickhaus T, Roters M (2007) Dependency and false discovery rate: Asymptotics. Ann Stat 35(4):1432–1455
  • Fu L, Gang B, James GM, Sun W (2022) Heteroscedasticity-adjusted ranking and thresholding for large-scale multiple testing. J Am Stat Assoc 117(538):1028–1040
  • Gang B, Sun W, Wang W (2023) Structure-adaptive sequential testing for online false discovery rate control. J Am Stat Assoc 118(541):732–745
  • Genovese C, Wasserman L (2002) Operating characteristics and extensions of the false discovery rate procedure. J Royal Stat Soc Series B Stat Methodol 64(3):499–517
  • Guédon Y (2005) Hidden hybrid Markov/semi-Markov chains. Comput Stat Data Anal 49(3):663–688
  • Kuan PF, Chiang DY (2012) Integrating prior knowledge in multiple testing under dependence with applications to detecting differential DNA methylation. Biometrics 68(3):774–783
  • Langrock R, Zucchini W (2011) Hidden Markov models with arbitrary state dwell-time distributions. Comput Stat Data Anal 55(1):715–724
  • Lei L, Fithian W (2018) AdaPT: an interactive procedure for multiple testing with side information. J Royal Stat Soc Series B Stat Methodol 80(4):649–679
  • Liu J, Zhang C, Page D (2016) Multiple testing under dependence via graphical models. Ann Appl Stat 10(3):1699–1724
  • Owen AB (2005) Variance of the number of false discoveries. J Royal Stat Soc Series B Stat Methodol 67(3):411–426
  • Roeder K, Wasserman L (2009) Genome-wide significance levels and weighted hypothesis testing. Stat Sci 24(4):398–413
  • Russell M, Cook A (1987) Experimental evaluation of duration modelling techniques for automatic speech recognition. In: ICASSP ’87. IEEE international conference on acoustics, speech, and signal processing: 2376–2379
  • Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510):421–427
  • Schwartzman A, Lin X (2011) The effect of correlation in false discovery rate estimation. Biometrika 98(1):199–214
  • Schwartzman A, Dougherty RF, Taylor JE (2008) False discovery rate analysis of brain diffusion direction maps. Ann Appl Stat 2(1):153–175
  • Shu H, Nan B, Koeppe R (2015) Multiple testing for neuroimaging via hidden Markov random field. Biometrics 71(3):741–750
  • Storey JD (2002) A direct approach to false discovery rates. J Royal Stat Soc Series B Stat Methodol 64(3):479–498
  • Sun W, Cai TT (2009) Large-scale multiple testing under dependence. J Royal Stat Soc Series B Stat Methodol 71(2):393–424
  • Sun W, Reich BJ, Cai TT, Guindani M, Schwartzman A (2015) False discovery control in large-scale spatial multiple testing. J Royal Stat Soc Series B Stat Methodol 77(1):59–83
  • The International HapMap Consortium (2003) The international HapMap project. Nature 426:789–796
  • Wang J, Cui T, Zhu W, Wang P (2023) Covariate-modulated large-scale multiple testing under dependence. Comput Stat Data Anal 180:107664
  • Wang P, Zhu W (2019) Replicability analysis in genome-wide association studies via Cartesian hidden Markov models. BMC Bioinf 20(1):146
  • Wang X, Shojaie A, Zou J (2019) Bayesian hidden Markov models for dependent large-scale multiple testing. Comput Stat Data Anal 136:123–136
  • Wei Z, Sun W, Wang K, Hakonarson H (2009) Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics 25(21):2802–2808

Funding

This work was supported by a grant from the Department of Education of Liaoning Province (Grant No. LN2020Q23).

Author information

Correspondence to Pengfei Wang.

Appendix

A.1 Proof of Theorem 1

Proof

Note that \(\Lambda _k(\varvec{Z})=\textrm{SMLIS}_k(\varvec{Z})/ (1-\textrm{SMLIS}_k(\varvec{Z}))\) satisfies the monotone ratio condition (MRC) defined in Sun and Cai (2009) and \(\Lambda _k(\varvec{Z})\) is strictly increasing in \(\textrm{SMLIS}_k(\varvec{Z})\). By Theorem 1 of Sun and Cai (2009), it follows that \(\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right)\) is strictly increasing in c. Since the PDF of \(\{Z_{i}\}^m_{i=1}\) is continuous, the PDF and the cumulative distribution function (CDF) of \(\textrm{SMLIS}_k(\varvec{Z})\) are continuous. Note that

$$\begin{aligned} \textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right) =\dfrac{\sum \nolimits ^m_{k=1}{\mathbb {P}}_{\varvec{\vartheta }} (\textrm{SMLIS}_k(\varvec{Z})<c, \theta _k=0)}{\left( \sum \nolimits ^m_{k=1} {\mathbb {P}}_{\varvec{\vartheta }}(\textrm{SMLIS}_k(\varvec{Z})<c)\right) \vee 1}. \end{aligned}$$

Hence \(\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right)\) is continuous with respect to c. Also note that

$$\begin{aligned} \lim \limits _{c \rightarrow 0}\textrm{mFDR}\left( {\varvec{\delta }} (\textrm{SMLIS}(\varvec{Z}), c)\right) =0, \end{aligned}$$

and

$$\begin{aligned} \lim \limits _{c \rightarrow +\infty }\textrm{mFDR}\left( {\varvec{\delta }} (\textrm{SMLIS}(\varvec{Z}), c)\right) =1. \end{aligned}$$

Hence, for any \(0<\alpha <1\), the set \(\{t:\textrm{mFDR} ({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),t))\le \alpha \}\) is nonempty. Let

$$\begin{aligned} c_{\alpha } = \sup \{t: \textrm{mFDR}({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),t)) \le \alpha \}, \end{aligned}$$

then we have \(\textrm{mFDR}({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha }))=\alpha\). \(\square\)
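The cutoff \(c_{\alpha }\) above is defined through the population mFDR, which is not available from data. A common plug-in analogue is the step-up rule that Sun and Cai (2009) use for the LIS procedure: sort the statistics in increasing order and reject the k smallest, where k is the largest index whose running mean stays at or below \(\alpha\). The sketch below assumes that the SMLIS values have already been computed; the function name `step_up_reject` and the toy input are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def step_up_reject(stats: Sequence[float], alpha: float) -> List[bool]:
    """Reject the k smallest statistics, where k is the largest index such
    that the mean of the k smallest values is at most alpha."""
    order = sorted(range(len(stats)), key=lambda i: stats[i])
    running_sum = 0.0
    k = 0
    for rank, idx in enumerate(order, start=1):
        running_sum += stats[idx]
        if running_sum / rank <= alpha:
            k = rank  # keep the largest rank that satisfies the bound
    decisions = [False] * len(stats)
    for idx in order[:k]:
        decisions[idx] = True
    return decisions


# Toy posterior null probabilities (hypothetical values, for illustration only).
toy_stats = [0.01, 0.90, 0.05, 0.30, 0.02, 0.75]
toy_decisions = step_up_reject(toy_stats, alpha=0.10)
```

Because the running mean of sorted values is nondecreasing, the rejected set is always a prefix of the ordering, which mirrors the monotonicity of the mFDR in c used in the proof.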

A.2 Proof of Theorem 2

Proof

Let \(\Lambda _k(\varvec{Z})={\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})/{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\). Since \(\Lambda _k(\varvec{Z})=\textrm{SMLIS}_k(\varvec{Z})/ (1-\textrm{SMLIS}_k(\varvec{Z}))\), \(\Lambda _k(\varvec{Z})\) is strictly increasing in \(\textrm{SMLIS}_k(\varvec{Z})\) and \({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha })\) can be re-expressed as

$$\begin{aligned} {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),c_{\alpha }) =\left\{ I_{(\Lambda _1(\varvec{Z})<c^{*}_{\alpha })}, I_{(\Lambda _2(\varvec{Z})<c^{*}_{\alpha })},\cdots , I_{(\Lambda _m(\varvec{Z})<c^{*}_{\alpha })}\right\} , \end{aligned}$$

where \(c^{*}_{\alpha }=c_{\alpha }/(1-c_{\alpha })\). In view of the definition of \(\Lambda _k(\varvec{Z})\), it is easy to get

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}-I_{(T_k(\varvec{Z})<c)}\right) \left( {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})-c^{*}_{\alpha }{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \le 0. \end{aligned}$$
(A.1)

This implies that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( (1-I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })})-(1-I_{(T_k(\varvec{Z})<c)})\right) \left( 1-(1+c^{*}_{\alpha }){\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \ge 0. \end{aligned}$$
(A.2)
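For readers checking the step from (A.1) to (A.2), it relies only on the fact that the two posterior probabilities sum to one. Written out as a short worked identity:

$$\begin{aligned} {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})-c^{*}_{\alpha }{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})&=1-(1+c^{*}_{\alpha }){\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}),\\ \left( 1-I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}\right) -\left( 1-I_{(T_k(\varvec{Z})<c)}\right)&=-\left( I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}-I_{(T_k(\varvec{Z})<c)}\right) . \end{aligned}$$

Substituting both lines into (A.1) changes the sign of every summand, which turns the bound \(\le 0\) into \(\ge 0\).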

By the assumption that \(\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha })\right) =\alpha\), we have

$$\begin{aligned} \frac{\textrm{E}\left[ \sum \nolimits ^m_{k=1}I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})\right] }{\textrm{E}\left[ \sum \nolimits ^m_{k=1} I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })} ({\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z}) +{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})) \right] }=\alpha . \end{aligned}$$

It follows that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z}) <c^{*}_{\alpha })}\left( {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})-\frac{\alpha }{1-\alpha } {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})\right) \right] =0. \end{aligned}$$
(A.3)

Similarly, the condition \(\textrm{mFDR}\left( \widetilde{\varvec{\delta }}(T(\varvec{Z}), c)\right) \le \alpha\) yields

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(T_k(\varvec{Z})<c)} \left( {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z}) -\frac{\alpha }{1-\alpha }{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \le 0. \end{aligned}$$
(A.4)

Combining (A.3) with (A.4), we obtain that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( I_{(\Lambda _k (\varvec{Z})<c^{*}_{\alpha })}-I_{(T_k(\varvec{Z})<c)}\right) \left( {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z}) -\frac{\alpha }{1-\alpha }{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \ge 0. \end{aligned}$$
(A.5)

Then (A.1) and (A.5) imply that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })} -I_{(T_k(\varvec{Z})<c)}\right) \left( \left( \frac{\alpha }{1-\alpha } -c^{*}_{\alpha }\right) {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \le 0. \end{aligned}$$
(A.6)

Note also that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}\left( {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})-c^{*}_{\alpha } {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1 \mid \varvec{Z})\right) \right] <0. \end{aligned}$$

Combining this with (A.3), we have

$$\begin{aligned} \frac{\alpha }{1-\alpha } = \frac{\sum \nolimits ^m_{k=1}\textrm{E} \left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })} {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})\right] }{\sum \nolimits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right] } < c^{*}_{\alpha }. \end{aligned}$$

Thus (A.6) implies that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] \ge \sum \limits ^m_{k=1}\textrm{E} \left[ I_{(T_k(\varvec{Z})<c)}{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z}) \right] . \end{aligned}$$

It follows that

$$\begin{aligned} \frac{1}{\sum \nolimits ^m_{k=1}\textrm{E} \left[ \left( 1-I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}\right) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] } \ge \frac{1}{\sum \nolimits ^m_{k=1}\textrm{E} \left[ \left( 1-I_{(T_k(\varvec{Z})<c)}\right) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] }. \end{aligned}$$
(A.7)

Combining (A.2) with (A.7), we have

$$\begin{aligned}{} & {} \frac{ \sum \nolimits ^m_{k=1}\textrm{E}\left[ \left( 1-I_{(\Lambda _k (\varvec{Z})<c^{*}_{\alpha })}\right) \left( 1-(1+c^{*}_{\alpha }) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})\right) \right] }{\sum \nolimits ^m_{k=1}\textrm{E}\left[ \left( 1-I_{(\Lambda _k (\varvec{Z})<c^{*}_{\alpha })}\right) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] } \\{} & {} \quad \ge \frac{\sum \nolimits ^m_{k=1}\textrm{E} \left[ \left( 1-I_{(T_k(\varvec{Z})<c)}\right) \left( 1-(1+c^{*}_{\alpha }){\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] }{\sum \nolimits ^m_{k=1}\textrm{E}\left[ \left( 1-I_{(T_k (\varvec{Z})<c)}\right) {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z}) \right] }. \end{aligned}$$

It follows that

$$\begin{aligned} \frac{1-(1+c^{*}_{\alpha })\textrm{mFNR}\left( {\varvec{\delta }} (\textrm{SMLIS}(\varvec{Z}),c_{\alpha })\right) }{\textrm{mFNR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),c_{\alpha })\right) } \ge \frac{1-(1+c^{*}_{\alpha })\textrm{mFNR}(\widetilde{\varvec{\delta }} (T(\varvec{Z}), c))}{\textrm{mFNR}(\widetilde{\varvec{\delta }} (T(\varvec{Z}), c))}. \end{aligned}$$

Note that \(\dfrac{1-(1+c^{*}_{\alpha })x}{x}\) is strictly decreasing in x and hence

$$\begin{aligned} \textrm{mFNR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),c_{\alpha })\right) \le \textrm{mFNR}(\widetilde{\varvec{\delta }}(T(\varvec{Z}), c)). \end{aligned}$$

\(\square\)

A.3 The expression of the intermediate variables

The intermediate variables in the E-step of the tth iteration can be expressed as:

$$\begin{aligned}{} & {} \gamma _{i}^{(t)}(p)=\frac{\alpha _{i}^{(t)}(p) \beta _{i}^{(t)}(p)}{\sum \nolimits _{p=1}^{m_{0}+m_{1}} \alpha _{i}^{(t)}(p)\beta _{i}^{(t)}(p)}, \text { for }p=1, \cdots , m_{0}+m_{1};\\{} & {} \psi _{i}^{(t)}(0)=\sum \limits _{p=1}^{m_{0}} \gamma _{i}^{(t)}(p);\\{} & {} \psi _{i}^{(t)}(l)=\frac{\pi _{1l}^{(t)}f(z_{i}\mid \mu _{l}^{(t)}, {\sigma _{l}^{2}}^{(t)})}{\sum \nolimits _{l=1}^{L}\pi _{1l}^{(t)} f(z_{i}\mid \mu _{l}^{(t)}, {\sigma _{l}^{2}}^{(t)})} \sum \nolimits _{p=m_{0}+1}^{m_{0}+m_{1}}\gamma _{i}^{(t)}(p), \text { for }l=1, 2, \ldots , L;\\{} & {} \eta _{i}^{(t)}(p,q)\\{} & {} \quad = \frac{\alpha _{i}^{(t)}(p)a^{(t)}(p,q)f(z_{i+1} \mid \mu _{0}^{(t)}, {\sigma _{0}^{2}}^{(t)})^{I_{(q\in S_0)}} f(z_{i+1}\mid \mu _{q}^{(t)}, {\sigma _{q}^{2}}^{(t)})^{I_{(q\in S_1)}} \beta _{i+1}^{(t)}(q)}{\sum \nolimits _{p=1}^{m_{0}+m_{1}} \sum \nolimits _{q=1}^{m_{0}+m_{1}}\alpha _{i}^{(t)}(p) a^{(t)}(p,q)f(z_{i+1}\mid \mu _{0}^{(t)}, {\sigma _{0}^{2}}^{(t)})^{I_{(q\in S_0)}} f(z_{i+1} \mid \mu _{q}^{(t)}, {\sigma _{q}^{2}}^{(t)})^{I_{(q\in S_1)}} \beta _{i+1}^{(t)}(q)}, \end{aligned}$$

where \(\alpha _{i}^{(t)}(p) ={\mathbb {P}}_{\varvec{\widehat{\varphi }}^{(t)}}(\theta ^{\star }_{i} =p, \{z_{j}\}_{j=1}^{i})\), \(\beta _{i}^{(t)}(p) ={\mathbb {P}}_{\varvec{\widehat{\varphi }}^{(t)}} (\{z_{j}\}_{j=i+1}^{m}\mid \theta ^{\star }_{i}=p)\), and \(a^{(t)}(p,q)\) is the \((p,q)\)th element of \({\mathcal {A}}^{(t)}\).
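To make the roles of \(\gamma\), \(\eta\) and \(\psi (0)\) concrete, here is a minimal sketch of a scaled forward-backward pass for a generic finite-state HMM. It is not the authors' implementation. The three-state chain, its transition matrix, the observations, and the use of a single normal density for the non-null state (rather than the L-component mixture) are all hypothetical choices made for illustration. States 0 and 1 play the role of null sub-states (\(S_0\)) and state 2 the non-null sub-state (\(S_1\)).

```python
import math
from typing import List, Sequence, Tuple


def normal_pdf(z: float, mu: float, sigma: float) -> float:
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def forward_backward(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emis: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[List[List[float]]]]:
    """Scaled forward-backward pass.

    init[p]     : P(state_1 = p)
    trans[p][q] : P(state_{i+1} = q | state_i = p)
    emis[i][p]  : density of observation i under state p
    Returns gamma[i][p] and eta[i][p][q] (eta has one fewer row than gamma).
    """
    m, K = len(emis), len(init)
    alpha = [[0.0] * K for _ in range(m)]
    beta = [[1.0] * K for _ in range(m)]
    scale = [0.0] * m

    # Forward recursion; each row is normalised to avoid numerical underflow.
    for i in range(m):
        for q in range(K):
            prior = init[q] if i == 0 else sum(alpha[i - 1][p] * trans[p][q] for p in range(K))
            alpha[i][q] = prior * emis[i][q]
        scale[i] = sum(alpha[i])
        alpha[i] = [a / scale[i] for a in alpha[i]]

    # Backward recursion, reusing the forward scaling constants.
    for i in range(m - 2, -1, -1):
        for p in range(K):
            beta[i][p] = sum(
                trans[p][q] * emis[i + 1][q] * beta[i + 1][q] for q in range(K)
            ) / scale[i + 1]

    gamma = [[alpha[i][p] * beta[i][p] for p in range(K)] for i in range(m)]
    eta = [
        [[alpha[i][p] * trans[p][q] * emis[i + 1][q] * beta[i + 1][q] / scale[i + 1]
          for q in range(K)] for p in range(K)]
        for i in range(m - 1)
    ]
    return gamma, eta


# Hypothetical toy model: null sub-states {0, 1} emit N(0, 1), state 2 emits N(2, 1).
z = [0.1, -0.3, 2.4, 2.1, 0.2]
init = [1.0, 0.0, 0.0]
trans = [[0.0, 0.7, 0.3],
         [0.0, 0.8, 0.2],
         [0.5, 0.0, 0.5]]
emis = [[normal_pdf(zi, 0.0, 1.0), normal_pdf(zi, 0.0, 1.0), normal_pdf(zi, 2.0, 1.0)] for zi in z]

gamma, eta = forward_backward(init, trans, emis)
psi0 = [g[0] + g[1] for g in gamma]  # posterior weight on the null sub-states
```

With this scaling, each row of `gamma` sums to one, so the explicit normalising denominator in the expression for \(\gamma _{i}^{(t)}(p)\) is absorbed into the scaling constants. Summing `gamma` over the null sub-states gives the analogue of \(\psi _{i}^{(t)}(0)\).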

A.4 The additional simulation results of Scenario 1

In this section, we start by conducting simulation studies corresponding to Scenario 1 in the cases where \(L=2\) is known and unknown, respectively. For Scenario 1, consider the following model settings:

Setting S1: fix \(n_1=4\), \(\pi _b=0.4\), \(\mu _1=1.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\mu _2\) from 2 to 3 with an increment 0.5;

Setting S2: fix \(\mu _1=1.5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(\pi _b=0.4\) and vary \(n_1\) from 4 to 8 with an increment 2;

Setting S3: fix \(\mu _1=1.5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(n_1=6\) and vary \(\pi _b\) from 0.35 to 0.45 with an increment 0.05.

The simulation results where \(L=2\) is known and unknown are displayed in Figs. 6 and 7, respectively. The results are consistent with those in Sect. 3, and hence we do not repeat the discussion here.

Fig. 6. Simulation results corresponding to Scenario 1 where \(L=2\) is known (shifted binomial distribution)

Fig. 7. Simulation results corresponding to Scenario 1 where \(L=2\) is unknown (shifted binomial distribution)

A.5 The additional simulation results of Scenario 2

Then we perform simulations corresponding to Scenario 2 in the cases where \(L=2\) is known and unknown, respectively. For Scenario 2, consider the following model settings:

Setting S4: fix \(\lambda =5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\mu _1\) from 1 to 2 with an increment 0.5;

Setting S5: fix \(\mu _1=1.5\), \(\mu _2=2.5\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\lambda\) from 3 to 7 with an increment 2.

The results where \(L=2\) is known and unknown are displayed in Figs. 8 and 9, respectively.

Fig. 8. Simulation results corresponding to Scenario 2 where \(L=2\) is known (shifted Poisson distribution)

Fig. 9. Simulation results corresponding to Scenario 2 where \(L=2\) is unknown (shifted Poisson distribution)

A.6 The simulation results for Scenarios 3 and 4

Fix \(L=2\) and consider the following scenarios.

  1. Scenario 3:

    the PMF of the shifted negative binomial distribution

    $$\begin{aligned} d_0(r) = \frac{\Gamma (r+n_2-1)}{(r-1)!\Gamma (n_2)} \pi _{nb}^{n_2} (1-\pi _{nb})^{r-1}, r=1, 2, \cdots ; \end{aligned}$$
  2. Scenario 4:

    the PMF of the distribution with unstructured start and geometric tail

    $$\begin{aligned} d_0(r) = \left\{ \begin{array}{lr} \delta _0(r), &{} \text {for } r\le q, \\ \delta _0(q)\left( \frac{1-\sum _{k=1}^q\delta _0(k)}{1-\sum _{k=1}^{q-1}\delta _0(k)}\right) ^{r-q}, &{} \text {for } r> q. \end{array}\right. \end{aligned}$$
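The two duration PMFs above can be checked numerically. The sketch below evaluates both, using the parameter values of Settings S6 and S9 as inputs, and confirms that each sums to one. It also computes the hazard \({\mathbb {P}}(\text {duration}=r\mid \text {duration}\ge r)\) of the Scenario 4 distribution, which is constant for \(r\ge q\). That constant hazard is what makes the tail geometric, and it appears to be the property that expanded-state HMM approximations such as Langrock and Zucchini (2011) rely on. The function names and the truncation point 400 are illustrative assumptions.

```python
import math
from typing import Sequence


def shifted_neg_binomial_pmf(r: int, n2: float, pi_nb: float) -> float:
    """d_0(r) for Scenario 3, r = 1, 2, ... (computed on the log scale)."""
    log_coef = math.lgamma(r + n2 - 1) - math.lgamma(r) - math.lgamma(n2)
    return math.exp(log_coef + n2 * math.log(pi_nb) + (r - 1) * math.log(1.0 - pi_nb))


def geometric_tail_pmf(r: int, delta0: Sequence[float]) -> float:
    """d_0(r) for Scenario 4; delta0 holds delta_0(1), ..., delta_0(q)."""
    q = len(delta0)
    if r <= q:
        return delta0[r - 1]
    ratio = (1.0 - sum(delta0)) / (1.0 - sum(delta0[:-1]))
    return delta0[-1] * ratio ** (r - q)


def hazard(pmf_values: Sequence[float], r: int) -> float:
    """P(duration = r | duration >= r) from a list d_0(1), d_0(2), ..."""
    survival = 1.0 - sum(pmf_values[: r - 1])
    return pmf_values[r - 1] / survival


# Parameter values taken from Settings S6 and S9; truncation at r < 400 is arbitrary.
nb = [shifted_neg_binomial_pmf(r, n2=4, pi_nb=0.4) for r in range(1, 400)]
gt = [geometric_tail_pmf(r, [0.1, 0.1, 0.1, 0.3]) for r in range(1, 400)]

nb_total = sum(nb)
gt_total = sum(gt)
tail_hazards = [hazard(gt, r) for r in range(4, 10)]  # r = q, q + 1, ...
```

For Setting S9 the hazard from \(r=q=4\) onwards equals \(0.3/0.7\), so the tail ratio in the formula is simply one minus that hazard.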

For Scenario 3, consider the following model settings:

Setting S6: fix \(n_2=4\), \(\pi _{nb}=0.4\), \(\mu _1=1\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\) and vary \(\mu _2\) from 1.5 to 2.5 with an increment 0.5;

Setting S7: fix \(\mu _1=1\), \(\mu _2=2\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(\pi _{nb}=0.4\) and vary \(n_2\) from 3 to 5 with an increment 1;

Setting S8: fix \(\mu _1=1\), \(\mu _2=2\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), \(n_2=4\) and vary \(\pi _{nb}\) from 0.35 to 0.45 with an increment 0.05.

The corresponding simulation results where \(L=2\) is known and unknown are displayed in Figs. 10 and 11, respectively.

Fig. 10. Simulation results corresponding to Scenario 3 where \(L=2\) is known (shifted negative binomial distribution)

Fig. 11. Simulation results corresponding to Scenario 3 where \(L=2\) is unknown (shifted negative binomial distribution)

For Scenario 4, consider the following model settings:

Setting S9: fix \(\left( \delta _0(1), \delta _0(2), \delta _0(3), \delta _0(4)\right) =(0.1, 0.1, 0.1, 0.3)\), \(q=4\), \(\mu _1=1\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), and vary \(\mu _2\) from 1.5 to 2.5 with an increment 0.5;

Setting S10: fix \(\left( \delta _0(1), \delta _0(2), \delta _0(3)\right) =(0.1, 0.1, 0.1)\), \(q=4\), \(\mu _1=1\), \(\mu _2=2\), \(\sigma _1=\sigma _2=1\), \(\pi _{11}=\pi _{12}=0.5\), and vary \(\delta _0(4)\) from 0.2 to 0.4 with an increment 0.1.

The corresponding simulation results where \(L=2\) is known and unknown are displayed in Figs. 12 and 13, respectively. As before, the oracle and data-driven SMLIS procedures perform very similarly overall, and both have higher power than the competing methods. Although the FDR yielded by the data-driven SMLIS procedure slightly exceeds 0.1, it is still acceptable.

Fig. 12. Simulation results corresponding to Scenario 4 where \(L=2\) is known (distribution with unstructured start and geometric tail)

Fig. 13. Simulation results corresponding to Scenario 4 where \(L=2\) is unknown (distribution with unstructured start and geometric tail)

A.7 The robustness of the SMLIS procedure

In this section, we carry out simulations to investigate the robustness of the SMLIS procedure when observations are generated from the HMM. Specifically, the underlying states of hypotheses \(\{\theta _i\}^m_{i=1}\) are generated from a Markov chain with the initial probabilities \({\mathbb {P}}(\theta _1=0)=1\), \({\mathbb {P}}(\theta _1=1)=0\), and transition probability matrix \(A=(0.8, 0.2; a_{10}, 1-a_{10})\). The observations \(\{z_i\}^m_{i=1}\) are generated from the two-component mixture model (4) with \(F_0\sim N(0, 1)\). Without loss of generality, consider the following settings with \(L=1\) and \(m=5000\).

Setting S11: fix \(\mu _1=2\), \(\sigma _1=1\), and vary \(a_{10}\) from 0.1 to 0.4 with an increment 0.025;

Setting S12: fix \(a_{10}=0.25\), \(\sigma _1=1\), and vary \(\mu _1\) from 1.5 to 2.5 with an increment 0.1.

The simulation results are shown in Fig. 14. We conclude that the data-driven SMLIS procedure performs almost the same as the LIS procedure even when the observations are generated from the HMM.
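The data-generating mechanism of this section is simple enough to sketch directly. The code below draws hidden states from the two-state Markov chain with \({\mathbb {P}}(\theta _1=0)=1\) and transition matrix \(A=(0.8, 0.2; a_{10}, 1-a_{10})\), then draws \(z_i\) from \(N(0,1)\) under the null and from \(N(\mu _1, \sigma _1^2)\) otherwise (the case \(L=1\)). The parameter point \(a_{10}=0.25\), \(\mu _1=2\) lies in both the S11 and S12 grids. The function name `simulate_hmm` and the seed are illustrative assumptions, and this is not the authors' simulation code.

```python
import random
from typing import List, Tuple


def simulate_hmm(
    m: int, a10: float, mu1: float, sigma1: float, seed: int = 0
) -> Tuple[List[int], List[float]]:
    """Simulate hidden states theta and observations z from a two-state HMM."""
    rng = random.Random(seed)
    a01 = 0.2  # P(theta_{i+1} = 1 | theta_i = 0), from the first row of A
    theta = [0]  # the chain starts in the null state with probability one
    for _ in range(m - 1):
        if theta[-1] == 0:
            theta.append(1 if rng.random() < a01 else 0)
        else:
            theta.append(0 if rng.random() < a10 else 1)
    # Null observations are N(0, 1); non-null observations are N(mu1, sigma1^2).
    z = [rng.gauss(0.0, 1.0) if t == 0 else rng.gauss(mu1, sigma1) for t in theta]
    return theta, z


theta, z = simulate_hmm(m=5000, a10=0.25, mu1=2.0, sigma1=1.0)
nonnull_fraction = sum(theta) / len(theta)
```

For a long chain, the non-null fraction should be close to the stationary value \(0.2/(0.2+a_{10})\), which gives a quick sanity check on the simulated states.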

Fig. 14. Simulation results corresponding to Settings S11-S12

A.8 The empirical PDF of the non-null

In this section, we plot the empirical probability density function (PDF) of the non-null, that is, the empirical PDF of the z-values of the associated SNPs in Simulation II. Fig. 15 shows clearly that the empirical PDF of the non-null is bimodal, and hence we select \(L=2\) for both the SMLIS and LIS procedures.

Fig. 15. The empirical PDF of the non-null

A.9 The analysis of the running time of SMLIS

In this section, to test the scalability and efficiency of the SMLIS procedure, we examine its running time as the number of tests m varies from 5000 to 10000. Without loss of generality, consider Settings 1–3. The corresponding results are displayed in Fig. 16, which shows that the running time of the SMLIS procedure is approximately linear in m when the other parameters are held fixed. This suggests that the running time remains acceptable even with tens of thousands of tests.

Fig. 16. The running time of the SMLIS procedure with m varied from 5000 to 10000

Cite this article

Wang, J., Wang, P. Large-scale dependent multiple testing via hidden semi-Markov models. Comput Stat 39, 1093–1126 (2024). https://doi.org/10.1007/s00180-023-01367-z
