Large-scale dependent multiple testing via hidden semi-Markov models

Wang, Jiangzhou; Wang, Pengfei

doi:10.1007/s00180-023-01367-z

Large-scale dependent multiple testing via hidden semi-Markov models

Original Paper
Published: 31 May 2023

Volume 39, pages 1093–1126, (2024)
Cite this article

Computational Statistics Aims and scope Submit manuscript

220 Accesses
Explore all metrics

Abstract

Large-scale multiple testing is common in the statistical analysis of high-dimensional data. Conventional multiple testing procedures usually implicitly assumed that the tests are independent. However, this assumption is rarely established in many practical applications, particularly in “high-throughput” data analysis. Incorporating dependence structure information among tests can improve statistical power and interpretability of discoveries. In this paper, we propose a new large-scale dependent multiple testing procedure based on the hidden semi-Markov model (HSMM), which characterizes local correlations among tests using a semi-Markov process instead of a first-order Markov chain. Our novel approach allows for the number of consecutive null hypotheses to follow any reasonable distribution, enabling a more accurate description of complex local correlations. We show that the proposed procedure minimizes the marginal false non-discovery rate (mFNR) at the same marginal false discovery rate (mFDR) level. To reduce the computational complexity of the HSMM, we make use of the hidden Markov model (HMM) with an expanded state space to approximate it. We provide a forward-backward algorithm and an expectation-maximization (EM) algorithm for implementing the proposed procedure. Finally, we demonstrate the superior performance of the SMLIS procedure through extensive simulations and a real data analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing

Article Open access 31 July 2019

Multiple Hypothesis Tests: A Bayesian Approach

Selective inference for false discovery proportion in a hidden Markov model

Article 14 September 2023

References

Barras L, Scaillet O, Wermers R (2010) False discoveries in mutual fund performance: measuring luck in estimated alphas. J Financ 65(1):179–216
Article Google Scholar
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Series B Stat Methodol 57(1):289–300
MathSciNet Google Scholar
Benjamini Y, Hochberg Y (2000) On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat 25(1):60–83
Article Google Scholar
Cai TT, Sun W, Wang W (2019) Covariate-assisted ranking and screening for large-scale two-sample inference. J Royal Stat Soc Series B Stat Methodol 81(2):187–234
Article MathSciNet Google Scholar
Clarke S, Hall P (2009) Robustness of multiple testing procedures against dependence. Ann Stat 37(1):332–358
Article MathSciNet Google Scholar
Cui T, Wang P, Zhu W (2021) Covariate-adjusted multiple testing in genome-wide association studies via factorial hidden Markov models. TEST 30(3):737–757
Article MathSciNet Google Scholar
Efron B (2004) Large-scale simultaneous hypothesis testing. J Am Stat Assoc 99(465):96–104
Article Google Scholar
Efron B (2008) Microarrays, empirical Bayes, and the two-groups model. Stat Sci 23(1):1–22
MathSciNet Google Scholar
Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96(456):1151–1160
Article MathSciNet Google Scholar
Felix EFO, Buhat CAH, Mamplata JB (2022) Poisson hidden Markov model on earthquake occurrences in metro manila, philippines. Earth Sci Inf 15(3):1635–1645
Article Google Scholar
Finner H, Dickhaus T, Roters M (2007) Dependency and false discovery rate: Asymptotics. Ann Stat 35(4):1432–1455
Article MathSciNet Google Scholar
Fu L, Gang B, James GM, Sun W (2022) Heteroscedasticity-adjusted ranking and thresholding for large-scale multiple testing. J Am Stat Assoc 117(538):1028–1040
Article MathSciNet Google Scholar
Gang B, Sun W, Wang W (2023) Structure-adaptive sequential testing for online false discovery rate control. J Am Stat Assoc 118(541):732–745
Article MathSciNet Google Scholar
Genovese C, Wasserman L (2002) Operating characteristics and extensions of the false discovery rate procedure. J Royal Stat Soc Series B Stat Methodol 64(3):499–517
Article MathSciNet Google Scholar
Guédon Y (2005) Hidden hybrid Markov/semi-Markov chains. Comput Stat Data Anal 49(3):663–688
Article MathSciNet Google Scholar
Kuan PF, Chiang DY (2012) Integrating prior knowledge in multiple testing under dependence with applications to detecting differential DNA methylation. Biometrics 68(3):774–783
Article MathSciNet Google Scholar
Langrock R, Zucchini W (2011) Hidden Markov models with arbitrary state dwell-time distributions. Comput Stat Data Anal 55(1):715–724
Article MathSciNet Google Scholar
Lei L, Fithian W (2018) AdaPT: an interactive procedure for multiple testing with side information. J Royal Stat Soc Series B Stat Methodol 80(4):649–679
Article MathSciNet Google Scholar
Liu J, Zhang C, Page D (2016) Multiple testing under dependence via graphical models. Ann Appl Stat 10(3):1699–1724
Article MathSciNet Google Scholar
Owen AB (2005) Variance of the number of false discoveries. J Royal Stat Soc Series B Stat Methodol 67(3):411–426
Article MathSciNet Google Scholar
Roeder K, Wasserman L (2009) Genome-wide significance levels and weighted hypothesis testing. Stat Sci 24(4):398–413
Article MathSciNet Google Scholar
Russell M, Cook A (1987) Experimental evaluation of duration modelling techniques for automatic speech recognition. In: ICASSP ’87. IEEE international conference on acoustics, speech, and signal processing: 2376–2379
Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014) Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510):421–427
Article Google Scholar
Schwartzman A, Lin X (2011) The effect of correlation in false discovery rate estimation. Biometrika 98(1):199–214
Article MathSciNet Google Scholar
Schwartzman A, Dougherty RF, Taylor JE (2008) False discovery rate analysis of brain diffusion direction maps. Ann Appl Stat 2(1):153–175
Article MathSciNet Google Scholar
Shu H, Nan B, Koeppe R (2015) Multiple testing for neuroimaging via hidden Markov random field. Biometrics 71(3):741–750
Article MathSciNet Google Scholar
Storey JD (2002) A direct approach to false discovery rates. J Royal Stat Soc Series B Stat Methodol 64(3):479–498
Article MathSciNet Google Scholar
Sun W, Cai TT (2009) Large-scale multiple testing under dependence. J Royal Stat Soc Series B Stat Methodol 71(2):393–424
Article MathSciNet Google Scholar
Sun W, Reich BJ, Cai TT, Guindani M, Schwartzman A (2015) False discovery control in large-scale spatial multiple testing. J Royal Stat Soc Series B Stat Methodol 77(1):59–83
Article MathSciNet Google Scholar
The International HapMap Consortium (2003) The international hapmap project. Nature 426:789–796
Article Google Scholar
Wang J, Cui T, Zhu W, Wang P (2023) Covariate-modulated large-scale multiple testing under dependence. Comput Stat Data Anal 180:107664
Article MathSciNet Google Scholar
Wang P, Zhu W (2019) Replicability analysis in genome-wide association studies via Cartesian hidden Markov models. BMC Bioinf 20(1):146
Article Google Scholar
Wang X, Shojaie A, Zou J (2019) Bayesian hidden Markov models for dependent large-scale multiple testing. Comput Stat Data Anal 136:123–136
Article MathSciNet Google Scholar
Wei Z, Sun W, Wang K, Hakonarson H (2009) Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics 25(21):2802–2808
Article Google Scholar

Download references

Funding

This work was supported by Department of Education of Liaoning Province Grant (Grant No. LN2020Q23)

Author information

Authors and Affiliations

College of Mathematics and Statistics, Institute of Statistical Sciences, Shenzhen University, Shenzhen, 518060, China
Jiangzhou Wang
School of Statistics, Dongbei University of Finance and Economics, Dalian, 116025, China
Pengfei Wang

Authors

Jiangzhou Wang
View author publications
You can also search for this author in PubMed Google Scholar
Pengfei Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Pengfei Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 A.1 Proof of Theorem 1

Proof

Note that $\Lambda _k(\varvec{Z})=\textrm{SMLIS}_k(\varvec{Z})/ (1-\textrm{SMLIS}_k(\varvec{Z}))$ satisfies the monotone ratio condition (MRC) defined in Sun and Cai (2009) and $\Lambda _k(\varvec{Z})$ is strictly increasing in $\textrm{SMLIS}_k(\varvec{Z})$. By Theorem 1 of Sun and Cai (2009), it follows that $\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right)$ is strictly increasing in c. Since the PDF of $\{Z_{i}\}^m_{i=1}$ is continuous, the PDF and the cumulative distribution function (CDF) of $\textrm{SMLIS}_k(\varvec{Z})$ are continuous. Note that

$$\begin{aligned} \textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right) =\dfrac{\sum \nolimits ^m_{k=1}{\mathbb {P}}_{\varvec{\vartheta }} (\textrm{SMLIS}_k(\varvec{Z})<c, \theta _k=0)}{\left( \sum \nolimits ^m_{k=1} {\mathbb {P}}_{\varvec{\vartheta }}(\textrm{SMLIS}_k(\varvec{Z})<c)\right) \vee 1}. \end{aligned}$$

Hence $\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c)\right)$ is continuous with respect to c. Also note that

$$\begin{aligned} \lim \limits _{c \rightarrow 0}\textrm{mFDR}\left( {\varvec{\delta }} (\textrm{SMLIS}(\varvec{Z}), c)\right) =0, \end{aligned}$$

and

$$\begin{aligned} \lim \limits _{c \rightarrow +\infty }\textrm{mFDR}\left( {\varvec{\delta }} (\textrm{SMLIS}(\varvec{Z}), c)\right) =1. \end{aligned}$$

Hence, for any $0<\alpha <1$, the set $\{t:\textrm{mFDR} ({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),t))\le \alpha \}$ is nonempty. Let

$$\begin{aligned} c_{\alpha } = \sup \{t: \textrm{mFDR}({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),t)) \le \alpha \}, \end{aligned}$$

then we have $\textrm{mFDR}({\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha }))=\alpha$. $\square$

1.2 A.2 Proof of Theorem 2

Proof

Let $\Lambda _k(\varvec{Z})={\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})/{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})$. Since $\Lambda _k(\varvec{Z})=\textrm{SMLIS}_k(\varvec{Z})/ (1-\textrm{SMLIS}_k(\varvec{Z}))$, $\Lambda _k(\varvec{Z})$ is strictly increasing in $\textrm{SMLIS}_k(\varvec{Z})$ and ${\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha })$ can be re-expressed as

$$\begin{aligned} {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),c_{\alpha }) =\left\{ I_{(\Lambda _1(\varvec{Z})<c^{*}_{\alpha })}, I_{(\Lambda _2(\varvec{Z})<c^{*}_{\alpha })},\cdots , I_{(\Lambda _m(\varvec{Z})<c^{*}_{\alpha })}\right\} , \end{aligned}$$

where $c^{*}_{\alpha }=c_{\alpha }/(1-c_{\alpha })$. In view of the definition of $\Lambda _k(\varvec{Z})$, it is easy to get

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}-I_{(T_k(\varvec{Z})<c)}\right) \left( {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})-c^{*}_{\alpha }{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \le 0. \end{aligned}$$

(A.1)

This implies that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( (1-I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })})-(1-I_{(T_k(\varvec{Z})<c)})\right) \left( 1-(1+c^{*}_{\alpha }){\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \ge 0. \end{aligned}$$

(A.2)

By the assumption that $\textrm{mFDR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}), c_{\alpha })\right) =\alpha$, we have

$$\begin{aligned} \frac{\textrm{E}\left[ \sum \nolimits ^m_{k=1}I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})\right] }{\textrm{E}\left[ \sum \nolimits ^m_{k=1} I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })} ({\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z}) +{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})) \right] }=\alpha . \end{aligned}$$

It follows that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z}) <c^{*}_{\alpha })}\left( {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})-\frac{\alpha }{1-\alpha } {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})\right) \right] =0. \end{aligned}$$

(A.3)

Similarly, the condition $\textrm{mFDR}\left( \widetilde{\varvec{\delta }}(T(\varvec{Z}), c)\right) \le \alpha$ yields

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(T_k(\varvec{Z})<c)} \left( {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z}) -\frac{\alpha }{1-\alpha }{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \le 0. \end{aligned}$$

(A.4)

Combining (A.3) with (A.4), we obtain that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( I_{(\Lambda _k (\varvec{Z})<c^{*}_{\alpha })}-I_{(T_k(\varvec{Z})<c)}\right) \left( {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z}) -\frac{\alpha }{1-\alpha }{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \ge 0. \end{aligned}$$

(A.5)

Then (A.1) and (A.5) imply that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ \left( I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })} -I_{(T_k(\varvec{Z})<c)}\right) \left( \left( \frac{\alpha }{1-\alpha } -c^{*}_{\alpha }\right) {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] \le 0. \end{aligned}$$

(A.6)

Note also that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}\left( {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=0\mid \varvec{Z})-c^{*}_{\alpha } {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1 \mid \varvec{Z})\right) \right] <0. \end{aligned}$$

Combining this with (A.3), we have

$$\begin{aligned} \frac{\alpha }{1-\alpha } = \frac{\sum \nolimits ^m_{k=1}\textrm{E} \left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })} {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=0\mid \varvec{Z})\right] }{\sum \nolimits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right] } < c^{*}_{\alpha }. \end{aligned}$$

Thus (A.6) implies that

$$\begin{aligned} \sum \limits ^m_{k=1}\textrm{E}\left[ I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}{\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] \ge \sum \limits ^m_{k=1}\textrm{E} \left[ I_{(T_k(\varvec{Z})<c)}{\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z}) \right] . \end{aligned}$$

It follows that

$$\begin{aligned} \frac{1}{\sum \nolimits ^m_{k=1}\textrm{E} \left[ \left( 1-I_{(\Lambda _k(\varvec{Z})<c^{*}_{\alpha })}\right) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] } \ge \frac{1}{\sum \nolimits ^m_{k=1}\textrm{E} \left[ \left( 1-I_{(T_k(\varvec{Z})<c)}\right) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] }. \end{aligned}$$

(A.7)

Combining (A.2) with (A.7), we have

$$\begin{aligned}{} & {} \frac{ \sum \nolimits ^m_{k=1}\textrm{E}\left[ \left( 1-I_{(\Lambda _k (\varvec{Z})<c^{*}_{\alpha })}\right) \left( 1-(1+c^{*}_{\alpha }) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z})\right) \right] }{\sum \nolimits ^m_{k=1}\textrm{E}\left[ \left( 1-I_{(\Lambda _k (\varvec{Z})<c^{*}_{\alpha })}\right) {\mathbb {P}}_{\varvec{\vartheta }}(\theta _k=1\mid \varvec{Z}) \right] } \\{} & {} \quad \ge \frac{\sum \nolimits ^m_{k=1}\textrm{E} \left[ \left( 1-I_{(T_k(\varvec{Z})<c)}\right) \left( 1-(1+c^{*}_{\alpha }){\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z})\right) \right] }{\sum \nolimits ^m_{k=1}\textrm{E}\left[ \left( 1-I_{(T_k (\varvec{Z})<c)}\right) {\mathbb {P}}_{\varvec{\vartheta }} (\theta _k=1\mid \varvec{Z}) \right] }. \end{aligned}$$

It follows that

$$\begin{aligned} \frac{1-(1+c^{*}_{\alpha })\textrm{mFNR}\left( {\varvec{\delta }} (\textrm{SMLIS}(\varvec{Z}),c_{\alpha })\right) }{\textrm{mFNR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),c_{\alpha })\right) } \ge \frac{1-(1+c^{*}_{\alpha })\textrm{mFNR}(\widetilde{\varvec{\delta }} (T(\varvec{Z}), c)}{\textrm{mFNR}(\widetilde{\varvec{\delta }} (T(\varvec{Z}), c))}. \end{aligned}$$

Note that $\dfrac{1-(1+c^{*}_{\alpha })x}{x}$ is strictly decreasing in x and hence

$$\begin{aligned} \textrm{mFNR}\left( {\varvec{\delta }}(\textrm{SMLIS}(\varvec{Z}),c_{\alpha })\right) \le \textrm{mFNR}(\widetilde{\varvec{\delta }}(T(\varvec{Z}), c)). \end{aligned}$$

$\square$

1.3 A.3 The expression of the intermediate variables

The intermediate variables in the E-step of the tth iteration can be expressed as:

$$\begin{aligned}{} & {} \gamma _{i}^{(t)}(p)=\frac{\alpha _{i}^{(t)}(p) \beta _{i}^{(t)}(p)}{\sum \nolimits _{p=1}^{m_{0}+m_{1}} \alpha _{i}^{(t)}(p)\beta _{i}^{(t)}(p)}, \text { for }p=1, \cdots , m_{0}+m_{1};\\{} & {} \psi _{i}^{(t)}(0)=\sum \limits _{p=1}^{m_{0}} \gamma _{i}^{(t)}(p);\\{} & {} \psi _{i}^{(t)}(l)=\frac{\pi _{1l}^{(t)}f(z_{i}\mid \mu _{l}^{(t)}, {\sigma _{l}^{2}}^{(t)})}{\sum \nolimits _{l=1}^{L}\pi _{1l}^{(t)} f(z_{i}\mid \mu _{l}^{(t)}, {\sigma _{l}^{2}}^{(t)})} \sum \nolimits _{p=m_{0}+1}^{m_{0}+m_{1}}\gamma _{i}^{(t)}(p), \text { for }l=1, 2, \ldots , L;\\{} & {} \eta _{i}^{(t)}(p,q)\\{} & {} \quad = \frac{\alpha _{i}^{(t)}(p)a^{(t)}(p,q)f(z_{i+1} \mid \mu _{0}^{(t)}, {\sigma _{0}^{2}}^{(t)})^{I_{(q\in S_0)}} f(z_{i+1}\mid \mu _{q}^{(t)}, {\sigma _{q}^{2}}^{(t)})^{I_{(q\in S_1)}} \beta _{i+1}^{(t)}(q)}{\sum \nolimits _{p=1}^{m_{0}+m_{1}} \sum \nolimits _{q=1}^{m_{0}+m_{1}}\alpha _{i}^{(t)}(p) a^{(t)}(p,q)f(z_{i+1}\mid \mu _{0}^{(t)}, {\sigma _{0}^{2}}^{(t)})^{I_{(q\in S_0)}} f(z_{i+1} \mid \mu _{q}^{(t)}, {\sigma _{q}^{2}}^{(t)})^{I_{(q\in S_1)}} \beta _{i+1}^{(t)}(q)}, \end{aligned}$$

where $\alpha _{i}^{(t)}(p) ={\mathbb {P}}_{\varvec{\widehat{\varphi }}^{(t)}}(\theta ^{\star }_{i} =p, \{z_{i}\}_{j=1}^{i})$, $\beta _{i}^{(t)}(p) ={\mathbb {P}}_{\varvec{\widehat{\varphi }}^{(t)}} (\{z_{j}\}_{j=i+1}^{m}\mid \theta ^{\star }_{i}=p)$, and $a^{(t)}(p,q)$ is the (p, q)th element of ${\mathcal {A}}^{(t)}$.

1.4 A.4 The additional simulation results of Scenario 1

In this section, we start by conducting simulation studies corresponding to Scenario 1 in the cases where $L=2$ is known and unknown, respectively. For Scenario 1, consider the following model settings:

Setting S1: fix $n_1=4$, $\pi _b=0.4$, $\mu _1=1.5$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$ and vary $\mu _2$ from 2 to 3 with an increment 0.5;

Setting S2: fix $\mu _1=1.5$, $\mu _2=2.5$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$, $\pi _b=0.4$ and vary $n_1$ from 4 to 8 with an increment 2;

Setting S3: fix $\mu _1=1.5$, $\mu _2=2.5$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$, $n_1=6$ and vary $\pi _b$ from 0.35 to 0.45 with an increment 0.05.

The simulation results where $L=2$ is known and unknown are displayed in Figs. 6 and 7, respectively. The results are consistent with those in Sect. 3 and hence we will not repeat.

1.5 A.5 The additional simulation results of Scenario 2

Then we perform simulations corresponding to Scenario 2 in the cases where $L=2$ is known and unknown, respectively. For Scenario 2, consider the following model settings:

Setting S4: fix $\lambda =5$, $\mu _2=2.5$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$ and vary $\mu _1$ from 1 to 2 with an increment 0.5;

Setting S5: fix $\mu _1=1.5$, $\mu _2=2.5$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$ and vary $\lambda$ from 3 to 7 with an increment 2.

The results where $L=2$ is known and unknown are displayed in Figs. 8 and 9, respectively.

1.6 A.6 The simulation results for Scenarios 3 and 4

Fix $L=2$ and consider the following scenarios.

Scenario 3:
the PMF of shifted negative binomial distribution
$$\begin{aligned} d_0(r) = \frac{\Gamma (r+n_2-1)}{(r-1)!\Gamma (n_2)} \pi _{nb}^{n_2} (1-\pi _{nb})^{r-1}, r=1, 2, \cdots ; \end{aligned}$$
Scenario 4:
the PMF of the distribution with unstructured start and geometric tail
$$\begin{aligned} d_0(r) = \left\{ \begin{array}{lr} \delta _0(r), &{} \text {for } r\le q, \\ \delta _0(q)\left( \frac{1-\sum _{k=1}^q\delta _0(k)}{1-\sum _{k=1}^{q-1}\delta _0(k)}\right) ^{r-q}, &{} \text {for } r> q. \end{array}\right. \end{aligned}$$

For Scenario 3, consider the following model settings:

Setting S6: fix $n_2=4$, $\pi _{nb}=0.4$, $\mu _1=1$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$ and vary $\mu _2$ from 1.5 to 2.5 with an increment 0.5;

Setting S7: fix $\mu _1=1$, $\mu _2=2$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$, $\pi _{nb}=0.4$ and vary $n_2$ from 3 to 5 with an increment 1;

Setting S8: fix $\mu _1=1$, $\mu _2=2$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$, $n_2=4$ and vary $\pi _{nb}$ from 0.35 to 0.45 with an increment 0.05.

The corresponding simulation results where $L=2$ is known and unknown are displayed in Figs. 10 and 11, respectively.

For Scenario 4, consider the following model settings:

Setting S9: fix $\left( \delta _0(1), \delta _0(2), \delta _0(3), \delta _0(4)\right) =(0.1, 0.1, 0.1, 0.3)$, $q=4$, $\mu _1=1$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$, and vary $\mu _2$ from 1.5 to 2.5 with an increment 0.5;

Setting S10: fix $\left( \delta _0(1), \delta _0(2), \delta _0(3)\right) =(0.1, 0.1, 0.1)$, $q=4$, $\mu _1=1$, $\mu _2=2$, $\sigma _1=\sigma _2=1$, $\pi _{11}=\pi _{12}=0.5$, and vary $\delta _0(4)$ from 0.2 to 0.4 with an increment 0.1.

The corresponding simulation results where $L=2$ is known and unknown are displayed in Figs. 12 and 13, respectively. Similarly, we can conclude that the oracle and data-driven SMLIS procedures perform very similar overall and both have higher power than their competing methods. Although the FDR yielded by the data-driven SMLIS procedure exceeds 0.1 slightly, it is still acceptable.

1.7 A.7 The robustness of the SMLIS procedure

In this section, we carry out simulations to investigate the robustness of the SMLIS procedure when observations are generated from the HMM. Specifically, the underlying states of hypotheses $\{\theta _i\}^m_{i=1}$ are generated from a Markov chain with the initial probabilities ${\mathbb {P}}(\theta _1=0)=1$, ${\mathbb {P}}(\theta _1=1)=0$, and transition probability matrix $A=(0.8, 0.2; a_{10}, 1-a_{10})$. The observations $\{z_i\}^m_{i=1}$ are generated from the two-component mixture model (4) with $F_0\sim N(0, 1)$. Without loss of generality, consider the following settings with $L=1$ and $m=5000$.

Setting S11: fix $\mu _1=2$, $\sigma _1=1$, and vary $a_{10}$ from 0.1 to 0.4 with an increment 0.025;

Setting S12: fix $a_{10}=0.25$, $\sigma _1=1$, and vary $\mu _1$ from 1.5 to 2.5 with an increment 0.1.

The simulation results are listed in Fig. 14. We can conclude that the data-driven SMLIS procedure performs almost the same as the LIS procedure even if observations are generated from the HMM.

1.8 A.8 The empirical PDF of the non-null

In this section, we plot the empirical probability density function (PDF) of the non-null, that is, the empirical PDF of the z-values of the associated SNPs in Simulation II. It is clear to see from Fig. 15 that the empirical PDF for the non-null is bimodal and hence we select $L=2$ for both the SMLIS and LIS procedures.

1.9 A.9 The analysis of the running time of SMLIS

In this section, to test the scalability and efficiency of the SMLIS procedure, we further examine its running time with the number of tests m varied from 5000 to 10000. Without loss of generality, consider the Settings 1–3. The corresponding simulation results are displayed in Fig. 16. We can see from Fig. 16 that the running time of the SMLIS procedure is approximated as a linear function of m for the same settings of the other parameters. This indicates that even with tens of thousands of tests, the running time of the SMLIS procedure is still acceptable.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wang, J., Wang, P. Large-scale dependent multiple testing via hidden semi-Markov models. Comput Stat 39, 1093–1126 (2024). https://doi.org/10.1007/s00180-023-01367-z

Download citation

Received: 09 July 2022
Accepted: 09 May 2023
Published: 31 May 2023
Issue Date: May 2024
DOI: https://doi.org/10.1007/s00180-023-01367-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Large-scale dependent multiple testing via hidden semi-Markov models

Abstract

Access this article

Similar content being viewed by others

Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing

Multiple Hypothesis Tests: A Bayesian Approach

Selective inference for false discovery proportion in a hidden Markov model

References

Funding