Controlling the false discovery rate by a Latent Gaussian Copula Knockoff procedure

Vásquez, Alejandro Román; Márquez Urbina, José Ulises; González Farías, Graciela; Escarela, Gabriel

doi:10.1007/s00180-023-01346-4

Original paper
Published: 25 March 2023

Volume 39, pages 1435–1458, (2024)
Computational Statistics

Alejandro Román Vásquez¹,
José Ulises Márquez Urbina¹˒²,
Graciela González Farías¹ (ORCID: orcid.org/0000-0001-7434-500X) &
…
Gabriel Escarela³

Abstract

The penalized Lasso Cox proportional hazards model has been widely used to identify prognostic biomarkers in high-dimensional settings. However, this method tends to select many false positives, which affects its interpretability. To improve reproducibility, we develop a knockoff procedure that consists of wrapping the Lasso Cox model with the model-X knockoff, resulting in a powerful tool for variable selection that controls the false discovery rate with finite-sample guarantees. In this paper, we propose a novel approach to sample valid knockoffs for ordinal and continuous variables whose distributions can be skewed or heavy-tailed. The approach employs a Latent Mixed Gaussian Copula model to account for the dependence structure between the variables, leading to what we call the Latent Gaussian Copula Knockoff (LGCK) procedure. We then combine the LGCK method with the Lasso coefficient difference (LCD) statistic as the importance metric. To our knowledge, our proposal is the first knockoff framework for jointly considering ordinal and continuous data in a non-Gaussian setting and a survival context. We illustrate the effectiveness of the proposed methodology by applying it to a real lung cancer gene expression dataset.

Data availability statement

The real dataset used in this study is available at http://lce.biohpc.swmed.edu/.

Code availability

The code for reproducing the simulation experiments and the real data application is available at https://github.com/AlejandroRomanVasquez/LGCK-LCD.

References

Barber RF, Candès EJ, Samworth RJ (2020) Robust inference with knockoffs. Ann Stat 48(3):1409–1431. https://doi.org/10.1214/19-AOS1852
Bates S, Candès E, Janson L, Wang W (2021) Metropolized knockoff sampling. J Am Stat Assoc 116(535):1413–1427. https://doi.org/10.1080/01621459.2020.1729163
Bender R, Augustin T, Blettner M (2005) Generating survival times to simulate cox proportional hazards models. Stat Med 24(11):1713–1723. https://doi.org/10.1002/sim.2059
Berti P, Dreassi E, Leisen F, Pratelli L, Rigo P (2023) New perspectives on knockoffs construction. J Stat Plan Inference 223:1–14. https://doi.org/10.1016/j.jspi.2022.07.006
Bommert A, Welchowski T, Schmid M, Rahnenführer J (2022) Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform 23(1):bbab354. https://doi.org/10.1093/bib/bbab354
Bourgon R, Gentleman R, Huber W (2010) Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci 107(21):9546–9551. https://doi.org/10.1073/pnas.0914005107
Cai L, Lin S, Girard L, Zhou Y, Yang L, Ci B, Zhou Q, Luo D, Yao B, Tang H et al (2019) Lce: an open web portal to explore gene expression and clinical associations in lung cancer. Oncogene 38(14):2551–2564. https://doi.org/10.1038/s41388-018-0588-2
Candès E, Fan Y, Janson L, Lv J (2018) Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. J R Stat Soc: Ser B (Stat Methodol) 80(3):551–577. https://doi.org/10.1111/rssb.12265
Carroll KJ (2003) On the use and utility of the Weibull model in the analysis of survival data. Control Clin Trials 24(6):682–701. https://doi.org/10.1016/S0197-2456(03)00072-2
Collett D (2015) Modelling survival data in medical research. CRC Press
Dong Y, Li D, Zheng Z, Zhou J (2022) Reproducible feature selection in high-dimensional accelerated failure time models. Stat Prob Lett 181:109275. https://doi.org/10.1016/j.spl.2021.109275
Egger M, Higgins JP, Smith GD (2022) Systematic reviews in health research: meta-analysis in context. Wiley
Fan J, Liu H, Ning Y, Zou H (2017) High dimensional semiparametric latent graphical model for mixed data. J R Stat Soc: Ser B (Stat Methodol) 79(2):405–421. https://doi.org/10.1111/rssb.12168
Feng H, Ning Y (2019) High-dimensional mixed graphical model with ordinal data: parameter estimation and statistical inference. In: The 22nd international conference on artificial intelligence and statistics. PMLR, pp 654–663. https://proceedings.mlr.press/v89/feng19a.html
Foygel R, Drton M (2010) Extended bayesian information criteria for gaussian graphical models. In: Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A (eds) Advances in neural information processing systems, vol 23. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2010/file/072b030ba126b2f4b2374f342be9ed44-Paper.pdf
Goeman JJ (2010) L1 penalized estimation in the cox proportional hazards model. Biom J 52(1):70–84. https://doi.org/10.1002/bimj.200900028
Hackstadt AJ, Hess AM (2009) Filtering for increased power for microarray data analysis. BMC Bioinformat 10(1):1–12. https://doi.org/10.1186/1471-2105-10-11
Huang YJ, Lu TP, Hsiao CK (2020) Application of graphical lasso in estimating network structure in gene set. Ann Transl Med. https://doi.org/10.21037/atm-20-6490
Huang M, Müller CL, Gaynanova I (2021) latentcor: an R package for estimating latent correlations from mixed data types. arXiv preprint arXiv:2108.09180
Hyndman RJ, Fan Y (1996) Sample quantiles in statistical packages. Am Stat 50(4):361–365. https://doi.org/10.1080/00031305.1996.10473566
Jardillier R, Chatelain F, Guyon L (2018) Bioinformatics methods to select prognostic biomarker genes from large scale datasets: a review. Biotechnol J 13(12):1800103. https://doi.org/10.1002/biot.201800103
Jardillier R, Chatelain F, Guyon L (2020) Benchmark of lasso-like penalties in the cox model for tcga datasets reveal improved performance with pre-filtering and wide differences between cancers. bioRxiv https://doi.org/10.1101/2020.03.09.984070
Joe H (2014) Dependence modeling with copulas. CRC Press
Jordon J, Yoon J, van der Schaar M (2018) Knockoffgan: generating knockoffs for feature selection using generative adversarial networks. In: International conference on learning representations
Kattan MW (2003) Comparison of cox regression with other methods for determining prediction models and nomograms. J Urol 170(6S):S6–S10. https://doi.org/10.1097/01.ju.0000094764.56269.2d
Kim HM, Mallick BK (2003) Moments of random vectors with skew t distribution and their quadratic forms. Stat Prob Lett 63(4):417–423. https://doi.org/10.1016/S0167-7152(03)00121-4
Kormaksson M, Kelly LJ, Zhu X, Haemmerle S, Pricop L, Ohlssen D (2021) Sequential knockoffs for continuous and categorical predictors: with application to a large psoriatic arthritis clinical trial pool. Stat Med 40(14):3313–3328. https://doi.org/10.1002/sim.8955
Liu H, Lafferty J, Wasserman L (2009) The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Mach Learn Res 10(80):2295–2328. http://jmlr.org/papers/v10/liu09a.html
Omurlu IK, Ture M, Tokatli F (2009) The comparisons of random survival forests and cox regression analysis with simulation and an application related to breast cancer. Expert Syst Appl 36(4):8582–8588. https://doi.org/10.1016/j.eswa.2008.10.023
Quan X, Booth JG, Wells MT (2018) Rank-based approach for estimating correlations in mixed ordinal data. arXiv preprint arXiv:1809.06255
Roberts S, Nowak G (2014) Stabilizing the lasso against cross-validation variability. Comput Stat Data Anal 70:198–211. https://doi.org/10.1016/j.csda.2013.09.008
Romano Y, Sesia M, Candès E (2020) Deep knockoffs. J Am Stat Assoc 115(532):1861–1872. https://doi.org/10.1080/01621459.2019.1660174
Rousseaux S, Debernardi A, Jacquiau B, Vitte AL, Vesin A, Nagy-Mignotte H, Moro-Sibilot D, Brichon PY, Lantuejoul S, Hainaut P, et al. (2013) Ectopic activation of germline and placental genes identifies aggressive metastasis-prone lung cancers. Sci Transl Med 5(186):186ra66–186ra66. https://doi.org/10.1126/scitranslmed.3005723
Schaipp F, Müller CL, Vlasovets O (2021) Gglasso—a python package for general graphical lasso computation. arXiv preprint arXiv:2110.10521
Scott A, Salgia R (2008) Biomarkers in lung cancer: from early detection to novel therapeutics and decision making. Biomark Med. https://doi.org/10.2217/17520363.2.6.577
Sechidis K, Kormaksson M, Ohlssen D (2021) Using knockoffs for controlled predictive biomarker identification. Stat Med 40(25):5453–5473. https://doi.org/10.1002/sim.9134
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1. https://doi.org/10.18637/jss.v039.i05
Spector A, Janson L (2022) Powerful knockoffs via minimizing reconstructability. Ann Stat 50(1):252–276. https://doi.org/10.1214/21-AOS2104
Sudarshan M, Tansey W, Ranganath R (2020) Deep direct likelihood knockoffs. Adv Neural Inf Process Syst 33:5036–5046
Ternès N, Rotolo F, Michiels S (2016) Empirical extensions of the lasso penalty to reduce the false discovery rate in high-dimensional cox regression models. Stat Med 35(15):2561–2573. https://doi.org/10.1002/sim.6927
Tibshirani R (1997) The lasso method for variable selection in the cox model. Stat Med 16(4):385–395. https://doi.org/10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
Xiang A, Lapuerta P, Ryutov A, Buckley J, Azen S (2000) Comparison of the performance of neural network methods and cox regression for censored survival data. Comput Stat Data Anal 34(2):243–257. https://doi.org/10.1016/S0167-9473(99)00098-5
Xue L, Zou H (2012) Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann Stat 40(5):2541–2571. https://doi.org/10.1214/12-AOS1041
Yoon G, Carroll RJ, Gaynanova I (2020) Sparse semiparametric canonical correlation analysis for data of mixed types. Biometrika 107(3):609–625. https://doi.org/10.1093/biomet/asaa007
Yoon G, Müller CL, Gaynanova I (2021) Fast computation of latent correlations. J Comput Graph Stat 30(4):1249–1256. https://doi.org/10.1080/10618600.2021.1882468
Zhao H, Duan ZH (2019) Cancer genetic network inference using gaussian graphical models. Bioinform Biol Insights 13:1177932219839402. https://doi.org/10.1177/1177932219839402


Acknowledgements

Alejandro Román Vásquez acknowledges a grant from Consejo Nacional de Ciencia y Tecnología (CONACyT) Estancias Posdoctorales por México 2021 at Centro de Investigación en Matemáticas. Additionally, all the authors would like to thank Miguel Bedolla for revising the manuscript.

Author information

Authors and Affiliations

Centro de Investigación en Matemáticas A.C., Unidad Monterrey, 66629, Monterrey, Nuevo León, Mexico
Alejandro Román Vásquez, José Ulises Márquez Urbina & Graciela González Farías
Consejo Nacional de Ciencia y Tecnología, 03940, Mexico City, Mexico
José Ulises Márquez Urbina
Departamento de Matemáticas, Universidad Autónoma Metropolitana, Iztapalapa, 09340, Mexico City, Mexico
Gabriel Escarela

Corresponding author

Correspondence to Graciela González Farías.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Computational implementations of the proposed methods

In the following, we describe specific aspects of the programming implementation of the LGCK procedure. We also clarify the estimation of the Lasso coefficient-difference (LCD) statistic and the computation of the data-dependent threshold.

Step 1: estimation of the latent correlation matrix. The latent correlation matrix $\boldsymbol{\Sigma}$ can be estimated using the latentcor() function from the latentcor R package (Huang et al. 2021). It works for mixed data, ordinal and continuous, under the Latent Mixed Gaussian Copula model. The original computation scheme (Yoon et al. 2021) is selected by setting the argument method = "original". For the ordinal case, the estimation method in latentcor() is limited to binary and ternary data types. Thus, an ordinal variable with four or more levels must be treated as continuous. The bridge functions F for each combination of variable types (binary-continuous, ternary-continuous, binary-binary, binary-ternary, and ternary-ternary) and the equations for the estimators of the cutoffs D for binary and ternary variables can be found in the mathematical framework of the latentcor package.
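As a rough illustration of the rank-based idea behind Step 1, the sketch below covers only the continuous-continuous case, where the bridge function has the well-known closed form $\sin(\pi\tau/2)$ in terms of Kendall's $\tau$. It is written in plain Python rather than R, the function names are hypothetical, and it ignores ties; it is not the latentcor implementation, and the binary and ternary bridge functions are not shown.

```python
import math
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
        score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


def latent_corr_continuous(x, y):
    """Latent correlation for a continuous pair via sin(pi/2 * tau).

    Assumption: both variables are continuous and free of ties.
    """
    return math.sin(math.pi / 2 * kendall_tau_a(x, y))


# Hypothetical toy data: y is a monotone transform of x, so tau = 1
x = [1.0, 2.0, 3.0, 4.0]
y = [v ** 2 for v in x]
print(latent_corr_continuous(x, y))
```

Because Kendall's $\tau$ depends only on ranks, the estimate is unchanged by monotone transformations of the margins, which is what makes it suitable for skewed or heavy-tailed variables.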

Step 2: estimation of the precision matrix of the latent correlation matrix. The Python package GGLasso is a helpful library for solving general graphical Lasso problems (Schaipp et al. 2021). Its main class, glasso_problem, performs this task with a simplified procedure. Creating a glasso_problem object requires an empirical covariance matrix $\boldsymbol{S}$ and the sample size $n$ as arguments. The optimal value of the penalization parameter is determined using the function model_selection(), which implements a grid search based on the extended BIC criterion (Foygel and Drton 2010).
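For orientation, the two quantities involved in Step 2 can be written as follows. The notation is an assumption made here, not taken from the paper: $\hat{\boldsymbol{\Theta}}_\lambda$ is the graphical Lasso estimate at penalty $\lambda$, $|E_\lambda|$ is the number of nonzero off-diagonal pairs (edges) in $\hat{\boldsymbol{\Theta}}_\lambda$, and $\gamma \in [0,1]$ is the extended BIC tuning constant. The criterion is shown evaluated at the penalized estimate, which is how it is commonly applied in a grid search; the exact convention used by model_selection() is not claimed.

```latex
\hat{\boldsymbol{\Theta}}_\lambda
  = \arg\min_{\boldsymbol{\Theta} \succ 0}
    \left\{ -\log\det\boldsymbol{\Theta}
            + \operatorname{tr}(\boldsymbol{S}\boldsymbol{\Theta})
            + \lambda \sum_{j \neq k} |\Theta_{jk}| \right\},
\qquad
\mathrm{eBIC}_\gamma(\lambda)
  = -2\,\ell_n(\hat{\boldsymbol{\Theta}}_\lambda)
    + |E_\lambda| \log n
    + 4\gamma\, |E_\lambda| \log p,
\qquad
\ell_n(\boldsymbol{\Theta})
  = \frac{n}{2}\left[ \log\det\boldsymbol{\Theta}
    - \operatorname{tr}(\boldsymbol{S}\boldsymbol{\Theta}) \right].
```

The grid search picks the $\lambda$ with the smallest $\mathrm{eBIC}_\gamma(\lambda)$; setting $\gamma = 0$ recovers the ordinary BIC, while larger values penalize dense graphs more strongly when $p$ is large.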

Step 3: nonparametric transformation strategy to obtain marginal normality. The quantities involved in this step can be estimated with functions from the standard R distribution. Concretely, the function ecdf() from the stats package in R can be used to compute $\hat{F}_j$.
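A minimal Python sketch of this transformation is given below, mirroring what ecdf() followed by the standard normal quantile function would do. The function names are hypothetical, and the rescaling of $\hat{F}_j$ by $n/(n+1)$, which keeps the largest observation away from probability 1 (and hence from an infinite normal score), is an assumption of this sketch rather than a detail confirmed by the paper.

```python
import bisect
from statistics import NormalDist


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function t -> P_n(X <= t)."""
    xs = sorted(sample)
    n = len(xs)

    def F_hat(t):
        # Number of observations <= t, divided by the sample size
        return bisect.bisect_right(xs, t) / n

    return F_hat


def normal_scores(sample):
    """Map each observation to the Gaussian scale via Phi^{-1}(F_hat(x)).

    Assumption: F_hat is rescaled by n / (n + 1) so that the maximum
    is not mapped to +infinity.
    """
    n = len(sample)
    F_hat = ecdf(sample)
    std_normal = NormalDist()
    return [std_normal.inv_cdf(F_hat(x) * n / (n + 1)) for x in sample]


# Hypothetical skewed toy sample
print(normal_scores([3.0, 1.0, 2.0, 50.0]))
```

The transformation is monotone, so the ranks of the observations are preserved while the margin becomes approximately standard normal.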

Step 4: sampling Gaussian knockoffs using the minimum reconstructability (MRC) approach. The Knockpy Python package can be employed to sample MVR knockoffs. This versatile library makes it easy to apply knockoff-based inference in only a few lines of code. A GaussianSampler object needs to be created using the transformed vector $\boldsymbol{X}^{\text{norm}}$, the estimated latent correlation matrix $\hat{\boldsymbol{\Sigma}}_{\text{Lasso}}$, and the argument method set to "mvr", in order to sample Gaussian minimum variance-based reconstructability (MVR) knockoffs $\tilde{\boldsymbol{X}}^{\text{norm}}$.
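For context, the standard zero-mean Gaussian model-X construction (Candès et al. 2018) that a Gaussian sampler of this kind builds on can be summarized as below, together with the MVR criterion of Spector and Janson (2022). The diagonal matrix $\boldsymbol{D} = \operatorname{diag}(s_1, \dots, s_p)$ and the joint covariance $\boldsymbol{G}_{\boldsymbol{D}}$ are notation introduced here for illustration, and the sketch assumes $\boldsymbol{X}^{\text{norm}}$ is treated as a zero-mean column vector with covariance $\hat{\boldsymbol{\Sigma}}_{\text{Lasso}}$; no claim is made about Knockpy's internal parameterization.

```latex
\tilde{\boldsymbol{X}}^{\text{norm}} \mid \boldsymbol{X}^{\text{norm}}
  \sim N\!\left(
    \left(\boldsymbol{I} - \boldsymbol{D}\,\hat{\boldsymbol{\Sigma}}_{\text{Lasso}}^{-1}\right)
      \boldsymbol{X}^{\text{norm}},\;
    2\boldsymbol{D} - \boldsymbol{D}\,\hat{\boldsymbol{\Sigma}}_{\text{Lasso}}^{-1}\boldsymbol{D}
  \right),
\qquad
\boldsymbol{0} \preceq \boldsymbol{D} \preceq 2\hat{\boldsymbol{\Sigma}}_{\text{Lasso}},
\\[6pt]
\boldsymbol{D}_{\text{MVR}}
  = \arg\min_{\boldsymbol{D}} \operatorname{tr}\!\left(\boldsymbol{G}_{\boldsymbol{D}}^{-1}\right),
\qquad
\boldsymbol{G}_{\boldsymbol{D}}
  = \begin{pmatrix}
      \hat{\boldsymbol{\Sigma}}_{\text{Lasso}} & \hat{\boldsymbol{\Sigma}}_{\text{Lasso}} - \boldsymbol{D} \\
      \hat{\boldsymbol{\Sigma}}_{\text{Lasso}} - \boldsymbol{D} & \hat{\boldsymbol{\Sigma}}_{\text{Lasso}}
    \end{pmatrix}.
```

Here $\boldsymbol{G}_{\boldsymbol{D}}$ is the joint covariance of $(\boldsymbol{X}^{\text{norm}}, \tilde{\boldsymbol{X}}^{\text{norm}})$, so minimizing the trace of its inverse makes each variable as hard as possible to reconstruct from the remaining variables and the knockoffs.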

Step 5: reversing the transformation to obtain the non-Gaussian knockoffs. The sample quantiles can be computed using the function quantile() from the stats package in R. The nine quantile types described in Hyndman and Fan (1996) can be set through the argument type, where the recommended median-unbiased type is selected with the value 8. The transformation to obtain the binary and ternary knockoffs can be done using conditional statements.
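The back-transformation for a continuous variable can be sketched in Python as follows. The quantile function implements the type 8 definition of Hyndman and Fan (1996), with the order-statistic position $1/3 + p\,(n + 1/3)$ and linear interpolation, and is meant to mirror quantile(..., type = 8) in R. The function names are hypothetical, and the handling of binary and ternary variables through cutoffs is deliberately left out.

```python
import math
from statistics import NormalDist


def quantile_type8(sample, p):
    """Hyndman-Fan type 8 (median-unbiased) sample quantile for 0 <= p <= 1."""
    xs = sorted(sample)
    n = len(xs)
    pos = 1 / 3 + p * (n + 1 / 3)  # fractional 1-based order-statistic index
    j = math.floor(pos)
    h = pos - j                    # interpolation weight between x_(j) and x_(j+1)
    # Clamp indices so that x_(0) = x_(1) and x_(n+1) = x_(n)
    lower = xs[min(max(j, 1), n) - 1]
    upper = xs[min(max(j + 1, 1), n) - 1]
    return (1 - h) * lower + h * upper


def to_original_scale(z_knockoffs, sample):
    """Map Gaussian knockoff values back to the scale of the observed `sample`.

    Each z is sent to probability Phi(z), then to the matching sample quantile.
    """
    std_normal = NormalDist()
    return [quantile_type8(sample, std_normal.cdf(z)) for z in z_knockoffs]


# Hypothetical toy data: observed sample and three Gaussian knockoff values
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
print(to_original_scale([-1.0, 0.0, 1.5], observed))
```

Since the output is always a sample quantile of the observed variable, the knockoffs inherit the skewness and tail behaviour of the original margin.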

LCD statistic and data-dependent threshold. The Lasso Cox model may be trained using the glmnet R package. The Lasso coefficient-difference (LCD) statistic can then be computed directly from the fitted coefficients. The data-dependent threshold can be calculated using the function data_dependent_threshhold() from Python's knockpy package, which completes the knockoff methodology.
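To make these last two computations concrete, the sketch below assumes the Lasso Cox fit on the augmented design has already produced a coefficient vector for the original variables and one for their knockoffs. It computes $W_j = |\hat{\beta}_j| - |\hat{\beta}_{j+p}|$ and the knockoff+ threshold of Candès et al. (2018), that is, the smallest $t > 0$ with $(1 + \#\{j: W_j \le -t\}) / \max(1, \#\{j: W_j \ge t\}) \le q$. The function names and the toy coefficients are hypothetical, and this is a plain re-implementation for illustration, not the knockpy code.

```python
import math


def lcd_statistics(beta, beta_knockoff):
    """Lasso coefficient-difference statistics W_j = |beta_j| - |beta_knockoff_j|."""
    return [abs(b) - abs(bk) for b, bk in zip(beta, beta_knockoff)]


def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t among the nonzero |W_j| whose estimated FDP is at most q.

    offset=1 gives the knockoff+ threshold; returns inf if no t qualifies.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(w <= -t for w in W)
        positives = sum(w >= t for w in W)
        if (offset + negatives) / max(positives, 1) <= q:
            return t
    return math.inf


def select_variables(W, q=0.1, offset=1):
    """Indices of the variables whose statistic reaches the threshold."""
    tau = knockoff_threshold(W, q, offset)
    return [j for j, w in enumerate(W) if w >= tau]


# Hypothetical fitted coefficients for 6 original variables and their knockoffs
beta = [5.0, -4.0, 3.0, 0.0, 2.0, 0.5]
beta_knockoff = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
W = lcd_statistics(beta, beta_knockoff)
print(knockoff_threshold(W, q=0.3), select_variables(W, q=0.3))
```

A large positive $W_j$ indicates that the original variable entered the model much more strongly than its knockoff; with few variables and a small target level $q$, the threshold can be infinite and nothing is selected.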

B Additional simulation results: low-dimensional case ($p<n$)

In this section, we complement the simulation results by adding line plots for the empirical power and the FDR in a low-dimensional setting ($p<n$). The configurations considered include variations of the correlation coefficient of the autoregressive correlation (Fig. 6), the amplitude (Fig. 7), and the censoring rate (Fig. 8).
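The two quantities plotted in such figures are typically computed per replication and then averaged. A minimal sketch, assuming each replication yields a set of selected indices and that the true support of the simulated model is known, is given below; the function names and the toy replications are hypothetical and not taken from the authors' code.

```python
def fdp_and_power(selected, true_support):
    """False discovery proportion and power for one simulated dataset."""
    selected, true_support = set(selected), set(true_support)
    false_discoveries = len(selected - true_support)
    true_discoveries = len(selected & true_support)
    # Both ratios are defined as 0 when their denominator is empty
    fdp = false_discoveries / max(len(selected), 1)
    power = true_discoveries / max(len(true_support), 1)
    return fdp, power


def empirical_fdr_and_power(selections, true_support):
    """Average FDP (empirical FDR) and average power over replications."""
    results = [fdp_and_power(s, true_support) for s in selections]
    n_rep = len(results)
    fdr = sum(fdp for fdp, _ in results) / n_rep
    power = sum(pw for _, pw in results) / n_rep
    return fdr, power


# Hypothetical selections from three replications; true support is {0, 1, 2, 3}
selections = [[0, 1, 2, 4], [0, 1], []]
print(empirical_fdr_and_power(selections, [0, 1, 2, 3]))
```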

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Vásquez, A.R., Márquez Urbina, J.U., González Farías, G. et al. Controlling the false discovery rate by a Latent Gaussian Copula Knockoff procedure. Comput Stat 39, 1435–1458 (2024). https://doi.org/10.1007/s00180-023-01346-4

Received: 03 November 2022
Accepted: 08 March 2023
Published: 25 March 2023
Issue Date: May 2024
DOI: https://doi.org/10.1007/s00180-023-01346-4
