Abstract
The Lasso-penalized Cox proportional hazards model has been widely used to identify prognostic biomarkers in high-dimensional settings. However, this method tends to select many false positives, which affects its interpretability. To improve reproducibility, we develop a knockoff procedure that wraps the Lasso Cox model with the model-X knockoff framework, resulting in a powerful tool for variable selection that controls the false discovery rate with finite-sample guarantees. In this paper, we propose a novel approach to sample valid knockoffs for ordinal and continuous variables whose distributions can be skewed or heavy-tailed. The approach employs a Latent Mixed Gaussian Copula model to account for the dependence structure between the variables, leading to what we call the Latent Gaussian Copula Knockoff (LGCK) procedure. We then combine the LGCK method with the Lasso coefficient difference (LCD) statistic as the importance metric. To our knowledge, our proposal is the first knockoff framework that jointly considers ordinal and continuous data in a non-Gaussian setting and a survival context. We illustrate the proposed methodology’s effectiveness by applying it to a real lung cancer gene expression dataset.





Data availability statement
The real dataset used in this study is available at http://lce.biohpc.swmed.edu/.
Code availability
The code for reproducing the simulation experiments and the real data application is available at https://github.com/AlejandroRomanVasquez/LGCK-LCD.
References
Barber RF, Candès EJ, Samworth RJ (2020) Robust inference with knockoffs. Ann Stat 48(3):1409–1431. https://doi.org/10.1214/19-AOS1852
Bates S, Candès E, Janson L, Wang W (2021) Metropolized knockoff sampling. J Am Stat Assoc 116(535):1413–1427. https://doi.org/10.1080/01621459.2020.1729163
Bender R, Augustin T, Blettner M (2005) Generating survival times to simulate cox proportional hazards models. Stat Med 24(11):1713–1723. https://doi.org/10.1002/sim.2059
Berti P, Dreassi E, Leisen F, Pratelli L, Rigo P (2023) New perspectives on knockoffs construction. J Stat Plan Inference 223:1–14. https://doi.org/10.1016/j.jspi.2022.07.006
Bommert A, Welchowski T, Schmid M, Rahnenführer J (2022) Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform 23(1):bbab354. https://doi.org/10.1093/bib/bbab354
Bourgon R, Gentleman R, Huber W (2010) Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci 107(21):9546–9551. https://doi.org/10.1073/pnas.0914005107
Cai L, Lin S, Girard L, Zhou Y, Yang L, Ci B, Zhou Q, Luo D, Yao B, Tang H et al (2019) LCE: an open web portal to explore gene expression and clinical associations in lung cancer. Oncogene 38(14):2551–2564. https://doi.org/10.1038/s41388-018-0588-2
Candès E, Fan Y, Janson L, Lv J (2018) Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J R Stat Soc: Ser B (Stat Methodol) 80(3):551–577. https://doi.org/10.1111/rssb.12265
Carroll KJ (2003) On the use and utility of the Weibull model in the analysis of survival data. Control Clin Trials 24(6):682–701. https://doi.org/10.1016/S0197-2456(03)00072-2
Collett D (2015) Modelling survival data in medical research. CRC Press
Dong Y, Li D, Zheng Z, Zhou J (2022) Reproducible feature selection in high-dimensional accelerated failure time models. Stat Prob Lett 181:109275. https://doi.org/10.1016/j.spl.2021.109275
Egger M, Higgins JP, Smith GD (2022) Systematic reviews in health research: meta-analysis in context. Wiley
Fan J, Liu H, Ning Y, Zou H (2017) High dimensional semiparametric latent graphical model for mixed data. J R Stat Soc: Ser B (Stat Methodol) 79(2):405–421. https://doi.org/10.1111/rssb.12168
Feng H, Ning Y (2019) High-dimensional mixed graphical model with ordinal data: parameter estimation and statistical inference. In: The 22nd international conference on artificial intelligence and statistics. PMLR, pp 654–663. https://proceedings.mlr.press/v89/feng19a.html
Foygel R, Drton M (2010) Extended bayesian information criteria for gaussian graphical models. In: Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A (eds) Advances in neural information processing systems, vol 23. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2010/file/072b030ba126b2f4b2374f342be9ed44-Paper.pdf
Goeman JJ (2010) L1 penalized estimation in the cox proportional hazards model. Biom J 52(1):70–84. https://doi.org/10.1002/bimj.200900028
Hackstadt AJ, Hess AM (2009) Filtering for increased power for microarray data analysis. BMC Bioinformat 10(1):1–12. https://doi.org/10.1186/1471-2105-10-11
Huang YJ, Lu TP, Hsiao CK (2020) Application of graphical lasso in estimating network structure in gene set. Ann Transl Med. https://doi.org/10.21037/atm-20-6490
Huang M, Müller CL, Gaynanova I (2021) latentcor: an R package for estimating latent correlations from mixed data types. arXiv preprint arXiv:2108.09180
Hyndman RJ, Fan Y (1996) Sample quantiles in statistical packages. Am Stat 50(4):361–365. https://doi.org/10.1080/00031305.1996.10473566
Jardillier R, Chatelain F, Guyon L (2018) Bioinformatics methods to select prognostic biomarker genes from large scale datasets: a review. Biotechnol J 13(12):1800103. https://doi.org/10.1002/biot.201800103
Jardillier R, Chatelain F, Guyon L (2020) Benchmark of lasso-like penalties in the cox model for tcga datasets reveal improved performance with pre-filtering and wide differences between cancers. bioRxiv https://doi.org/10.1101/2020.03.09.984070
Joe H (2014) Dependence modeling with copulas. CRC Press
Jordon J, Yoon J, van der Schaar M (2018) Knockoffgan: generating knockoffs for feature selection using generative adversarial networks. In: International conference on learning representations
Kattan MW (2003) Comparison of cox regression with other methods for determining prediction models and nomograms. J Urol 170(6S):S6–S10. https://doi.org/10.1097/01.ju.0000094764.56269.2d
Kim HM, Mallick BK (2003) Moments of random vectors with skew t distribution and their quadratic forms. Stat Prob Lett 63(4):417–423. https://doi.org/10.1016/S0167-7152(03)00121-4
Kormaksson M, Kelly LJ, Zhu X, Haemmerle S, Pricop L, Ohlssen D (2021) Sequential knockoffs for continuous and categorical predictors: with application to a large psoriatic arthritis clinical trial pool. Stat Med 40(14):3313–3328. https://doi.org/10.1002/sim.8955
Liu H, Lafferty J, Wasserman L (2009) The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J Mach Learn Res 10(80):2295–2328. http://jmlr.org/papers/v10/liu09a.html
Omurlu IK, Ture M, Tokatli F (2009) The comparisons of random survival forests and cox regression analysis with simulation and an application related to breast cancer. Expert Syst Appl 36(4):8582–8588. https://doi.org/10.1016/j.eswa.2008.10.023
Quan X, Booth JG, Wells MT (2018) Rank-based approach for estimating correlations in mixed ordinal data. arXiv preprint arXiv:1809.06255
Roberts S, Nowak G (2014) Stabilizing the lasso against cross-validation variability. Comput Stat Data Anal 70:198–211. https://doi.org/10.1016/j.csda.2013.09.008
Romano Y, Sesia M, Candès E (2020) Deep knockoffs. J Am Stat Assoc 115(532):1861–1872. https://doi.org/10.1080/01621459.2019.1660174
Rousseaux S, Debernardi A, Jacquiau B, Vitte AL, Vesin A, Nagy-Mignotte H, Moro-Sibilot D, Brichon PY, Lantuejoul S, Hainaut P et al (2013) Ectopic activation of germline and placental genes identifies aggressive metastasis-prone lung cancers. Sci Transl Med 5(186):186ra66. https://doi.org/10.1126/scitranslmed.3005723
Schaipp F, Müller CL, Vlasovets O (2021) Gglasso—a python package for general graphical lasso computation. arXiv preprint arXiv:2110.10521
Scott A, Salgia R (2008) Biomarkers in lung cancer: from early detection to novel therapeutics and decision making. Biomark Med. https://doi.org/10.2217/17520363.2.6.577
Sechidis K, Kormaksson M, Ohlssen D (2021) Using knockoffs for controlled predictive biomarker identification. Stat Med 40(25):5453–5473. https://doi.org/10.1002/sim.9134
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1. https://doi.org/10.18637/jss.v039.i05
Spector A, Janson L (2022) Powerful knockoffs via minimizing reconstructability. Ann Stat 50(1):252–276. https://doi.org/10.1214/21-AOS2104
Sudarshan M, Tansey W, Ranganath R (2020) Deep direct likelihood knockoffs. Adv Neural Inf Process Syst 33:5036–5046
Ternès N, Rotolo F, Michiels S (2016) Empirical extensions of the lasso penalty to reduce the false discovery rate in high-dimensional cox regression models. Stat Med 35(15):2561–2573. https://doi.org/10.1002/sim.6927
Tibshirani R (1997) The lasso method for variable selection in the cox model. Stat Med 16(4):385–395. https://doi.org/10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
Xiang A, Lapuerta P, Ryutov A, Buckley J, Azen S (2000) Comparison of the performance of neural network methods and cox regression for censored survival data. Comput Stat Data Anal 34(2):243–257. https://doi.org/10.1016/S0167-9473(99)00098-5
Xue L, Zou H (2012) Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann Stat 40(5):2541–2571. https://doi.org/10.1214/12-AOS1041
Yoon G, Carroll RJ, Gaynanova I (2020) Sparse semiparametric canonical correlation analysis for data of mixed types. Biometrika 107(3):609–625. https://doi.org/10.1093/biomet/asaa007
Yoon G, Müller CL, Gaynanova I (2021) Fast computation of latent correlations. J Comput Graph Stat 30(4):1249–1256. https://doi.org/10.1080/10618600.2021.1882468
Zhao H, Duan ZH (2019) Cancer genetic network inference using gaussian graphical models. Bioinform Biol Insights 13:1177932219839402. https://doi.org/10.1177/1177932219839402
Acknowledgements
Alejandro Román Vásquez acknowledges a grant from Consejo Nacional de Ciencia y Tecnología (CONACyT) Estancias Posdoctorales por México 2021 at Centro de Investigación en Matemáticas. Additionally, all the authors would like to thank Miguel Bedolla for revising the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
A Computational implementations of the proposed methods
In the following, we describe specific aspects of the programming implementation of the LGCK procedure. We also clarify the estimation of the Lasso coefficient-difference (LCD) statistic and the computation of the data-dependent threshold.
Step 1: estimation of the latent correlation matrix. The latent correlation matrix \({\varvec{\varSigma }}\) can be estimated using the latentcor() function from the latentcor R package (Huang et al. 2021). It works for mixed ordinal and continuous data under the Latent Mixed Gaussian Copula model. The original computation scheme (Yoon et al. 2021) is selected by setting the argument method = "original". For the ordinal case, the estimation method in latentcor() is limited to binary and ternary data types. Thus, an ordinal variable with four or more levels must be treated as continuous. The corresponding bridge functions F for the combinations of variables (binary-continuous, ternary-continuous, binary-binary, binary-ternary, and ternary-ternary) and the equations for the estimators of the cutoffs D for binary and ternary variables can be found in the mathematical framework of the latentcor package.
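For the continuous-continuous case, which is not among the combinations listed above, the rank-based estimator has a well-known closed form: the latent correlation is \(\sin (\pi \tau /2)\), where \(\tau\) is Kendall's tau. The following Python sketch illustrates only that case. It is not the latentcor implementation, the function names are hypothetical, and the use of the tau-a variant is an assumption.

```python
import math
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_a(x, y):
    """Kendall's tau-a: mean concordance sign over all pairs of observations."""
    n = len(x)
    total = sum(
        sign(x[i] - x[j]) * sign(y[i] - y[j])
        for i, j in combinations(range(n), 2)
    )
    return 2.0 * total / (n * (n - 1))


def latent_corr_continuous(x, y):
    """Continuous-continuous bridge function: sin(pi / 2 * tau)."""
    return math.sin(math.pi / 2.0 * kendall_tau_a(x, y))


# Toy data: y is a monotone (but strongly nonlinear) transformation of x,
# so the rank-based latent correlation is unaffected by the skewness.
x = [0.1, 0.5, 1.2, 3.4, 9.0]
y = [math.exp(v) for v in x]
print(latent_corr_continuous(x, y))
```

Because the estimator depends on the data only through ranks, it is invariant to monotone marginal transformations, which is what makes it suitable for skewed or heavy-tailed variables.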
Step 2: estimation of the precision matrix of the latent correlation matrix. The Python package GGLasso is a helpful library for solving general graphical Lasso problems (Schaipp et al. 2021). Its main class is glasso_problem, which performs the task at hand with a simplified procedure. Creating a glasso_problem object requires an empirical covariance matrix \({\varvec{S}}\) and the sample size n as arguments. The optimal value of the penalization parameter is determined using the function model_selection(), which implements a grid search based on the extended BIC criterion (Foygel and Drton 2010).
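For reference, the extended BIC of Foygel and Drton (2010) can be written as below. The notation is introduced here for illustration and may differ from the package internals: E denotes the edge set of the estimated graph, \(\hat{{\varvec{\varTheta }}}(E)\) the corresponding precision matrix estimate, \(l_n\) the Gaussian log-likelihood, and \(\gamma \in [0,1]\) the tuning parameter of the criterion.

\[
\text {BIC}_{\gamma }(E) = -2\, l_n\bigl (\hat{{\varvec{\varTheta }}}(E)\bigr ) + |E| \log n + 4\, |E|\, \gamma \log p, \qquad l_n({\varvec{\varTheta }}) = \frac{n}{2}\bigl [\log \det {\varvec{\varTheta }} - \text {tr}({\varvec{S}}{\varvec{\varTheta }})\bigr ]
\]

Setting \(\gamma = 0\) recovers the ordinary BIC, while larger values of \(\gamma\) penalize dense graphs more strongly, which matters when p is large relative to n.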
Step 3: nonparametric transformation strategy to obtain marginal normality. The specific quantities involved in this step can be estimated using base R functions. Concretely, the function ecdf() from the stats package in R can be used to compute \({\hat{F}}_j\).
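The transformation in this step can be sketched in Python with the standard library only. This is an illustration, not the authors' code: the function names are hypothetical, and the rescaling of the empirical CDF by n/(n+1) is an assumption made here so that the largest observation is not mapped to infinity.

```python
from statistics import NormalDist


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function (analogous to R's ecdf())."""
    ordered = sorted(sample)
    n = len(ordered)

    def f_hat(v):
        # Proportion of observations less than or equal to v.
        return sum(1 for s in ordered if s <= v) / n

    return f_hat


def to_normal_scores(sample):
    """Map each observation to Phi^{-1}(F_hat(x)), with F_hat shrunk away from 1."""
    n = len(sample)
    f_hat = ecdf(sample)
    std_normal = NormalDist()
    # Assumption: multiply by n / (n + 1) so that F_hat(max) < 1 and
    # the normal quantile stays finite.
    return [std_normal.inv_cdf(f_hat(v) * n / (n + 1)) for v in sample]


# Toy right-skewed sample; the scores are symmetric around zero.
skewed = [0.2, 0.4, 0.5, 0.9, 1.3, 2.8, 7.5, 21.0, 60.0]
print([round(z, 3) for z in to_normal_scores(skewed)])
```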
Step 4: sampling Gaussian knockoffs using the MRC approach. The knockpy Python package can be employed to sample MVR knockoffs. This versatile library makes it easy to apply knockoff-based inference in only a few lines of code. A GaussianSampler object needs to be created using the transformed vector \({\varvec{X}}^{\text {norm} }\) and the estimated latent correlation matrix \({\varvec{{\hat{\varSigma }}}}_\text {Lasso}\), with method set to "mvr", in order to sample Gaussian minimum variance-based reconstructability (MVR) knockoffs \(\tilde{{\varvec{X}}}^{\text {norm} }\).
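As a reminder of what such a sampler draws from, the standard Gaussian model-X construction (Candès et al. 2018) can be stated as follows. The notation is introduced here for illustration: \({\varvec{x}}\) is one observation written as a column vector with distribution \(N({\varvec{0}}, {\varvec{\varSigma }})\), and \({\varvec{D}} = \text {diag}({\varvec{s}})\) is a diagonal matrix satisfying \(2{\varvec{\varSigma }} - {\varvec{D}} \succeq 0\).

\[
\tilde{{\varvec{x}}} \mid {\varvec{x}} \sim N\bigl ({\varvec{x}} - {\varvec{D}}{\varvec{\varSigma }}^{-1}{\varvec{x}},\; 2{\varvec{D}} - {\varvec{D}}{\varvec{\varSigma }}^{-1}{\varvec{D}}\bigr ), \qquad {\varvec{G}}_{{\varvec{s}}} = \begin{pmatrix} {\varvec{\varSigma }} & {\varvec{\varSigma }} - {\varvec{D}} \\ {\varvec{\varSigma }} - {\varvec{D}} & {\varvec{\varSigma }} \end{pmatrix}
\]

Here \({\varvec{G}}_{{\varvec{s}}}\) is the joint covariance of \(({\varvec{x}}, \tilde{{\varvec{x}}})\). In the MVR approach of Spector and Janson (2022), the vector \({\varvec{s}}\) is chosen to minimize \(\text {tr}({\varvec{G}}_{{\varvec{s}}}^{-1})\), which makes each feature hard to reconstruct from the remaining features and the knockoffs.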
Step 5: reversing the transformation to obtain the non-Gaussian knockoffs. The sample quantiles can be computed using the function quantile() from the stats package in R. The nine quantile types described in Hyndman and Fan (1996) can be set through the argument type, where the recommended median-unbiased type is selected using the number 8. The transformation to obtain the binary and ternary knockoffs can be done using conditional statements.
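The reverse transformation can be illustrated with a small standard-library Python sketch. It re-implements the type 8 sample quantile of Hyndman and Fan (1996) rather than calling R, and all function names are hypothetical. The cutoffs passed to the binary and ternary helpers are assumed to be the ones estimated in Step 1, and the direction of the inequalities is an assumption of this sketch.

```python
import math
from statistics import NormalDist


def quantile_type8(sample, p):
    """Hyndman-Fan type 8 (median-unbiased) sample quantile for 0 <= p <= 1."""
    x = sorted(sample)
    n = len(x)
    pos = 1.0 / 3.0 + p * (n + 1.0 / 3.0)  # fractional 1-based position
    j = math.floor(pos + 1e-12)            # small fuzz guards against rounding
    g = max(pos - j, 0.0)                   # interpolation weight
    lo = x[min(max(j, 1), n) - 1]           # x_(j), clamped to the sample range
    hi = x[min(max(j + 1, 1), n) - 1]       # x_(j+1), clamped likewise
    return (1.0 - g) * lo + g * hi


def to_original_scale(knockoff_scores, observed):
    """Map Gaussian knockoff values back through Phi and the sample quantiles."""
    std_normal = NormalDist()
    return [quantile_type8(observed, std_normal.cdf(z)) for z in knockoff_scores]


def to_binary(latent_value, cutoff):
    """Binary knockoff: 1 if the latent Gaussian value exceeds the cutoff."""
    return 1 if latent_value > cutoff else 0


def to_ternary(latent_value, cutoff_low, cutoff_high):
    """Ternary knockoff obtained from two ordered cutoffs."""
    if latent_value <= cutoff_low:
        return 0
    if latent_value <= cutoff_high:
        return 1
    return 2


observed = [0.2, 0.4, 0.5, 0.9, 1.3, 2.8, 7.5, 21.0, 60.0]
print(to_original_scale([-1.0, 0.0, 1.5], observed))
print(to_binary(0.3, 0.0), to_ternary(0.3, -0.5, 0.5))
```

Since the quantile function interpolates between order statistics, the continuous knockoffs stay within the observed range of each variable and inherit its skewness.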
LCD statistic and data-dependent threshold. The Lasso Cox model may be fitted using an R package for penalized Cox regression such as glmnet (Simon et al. 2011). The Lasso coefficient-difference (LCD) statistic can then be computed directly from the fitted coefficients. The data-dependent threshold can be calculated using the function data_dependent_threshhold() from Python's knockpy package, which completes the knockoff methodology.
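The two computations in this last step are short enough to sketch in plain Python. This is a re-implementation for illustration, not the knockpy function itself: the function names are hypothetical, the coefficients are made-up toy values, and the threshold assumes the knockoff+ form with an offset of 1.

```python
def lcd_statistics(beta_original, beta_knockoff):
    """LCD statistic W_j = |beta_j| - |beta_tilde_j| for each variable j."""
    return [abs(b) - abs(bk) for b, bk in zip(beta_original, beta_knockoff)]


def knockoff_plus_threshold(w, q):
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    candidates = sorted({abs(v) for v in w if v != 0})
    for t in candidates:
        negatives = sum(1 for v in w if v <= -t)
        positives = sum(1 for v in w if v >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return float("inf")  # no threshold achieves the target level


def select_variables(w, q):
    """Indices of the variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, v in enumerate(w) if v >= t]


# Toy Lasso Cox coefficients for 10 variables and their knockoff copies.
beta = [1.5, 0.0, 0.9, 0.0, 1.1, 0.7, 0.0, 0.2, 1.3, 0.8]
beta_knockoff = [0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0]

w = lcd_statistics(beta, beta_knockoff)
print(select_variables(w, q=0.2))
```

A large positive \(W_j\) indicates that the original variable entered the model much more strongly than its knockoff, while negative values are used to estimate the number of false discoveries among the positive ones.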
B Additional simulation results: low dimensional case (\(p<n\))
In this section, we complement the simulation results by adding line plots for the empirical power and the FDR in a low-dimensional setting (\(p<n\)). The configurations considered include variations of the correlation coefficient of the autoregressive correlation (Fig. 6), the amplitude (Fig. 7), and the censoring rate (Fig. 8).
Fig. 6 The figures illustrate the empirical power and the FDR as a function of the autocorrelation coefficient \(\rho\) in the low dimensional case (\(p<n\)). The results corresponding to the LGCK-LCD procedure appear in orange, while the results for the Lasso Cox model are in blue. The parameter conditions are the same as in Fig. 2, except for the number of variables \(p=200\). Each point in the graphs represents the average value across 200 repetitions.
Fig. 7 The graphs show the empirical power and the FDR as a function of the absolute value |a| of the coefficient’s amplitude in the low dimensional case (\(p<n\)). The results corresponding to the LGCK-LCD procedure appear in orange, while the results for the Lasso Cox model are in blue. The parameter conditions are the same as in Fig. 3, except for the number of variables \(p=200\). Each point in the graphs represents the average value across 200 repetitions.
Fig. 8 The plots present the empirical power and the FDR as the censoring rate changes in the low dimensional case (\(p<n\)). The results corresponding to the LGCK-LCD procedure appear in orange, while the results for the Lasso Cox model are in blue. The parameter conditions are the same as in Fig. 4, except for the number of variables \(p=200\). Each point in the graphs represents the average value across 200 repetitions.
About this article
Cite this article
Vásquez, A.R., Márquez Urbina, J.U., González Farías, G. et al. Controlling the false discovery rate by a Latent Gaussian Copula Knockoff procedure. Comput Stat 39, 1435–1458 (2024). https://doi.org/10.1007/s00180-023-01346-4
DOI: https://doi.org/10.1007/s00180-023-01346-4