Abstract
As high-throughput technologies rapidly develop, multiple types of genomic data have become available within and across studies. Using all of these data types to infer disease-associated genetic information has become a challenging task in modern statistical research. In this work, we propose an integrative analysis of multiple types of genomic data, clinical covariates and survival data under the framework of an accelerated failure time model with frailty. The proposed integrative approach addresses part of this complex problem by finding relevant genomic features and inferring patients’ survival time from the identified features. It uses a weighted least-squares objective function with a sparse group LASSO penalty to simultaneously estimate and select the relevant features. Extensive simulation studies are conducted to assess the performance of the proposed method with two types of genomic data, DNA methylation data and copy number variation data, on 600 genes and three clinical covariates. The simulation results show the promise of the proposed method. The proposed method is applied to the analysis of The Cancer Genome Atlas data on glioblastoma, a lethal brain cancer, and biologically interpretable results are obtained.
Acknowledgements
The authors would like to thank the anonymous referees and the editors for their constructive comments, which led to the improvement of this manuscript. Part of the work was done while S. Deng was a postdoctoral fellow at the Medical College of Georgia. H. Shi is a Georgia Cancer Coalition Distinguished Scientist.
Appendix 1: Brief summary of the further data processing procedure
The datasets were further processed as follows.
The methylation data come with a probe file, which gives information on the probes used, including the “probe id”, “gene names”, “chrom”, “chromStart”, “chromEnd” and “strand”.
First, we deleted the probes that are not associated with genes.
In the second step, we grouped the probes by gene. If a group had more than one gene name, we chose a representative name as follows: if one of the names matched the name of the preceding group, we used that name; otherwise, we simply chose one of the names. Of the remaining names, those that matched the following gene group were deleted, while those that differed from the following gene group were retained and appended to the bottom of the original probe file as new groups.
In the third step, we matched the genes in the probe file with those in the CNV data file, resulting in 288984 probes annotated to 18191 common genes.
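The first and third steps can be illustrated with a minimal sketch. It assumes that, after the second step, each probe carries a single representative gene name; the function name, the probe ids and the gene lists below are hypothetical toy values, not the actual data or the authors' code.

```python
from collections import defaultdict


def match_probes_to_cnv(probes, cnv_genes):
    """Group methylation probes by gene, keeping only genes present in the CNV data.

    probes    : list of (probe_id, gene_name) pairs; gene_name may be None or ""
                when the probe is not associated with any gene.
    cnv_genes : iterable of gene names found in the CNV data file.
    Returns a dict mapping each common gene to the list of its probe ids.
    """
    cnv_genes = set(cnv_genes)
    groups = defaultdict(list)
    for probe_id, gene in probes:
        if not gene:
            # Step 1: drop probes that are not associated with a gene.
            continue
        if gene in cnv_genes:
            # Step 3: keep only genes that also appear in the CNV data.
            groups[gene].append(probe_id)
    return dict(groups)


# Hypothetical toy input (not taken from the real probe file).
probes = [
    ("cg01", "EGFR"),
    ("cg02", "EGFR"),
    ("cg03", None),
    ("cg04", "PTEN"),
    ("cg05", "TP53"),
]
cnv_genes = ["EGFR", "TP53", "MDM2"]

groups = match_probes_to_cnv(probes, cnv_genes)
n_probes = sum(len(ids) for ids in groups.values())
n_genes = len(groups)
```

On the real data, the analogous counts would be the 288984 probes and 18191 common genes reported above.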
For the CNV data restricted to the common genes, we matched the samples with those in the original clinical data file and obtained the final CNV data file with \(p_2=18191\) genes and \(n_2=519\) samples: \({\varvec{X}}^{2}_{n_2 \times p_2}\) (the second type of genomic data).
For the methylation data, we deleted the “NA” entries, matched the probes in the methylation file with those in the processed probe file, and then matched the samples with those in the original clinical dataset. This resulted in a data set of \(p_1=288984\) probes and \(n_1=95\) samples, used as \({\varvec{X}}^{1}_{n_1\times p_1}\) (the first type of genomic data).
For the clinical covariates to be analyzed with the methylation data, for patient \(i\) \((i=1,\ldots , n_1)\), define \(Z^{1}_{i1}\) as the age at procedure (mean = 61.54, max = 85.6, min = 23.4); \(Z^{1}_{i2}\) as the gender indicator, equal to 1 if the patient is female and 0 otherwise (40 females, 55 males); \(Z^{1}_{i3}\) as the IDH1 status, equal to 1 if the status is “WT”, 2 if it is “R132G”, 3 if it is “R132H” and 4 otherwise (86 “WT”, 1 “R132G”, 4 “R132H”, 4 others); and \(Z^{1}_{i4}\) as the therapy class, equal to 1 for “TMZ Chemoradiation, TMZ Chemo”, 2 for “Nonstandard Radiation”, 3 for “Nonstandard Radiation, TMZ Chemo”, 4 for “Standard Radiation, TMZ Chemo”, 5 for “Standard Radiation”, 6 for “Standard Radiation, Alkylating Chemo”, 7 for “Unspecified Therapy”, 8 for “Unspecified Radiation”, 9 for “Alkylating Chemo” and 10 for “TMZ Chemo” (the numbers of samples are 42, 4, 11, 14, 4, 1, 6, 11, 1, 1, respectively).
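The coding of the four clinical covariates can be sketched as follows. The numeric codes follow the definitions above; the function name, the raw gender string "female" and the example patients are assumptions made for illustration only.

```python
# IDH1 status codes as defined above; any other status is coded 4.
IDH1_CODES = {"WT": 1, "R132G": 2, "R132H": 3}

# Therapy classes in the order listed above, coded 1 to 10.
THERAPY_CLASSES = [
    "TMZ Chemoradiation, TMZ Chemo",
    "Nonstandard Radiation",
    "Nonstandard Radiation, TMZ Chemo",
    "Standard Radiation, TMZ Chemo",
    "Standard Radiation",
    "Standard Radiation, Alkylating Chemo",
    "Unspecified Therapy",
    "Unspecified Radiation",
    "Alkylating Chemo",
    "TMZ Chemo",
]
THERAPY_CODES = {name: code for code, name in enumerate(THERAPY_CLASSES, start=1)}


def encode_patient(age, gender, idh1_status, therapy_class):
    """Return (Z1, Z2, Z3, Z4) for one patient.

    The raw gender value "female" is an assumed spelling of the clinical field.
    """
    z1 = float(age)                         # age at procedure
    z2 = 1 if gender == "female" else 0     # gender indicator
    z3 = IDH1_CODES.get(idh1_status, 4)     # IDH1 status, 4 for all others
    z4 = THERAPY_CODES[therapy_class]       # therapy class, 1 to 10
    return (z1, z2, z3, z4)


# Hypothetical patients, not taken from the clinical data set.
example_a = encode_patient(61.5, "female", "R132H", "Standard Radiation")
example_b = encode_patient(40, "male", "R132C", "TMZ Chemo")
```

A dictionary lookup with a default of 4 mirrors the "otherwise" category for IDH1 status, while an unknown therapy class raises a `KeyError` rather than being silently coded.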
The observed failure time and failure time indicator for the final matched samples can be obtained from the clinical data set accompanying the methylation data. Define \(Y^1_{i}\) as the observed failure time in days (min = 24, max = 1788) and \(\delta ^1_i\) as the event indicator, which is equal to 1 for death and 0 for censoring (the censoring rate is \(57\%\)). For the clinical covariates to be analyzed with the CNV data, for patient \(j\) \((j=1,\ldots , n_2)\), \(Z^2_{j1}, Z^2_{j2}, Z^2_{j3}, Z^2_{j4}, Y^2_j\) and \(\delta ^2_j\) are defined in the same way as above.
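As a small illustration of how the reported summaries relate to \(Y\) and \(\delta\), the sketch below computes the range of the observed times and the censoring rate, taken here as the proportion of samples with \(\delta = 0\). The function name and the toy values are hypothetical.

```python
def summarize_survival(times, events):
    """Summarize observed failure times and event indicators.

    times  : observed failure times in days (Y).
    events : event indicators (delta), 1 for death and 0 for censoring.
    """
    n = len(times)
    if n == 0 or n != len(events):
        raise ValueError("times and events must be non-empty and of equal length")
    n_censored = sum(1 for d in events if d == 0)
    return {
        "n": n,
        "min_time": min(times),
        "max_time": max(times),
        "censoring_rate": n_censored / n,
    }


# Hypothetical toy sample of four patients.
times = [24, 310, 455, 1788]
events = [1, 0, 1, 0]
summary = summarize_survival(times, events)
```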
The summary information for these covariates is as follows. The mean of \(Z^2_{1}\) is 58.25, with maximum 89.30 and minimum 10.90. There are 202 females and 317 males for \(Z^2_{2}\). The IDH1 status is “WT” for 375 samples, “R132G” for 1 sample, “R132H” for 26 samples and “others” for 116 samples. The numbers of samples in the 10 therapy classes are 209, 29, 19, 85, 54, 28, 13, 58, 7 and 4, respectively. The minimum and maximum values of \(Y^2\) are 3 and 3881, and the censoring rate is \(53\%\) for this clinical data set.
Cite this article
Deng, S., Chen, J. & Shi, H. Integrative analysis of multiple types of genomic data using an accelerated failure time frailty model. Comput Stat 36, 1499–1532 (2021). https://doi.org/10.1007/s00180-020-01060-5
DOI: https://doi.org/10.1007/s00180-020-01060-5