
Integrative analysis of multiple types of genomic data using an accelerated failure time frailty model

  • Original paper
  • Journal: Computational Statistics

Abstract

As high-throughput technologies develop rapidly, multiple types of genomic data are becoming available within and across studies. Using all of these data types jointly to infer disease-associated genetic information has become a challenging task in modern statistical research. In this work, we propose an integrative analysis of multiple, different types of genomic data together with clinical covariates and survival data, under the framework of an accelerated failure time model with frailty. The proposed approach addresses part of this complex problem by identifying relevant genomic features and using them to infer patients’ survival times. It takes a weighted least-squares loss with a sparse group LASSO penalty as the objective function, so that the relevant features are estimated and selected simultaneously. Extensive simulation studies are conducted to assess the performance of the proposed method with two types of genomic data, DNA methylation data and copy number variation data, on 600 genes and three clinical covariates. The simulation results are promising. The proposed method is applied to The Cancer Genome Atlas data on glioblastoma, a lethal brain cancer, and biologically interpretable results are obtained.
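To make the objective described in the abstract concrete, the following is a minimal sketch of a weighted least-squares loss combined with a sparse group LASSO penalty. The parameterization (a single tuning parameter `lam`, a mixing parameter `alpha`, and group terms scaled by the square root of the group size) follows one common convention and is an assumption here, not necessarily the form used in the paper. The frailty term is omitted, and the function name, the grouping and the weights are hypothetical.

```python
import math


def sparse_group_lasso_objective(X, y, w, beta, groups, lam, alpha):
    """Weighted least squares plus a sparse group LASSO penalty (sketch).

    X      : list of rows, each a list of feature values
    y      : responses (for an AFT model, typically log survival times)
    w      : non-negative observation weights (assumed given)
    beta   : coefficient vector
    groups : list of lists of column indices, e.g. one list per gene
    lam    : overall penalty strength (assumed parameterization)
    alpha  : mix between the L1 term (alpha) and the group term (1 - alpha)
    """
    # Weighted squared-error loss: 0.5 * sum_i w_i * (y_i - x_i' beta)^2
    loss = 0.0
    for x_i, y_i, w_i in zip(X, y, w):
        fitted = sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        loss += w_i * (y_i - fitted) ** 2
    loss *= 0.5

    # L1 term: encourages sparsity of individual coefficients
    l1_term = sum(abs(b) for b in beta)

    # Group term: encourages whole groups of coefficients to be zero
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )

    return loss + lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Toy usage with two features in a single group (all values are illustrative)
value = sparse_group_lasso_objective(
    X=[[1.0, 0.0], [0.0, 1.0]],
    y=[1.0, 2.0],
    w=[0.5, 0.5],
    beta=[1.0, 0.0],
    groups=[[0, 1]],
    lam=1.0,
    alpha=0.5,
)
```

Setting `alpha` to 1 reduces the penalty to the ordinary LASSO, and setting it to 0 gives the group LASSO, which is why this combined form can select both groups and individual features within them.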



Acknowledgements

The authors would like to thank the anonymous referees and the editors for their constructive comments, which led to improvements in this manuscript. Part of the work was done while S. Deng was a postdoctoral fellow at the Medical College of Georgia. H. Shi is a Georgia Cancer Coalition Distinguished Scientist.

Author information

Corresponding author

Correspondence to Jie Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Brief summary of the further data processing procedure

The datasets were further processed as follows.

The methylation data come with a probe file that describes the probes used, including the fields “probe id”, “gene names”, “chrom”, “chromStart”, “chromEnd” and “strand”.

First, we deleted the probes that are not associated with genes.

In the second step, we grouped the probes by gene. If a group had more than one gene name, we chose a representative as follows: if one of its gene names matched that of the group above, we chose that name; otherwise, we simply chose one of the gene names. Each remaining gene name was then deleted if it matched the gene group below; if it differed from the gene group below, it was retained and appended to the bottom of the original probe file as a new group.

In the third step, we matched the genes in the probe file with those in the CNV data file, resulting in 288984 probes annotated to 18191 common genes.
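The first and third steps can be sketched as a simple filter and set intersection. This is a minimal illustration that assumes the second step has already assigned one representative gene name to each probe. The function name, the probe identifiers and the gene names are hypothetical placeholders, not values from the data.

```python
def match_probes_to_cnv(probes, cnv_genes):
    """Keep probes whose gene also appears in the CNV data (sketch).

    probes    : list of (probe_id, gene_name) pairs; gene_name is None or ""
                when the probe is not associated with any gene
    cnv_genes : iterable of gene names present in the CNV data file
    Returns the retained probes and the sorted list of common genes.
    """
    cnv_gene_set = set(cnv_genes)

    # Step 1: drop probes with no gene; step 3: keep only genes shared with CNV
    kept = [
        (probe_id, gene)
        for probe_id, gene in probes
        if gene and gene in cnv_gene_set
    ]

    # Distinct genes that remain after matching
    common_genes = sorted({gene for _, gene in kept})
    return kept, common_genes


# Toy usage with placeholder identifiers
toy_probes = [
    ("probe_1", "GENE_A"),
    ("probe_2", None),        # not associated with a gene: removed in step 1
    ("probe_3", "GENE_B"),    # gene absent from the CNV data: removed in step 3
    ("probe_4", "GENE_A"),
    ("probe_5", "GENE_C"),
]
toy_cnv_genes = ["GENE_A", "GENE_C", "GENE_D"]
kept_probes, common = match_probes_to_cnv(toy_probes, toy_cnv_genes)
```

Several probes may map to the same gene, which is why the number of retained probes can be much larger than the number of common genes.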

For the CNV data on the common genes, we matched the samples with those in the original clinical data file and obtained the final CNV data file with \(p_2=18191\) genes and \(n_2=519\) samples: \({\varvec{X}}^{2}_{n_2 \times p_2}\) (the second genomic data set).

For the methylation data, we deleted the “NA” entries, matched the probes with those in the processed probe file, and then matched the samples with those in the original clinical data set. This resulted in a data set of \(p_1=288984\) probes and \(n_1=95\) samples, used as \({\varvec{X}}^{1}_{n_1\times p_1}\) (the first genomic data set).

For the clinical covariates to be analyzed with the methylation data, for patient \(i\) \((i=1,\ldots , n_1)\), define:

  • \(Z^{1}_{i1}\): the age at procedure (mean = 61.54, max = 85.6, min = 23.4).
  • \(Z^{1}_{i2}\): the gender indicator, equal to 1 if the patient is female and 0 otherwise (40 females, 55 males).
  • \(Z^{1}_{i3}\): the IDH1 status, equal to 1 for “WT”, 2 for “R132G”, 3 for “R132H” and 4 otherwise (86 “WT”, 1 “R132G”, 4 “R132H”, 4 others).
  • \(Z^{1}_{i4}\): the therapy class, equal to 1 for “TMZ Chemoradiation, TMZ Chemo”, 2 for “Nonstandard Radiation”, 3 for “Nonstandard Radiation, TMZ Chemo”, 4 for “Standard Radiation, TMZ Chemo”, 5 for “Standard Radiation”, 6 for “Standard Radiation, Alkylating Chemo”, 7 for “Unspecified Therapy”, 8 for “Unspecified Radiation”, 9 for “Alkylating Chemo” and 10 for “TMZ Chemo” (the numbers of samples are 42, 4, 11, 14, 4, 1, 6, 11, 1, 1, respectively).
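The coding of the four clinical covariates can be written as a small lookup, shown here as a sketch. The numeric codes are taken from the definitions above; the function name, the argument names and the assumption that the raw gender field contains the string "female" are hypothetical.

```python
# IDH1 status codes as defined above; any other status maps to 4
IDH1_CODES = {"WT": 1, "R132G": 2, "R132H": 3}

# Therapy class codes as defined above
THERAPY_CODES = {
    "TMZ Chemoradiation, TMZ Chemo": 1,
    "Nonstandard Radiation": 2,
    "Nonstandard Radiation, TMZ Chemo": 3,
    "Standard Radiation, TMZ Chemo": 4,
    "Standard Radiation": 5,
    "Standard Radiation, Alkylating Chemo": 6,
    "Unspecified Therapy": 7,
    "Unspecified Radiation": 8,
    "Alkylating Chemo": 9,
    "TMZ Chemo": 10,
}


def encode_patient(age, gender, idh1_status, therapy_class):
    """Return (Z1, Z2, Z3, Z4) for one patient (sketch).

    The raw gender label "female" is an assumed encoding of the source field.
    A therapy class outside the ten listed ones raises KeyError, because the
    text does not define a code for it.
    """
    z1 = float(age)                                      # age at procedure
    z2 = 1 if gender.strip().lower() == "female" else 0  # 1 = female
    z3 = IDH1_CODES.get(idh1_status, 4)                  # 4 = other status
    z4 = THERAPY_CODES[therapy_class]
    return (z1, z2, z3, z4)


# Toy usage with made-up patients
patient_a = encode_patient(61.5, "FEMALE", "R132H", "Standard Radiation")
patient_b = encode_patient(40, "male", "R132C", "TMZ Chemo")
```

Note that these codes are integer labels for categories; whether they enter the model as numeric values or as dummy variables is not specified in this appendix.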

The observed failure time and the failure indicator for the final matched samples can be obtained from the clinical data set that accompanies the methylation data. Define \(Y^1_{i}\) as the observed failure time in days (min = 24, max = 1788) and \(\delta ^1_i\) as the event indicator, which is equal to 1 for death and 0 for censoring (the censoring rate is \(57\%\)). For the clinical covariates to be analyzed with the CNV data, for patient \(j\) \((j=1,\ldots , n_2)\), \(Z^2_{j1}, Z^2_{j2}, Z^2_{j3}, Z^2_{j4}, Y^2_j\) and \(\delta ^2_j\) are defined in the same way as above.
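The pair of observed time and event indicator is what a weighted least-squares fit needs in order to account for censoring. This appendix does not spell out the weights, so the sketch below uses one standard choice, Kaplan-Meier (Stute-type) weights, purely as an assumption. The function name and the toy data are hypothetical.

```python
def kaplan_meier_weights(times, events):
    """Kaplan-Meier (Stute-type) weights for right-censored data (sketch).

    times  : observed failure or censoring times
    events : 1 for an observed death, 0 for censoring
    Returns one weight per observation, in the original order. Each weight
    is the jump of the Kaplan-Meier estimator at that observation, so
    censored observations receive weight 0.
    """
    n = len(times)
    # Sort by time; at tied times, events are placed before censored cases
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))

    weights = [0.0] * n
    survival = 1.0  # Kaplan-Meier survival estimate just before this time
    for rank, idx in enumerate(order, start=1):
        at_risk = n - rank + 1
        weights[idx] = events[idx] / at_risk * survival
        if events[idx] == 1:
            survival *= (at_risk - 1) / at_risk
    return weights


def censoring_rate(events):
    """Fraction of observations that are censored."""
    return 1.0 - sum(events) / len(events)


# Toy usage: four patients, the second one censored at day 3
toy_weights = kaplan_meier_weights([5, 3, 8, 1], [1, 0, 1, 1])
toy_rate = censoring_rate([1, 0, 1, 1])
```

With no censoring, every observation receives weight 1/n and the weighted fit reduces to ordinary least squares. With censoring, the mass of each censored case is redistributed to later events.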

The summary information for these covariates is as follows. The mean of \(Z^2_{1}\) is 58.25, with a maximum of 89.30 and a minimum of 10.90. There are 202 females and 317 males for \(Z^2_{2}\). The IDH1 status of 375 samples is “WT”, that of 1 sample is “R132G”, that of 26 samples is “R132H”, and that of 116 samples is “others”. The numbers of samples for the 10 different therapy classes are 209, 29, 19, 85, 54, 28, 13, 58, 7, 4, respectively. The minimum and maximum values of \(Y^2\) are 3 and 3881, and the censoring rate is \(53\%\) for this clinical data set.

Cite this article

Deng, S., Chen, J. & Shi, H. Integrative analysis of multiple types of genomic data using an accelerated failure time frailty model. Comput Stat 36, 1499–1532 (2021). https://doi.org/10.1007/s00180-020-01060-5
