Integrative analysis of multiple types of genomic data using an accelerated failure time frailty model

Deng, Shirong; Chen, Jie; Shi, Huidong

doi:10.1007/s00180-020-01060-5

Original paper
Published: 03 February 2021

Volume 36, pages 1499–1532, (2021)
Computational Statistics

Abstract

As high-throughput technologies develop rapidly, multiple types of genomic data are becoming available within and across different studies. Using all of these data types to infer disease-prone genetic information has become a challenging task in modern statistical research. In this work, we propose an integrative analysis of multiple, different types of genomic data, clinical covariates and survival data under an accelerated failure time model with frailty. The proposed integrative approach aims to answer some aspects of this complex problem by finding relevant genomic features and inferring patients’ survival time from the identified features. It uses weighted least squares with a sparse group LASSO penalty as the objective function, so that the relevant features are estimated and selected simultaneously. Extensive simulation studies are conducted to assess the performance of the proposed method with two types of genomic data, DNA methylation data and copy number variation data, on 600 genes and three clinical covariates. The simulation results are promising. The proposed method is applied to the Cancer Genome Atlas data on glioblastoma, a lethal brain cancer, and biologically interpretable results are obtained.
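
To make the phrase "weighted least squares with a sparse group LASSO penalty" concrete, here is a minimal Python sketch of such an objective. It is an illustration under stated assumptions, not the authors' implementation: the Kaplan-Meier weights in the style of Stute are assumed as the weighting scheme (a common choice for censored accelerated failure time models), the square-root group-size scaling of the group term is a convention rather than something stated in this abstract, the frailty term is omitted entirely, and all function and parameter names are hypothetical.

```python
import math

def stute_weights(times, events):
    """Kaplan-Meier (Stute-type) weights, returned in the input order.

    Assumed weighting scheme; censored observations get weight 0.
    """
    n = len(times)
    # Sort by time; at tied times, place events before censored observations.
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    weights = [0.0] * n
    surv = 1.0  # running product of ((n - j) / (n - j + 1)) over earlier events
    for rank, i in enumerate(order, start=1):
        if events[i]:
            weights[i] = surv / (n - rank + 1)
            surv *= (n - rank) / (n - rank + 1)
    return weights

def sparse_group_lasso_penalty(beta, groups, lam_group, lam_l1):
    """lam_group * sum_g sqrt(p_g) * ||beta_g||_2 + lam_l1 * ||beta||_1."""
    by_group = {}
    for b, g in zip(beta, groups):
        by_group.setdefault(g, []).append(b)
    group_term = sum(
        math.sqrt(len(bs)) * math.sqrt(sum(b * b for b in bs))
        for bs in by_group.values()
    )
    l1_term = sum(abs(b) for b in beta)
    return lam_group * group_term + lam_l1 * l1_term

def penalized_objective(log_times, events, X, beta, groups, lam_group, lam_l1):
    """Weighted least-squares loss on log survival times plus the penalty.

    The frailty component of the model is deliberately left out of this sketch.
    """
    w = stute_weights(log_times, events)
    loss = 0.0
    for wi, yi, xi in zip(w, log_times, X):
        fitted = sum(xj * bj for xj, bj in zip(xi, beta))
        loss += wi * (yi - fitted) ** 2
    return 0.5 * loss + sparse_group_lasso_penalty(beta, groups, lam_group, lam_l1)
```

In this sketch, `groups` would label each coefficient with its gene, so that the group term can remove a whole gene while the L1 term selects individual features within a gene.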

Acknowledgements

The authors would like to thank the anonymous referees and the editors for their constructive comments, which led to improvements in this manuscript. Part of the work was done while S. Deng was a postdoctoral fellow at the Medical College of Georgia. H. Shi is a Georgia Cancer Coalition Distinguished Scientist.

Author information

Authors and Affiliations

School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, Hubei, China
Shirong Deng
Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, USA
Jie Chen
Department of Biochemistry and Molecular Biology and Georgia Cancer Center, Augusta University, Augusta, GA, USA
Huidong Shi

Corresponding author

Correspondence to Jie Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Brief summary of the further data processing procedure

The datasets were further processed as follows.

From the methylation data, one can obtain a probe file that describes the probes used, including the “probe id”, “gene names”, “chrom”, “chromStart”, “chromEnd” and “strand”.

First, we deleted the probes that were not associated with any gene.

In the second step, we grouped the probes by gene. If a group had more than one gene name, we chose a representative name as follows: if one of the names matched the gene name of the preceding group, we used that name; otherwise, we simply picked one of the names. We then deleted the remaining names if they matched the gene name of the following group; if they did not, we kept them and appended them to the bottom of the original probe file as new groups.

In the third step, we matched the genes in the probe file with those in the CNV data file, resulting in 288984 probes annotated to 18191 common genes.
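
The three steps above can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: the input layout (a list of `(probe_id, gene_names)` pairs in file order) and the function names are hypothetical, and only the rule for choosing a representative gene name is implemented. The handling of the leftover gene names (deleting them or appending them as new groups) is not reproduced here.

```python
def group_probes_by_gene(probes):
    """Group probe ids by a representative gene name.

    probes: list of (probe_id, gene_names) pairs in file order, where
    gene_names is a (possibly empty) list of gene names for that probe.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    prev_gene = None
    for probe_id, gene_names in probes:
        # Step 1: drop probes that are not associated with any gene.
        if not gene_names:
            continue
        # Step 2 (simplified): prefer the gene of the preceding group if it
        # is among this probe's names; otherwise take the first name listed.
        gene = prev_gene if prev_gene in gene_names else gene_names[0]
        groups.setdefault(gene, []).append(probe_id)
        prev_gene = gene
    return groups

def match_to_cnv(groups, cnv_genes):
    """Step 3: keep only the genes that also appear in the CNV data."""
    cnv = set(cnv_genes)
    return {gene: ids for gene, ids in groups.items() if gene in cnv}

# Toy input with made-up probe ids and gene names.
toy_probes = [
    ("cg01", ["A"]),
    ("cg02", []),
    ("cg03", ["B", "A"]),
    ("cg04", ["B"]),
    ("cg05", ["C"]),
]
toy_groups = group_probes_by_gene(toy_probes)
toy_common = match_to_cnv(toy_groups, ["A", "C", "D"])
```

On the toy input, probe `cg03` is assigned to gene `A` because the preceding group is `A`, and gene `B` is dropped in the last step because it is absent from the toy CNV gene list.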

For the CNV data restricted to the common genes, we matched the samples with those in the original clinical data file and obtained the final CNV data file with \(p_2=18191\) genes and \(n_2=519\) samples: \({\varvec{X}}^{2}_{n_2 \times p_2}\) (the second genomic data set).

For the methylation data, we deleted the “NA”s in the file, matched the probes in the methylation file with those in the processed probe file, and then matched the samples with those in the original clinical data set. This resulted in a data set of \(p_1=288984\) probes and \(n_1=95\) samples, used as \({\varvec{X}}^{1}_{n_1\times p_1}\) (the first genomic data set).

For the clinical covariates to be analyzed with the methylation data, define the following for patient \(i (i=1,\ldots , n_1)\). \(Z^{1}_{i1}\) is the age at procedure (mean = 61.54, max = 85.6, min = 23.4). \(Z^{1}_{i2}\) is the gender indicator, equal to 1 if the patient is female and 0 otherwise (40 females, 55 males). \(Z^{1}_{i3}\) is equal to 1 if the IDH1 status is “WT”, 2 if it is “R132G”, 3 if it is “R132H” and 4 otherwise (86 “WT”, 1 “R132G”, 4 “R132H”, 4 others). \(Z^{1}_{i4}\) is equal to 1 if the therapy class is “TMZ Chemoradiation, TMZ Chemo”, 2 if it is “Nonstandard Radiation”, 3 if it is “Nonstandard Radiation, TMZ Chemo”, 4 if it is “Standard Radiation, TMZ Chemo”, 5 if it is “Standard Radiation”, 6 if it is “Standard Radiation, Alkylating Chemo”, 7 if it is “Unspecified Therapy”, 8 if it is “Unspecified Radiation”, 9 if it is “Alkylating Chemo”, and 10 if it is “TMZ Chemo” (the numbers of samples are 42, 4, 11, 14, 4, 1, 6, 11, 1, 1, respectively).
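
The coding scheme above can be written as a small lookup, sketched here in Python. The integer codes follow the text; the function name, the argument names and the raw label "female" for gender are assumptions about how the clinical file might be laid out. This excerpt does not say whether the coded IDH1 status and therapy class enter the model as numeric values or as dummy variables, so the sketch only produces the codes.

```python
# IDH1 status codes from the text; any other status is coded 4.
IDH1_CODES = {"WT": 1, "R132G": 2, "R132H": 3}

# Therapy classes in the order given in the text; codes run from 1 to 10.
THERAPY_CLASSES = [
    "TMZ Chemoradiation, TMZ Chemo",
    "Nonstandard Radiation",
    "Nonstandard Radiation, TMZ Chemo",
    "Standard Radiation, TMZ Chemo",
    "Standard Radiation",
    "Standard Radiation, Alkylating Chemo",
    "Unspecified Therapy",
    "Unspecified Radiation",
    "Alkylating Chemo",
    "TMZ Chemo",
]
THERAPY_CODES = {name: code for code, name in enumerate(THERAPY_CLASSES, start=1)}

def encode_patient(age, gender, idh1_status, therapy_class):
    """Return (Z1, Z2, Z3, Z4) for one patient.

    The raw gender label "female" is an assumed file convention.
    """
    return (
        float(age),
        1 if gender.strip().lower() == "female" else 0,
        IDH1_CODES.get(idh1_status, 4),
        THERAPY_CODES[therapy_class],
    )

example = encode_patient(61.5, "Female", "R132H", "Standard Radiation")
```

An unknown therapy class raises a `KeyError` in this sketch, which makes unexpected labels visible instead of silently assigning them a code.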

The observed failure time and the failure indicator for the final matched samples can be obtained from the clinical data set that accompanies the methylation data. Define \(Y^1_{i}\) as the observed failure time in days (min = 24, max = 1788) and \(\delta ^1_i\) as the event indicator, which is equal to 1 for death and 0 for censoring (the censoring rate is \(57\%\)). For the clinical covariates to be analyzed with the CNV data, for patient \(j (j=1,\ldots , n_2)\), \(Z^2_{j1}, Z^2_{j2}, Z^2_{j3}, Z^2_{j4}, Y^2_j\) and \(\delta ^2_j\) are defined in the same way as above.

The summary information for these covariates is as follows. \(Z^2_{1}\) has mean 58.25, maximum 89.30 and minimum 10.90. For \(Z^2_{2}\), there are 202 females and 317 males. The IDH1 status is “WT” for 375 samples, “R132G” for 1 sample, “R132H” for 26 samples, and “others” for 116 samples. The numbers of samples in the 10 therapy classes are 209, 29, 19, 85, 54, 28, 13, 58, 7, 4, respectively. The minimum and maximum values of \(Y^2\) are 3 and 3881, and the censoring rate is \(53\%\) for this clinical data set.
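
As a quick arithmetic check, the reported category counts can be compared with the stated sample sizes. The Python sketch below uses only the numbers given in this appendix; the dictionary layout and function name are hypothetical. For the methylation cohort every set of counts sums to \(n_1=95\). For the CNV cohort the gender counts sum to \(n_2=519\), while the IDH1 counts sum to 518 and the therapy class counts sum to 506. This excerpt does not explain the difference.

```python
# Counts copied from the appendix text.
reported = {
    "methylation": {
        "n": 95,
        "gender": [40, 55],
        "idh1": [86, 1, 4, 4],
        "therapy": [42, 4, 11, 14, 4, 1, 6, 11, 1, 1],
    },
    "cnv": {
        "n": 519,
        "gender": [202, 317],
        "idh1": [375, 1, 26, 116],
        "therapy": [209, 29, 19, 85, 54, 28, 13, 58, 7, 4],
    },
}

def shortfalls(summary):
    """Return n minus the category total for each covariate."""
    n = summary["n"]
    return {name: n - sum(counts) for name, counts in summary.items() if name != "n"}

gaps = {cohort: shortfalls(summary) for cohort, summary in reported.items()}
```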

About this article

Cite this article

Deng, S., Chen, J. & Shi, H. Integrative analysis of multiple types of genomic data using an accelerated failure time frailty model. Comput Stat 36, 1499–1532 (2021). https://doi.org/10.1007/s00180-020-01060-5

Received: 18 June 2019
Accepted: 15 December 2020
Published: 03 February 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s00180-020-01060-5
