The benefit of data-based model complexity selection via prediction error curves in time-to-event data

Porzelius, Christine; Schumacher, Martin; Binder, Harald

doi:10.1007/s00180-011-0236-6

Original Paper
Published: 12 February 2011

Computational Statistics, Volume 26, pages 293–302 (2011)

Christine Porzelius^1,2,
Martin Schumacher^1,2 &
Harald Binder^1,2

Abstract

The fitting of predictive survival models usually involves determining model complexity parameters. Up to now, there has been no generally applicable model selection criterion for semi- or non-parametric approaches. The integrated prediction error curve, an estimator of the integrated Brier score, can close this gap: it allows a reasonable, data-based choice of complexity parameters for any kind of model from which risk predictions can be obtained. Random survival forests are used as an example throughout the article. Here, a critical complexity parameter might be the number of candidate variables at each node. Model selection by our integrated prediction error curve criterion is compared to a frequently used rule of thumb, investigating the potential benefit regarding prediction performance. For that purpose, simulated microarray survival data as well as two real data sets, one of patients with diffuse large-B-cell lymphoma and one of patients with neuroblastoma, are used. It is shown that the optimal parameter value depends on the amount of information in the data and that a data-based selection can therefore be beneficial in several settings.
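
The selection principle described in the abstract can be illustrated with a minimal sketch. It assumes that a prediction error curve (estimated Brier score over time) has already been computed for each candidate value of the complexity parameter, for example by resampling. The function names, the time grid, and the toy curve values below are hypothetical and are not taken from the article; the sketch integrates each curve with the trapezoidal rule and picks the candidate with the smallest integral, which is only one plausible reading of the criterion.

```python
def integrated_prediction_error(times, errors):
    """Trapezoidal integral of a prediction error curve over a time grid."""
    if len(times) != len(errors) or len(times) < 2:
        raise ValueError("times and errors need equal length >= 2")
    total = 0.0
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]
        mean_height = (errors[i] + errors[i + 1]) / 2.0
        total += width * mean_height
    return total


def select_complexity(times, curves):
    """Return the candidate whose curve has the smallest integral, plus all scores.

    `curves` maps each candidate complexity value (here: a hypothetical
    number of candidate variables per node) to its prediction error curve.
    """
    scores = {
        value: integrated_prediction_error(times, curve)
        for value, curve in curves.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Toy, made-up curves for three hypothetical candidate values.
times = [0.0, 1.0, 2.0, 3.0]
curves = {
    1: [0.0, 0.20, 0.25, 0.25],
    10: [0.0, 0.15, 0.20, 0.22],
    100: [0.0, 0.18, 0.24, 0.26],
}

best, scores = select_complexity(times, curves)
print(best, scores)
```

In this toy input the middle candidate has the lowest curve throughout, so it is selected. With real data the curves may cross, which is where integrating over the whole time range, rather than comparing at a single time point, matters.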

Author information

Authors and Affiliations

Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Stefan-Meier-Str. 26, 79104, Freiburg, Germany
Christine Porzelius, Martin Schumacher & Harald Binder
Freiburg Center for Data Analysis and Modeling, University of Freiburg, Eckerstr. 1, 79104, Freiburg, Germany
Christine Porzelius, Martin Schumacher & Harald Binder

Corresponding author

Correspondence to Christine Porzelius.

About this article

Cite this article

Porzelius, C., Schumacher, M. & Binder, H. The benefit of data-based model complexity selection via prediction error curves in time-to-event data. Comput Stat 26, 293–302 (2011). https://doi.org/10.1007/s00180-011-0236-6

Received: 29 September 2009
Accepted: 29 January 2011
Published: 12 February 2011
Issue Date: June 2011
DOI: https://doi.org/10.1007/s00180-011-0236-6
