The benefit of data-based model complexity selection via prediction error curves in time-to-event data

  • Original Paper
Computational Statistics

Abstract

The fitting of predictive survival models usually involves determining model complexity parameters. Until now, there has been no generally applicable model selection criterion for semi- or non-parametric approaches. The integrated prediction error curve, an estimator of the integrated Brier score, can close this gap: it allows a reasonable, data-based choice of complexity parameters for any kind of model from which risk predictions can be obtained. Random survival forests are used as an example throughout the article. Here, a critical complexity parameter might be the number of candidate variables at each node. Model selection by our integrated prediction error curve criterion is compared to a frequently used rule of thumb to investigate the potential benefit in prediction performance. For this purpose, simulated microarray survival data as well as two real data sets, one of patients with diffuse large-B-cell lymphoma and one of patients with neuroblastoma, are used. It is shown that the optimal parameter value depends on the amount of information in the data and that a data-based selection can therefore be beneficial in several settings.
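The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fully observed event times (no censoring), whereas an estimator for right-censored data would additionally need weighting to account for censoring, and in practice the curve would be estimated on data not used for fitting. The function names, candidate labels, and toy numbers are all hypothetical.

```python
import numpy as np

def brier_curve(event_times, surv_pred, grid):
    """Brier score at each grid time, ASSUMING NO CENSORING.

    surv_pred[i, j] is the predicted probability that subject i
    is still event-free at time grid[j].
    """
    event_times = np.asarray(event_times, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Observed status: 1 if subject i is event-free at grid[j], else 0.
    status = (event_times[:, None] > grid[None, :]).astype(float)
    return np.mean((status - np.asarray(surv_pred, dtype=float)) ** 2, axis=0)

def integrated_error(grid, curve):
    """Trapezoidal integral of a prediction error curve over the grid."""
    grid = np.asarray(grid, dtype=float)
    curve = np.asarray(curve, dtype=float)
    return float(np.sum(np.diff(grid) * (curve[:-1] + curve[1:]) / 2.0))

def select_complexity(event_times, predictions, grid):
    """Return the candidate with the smallest integrated error, plus all scores.

    predictions maps each candidate complexity value (hypothetical labels
    here) to its matrix of predicted survival probabilities.
    """
    scores = {
        cand: integrated_error(grid, brier_curve(event_times, pred, grid))
        for cand, pred in predictions.items()
    }
    best = min(scores, key=scores.get)
    return best, scores

# Toy illustration with made-up numbers: four subjects, four grid times.
event_times = [1.0, 2.0, 3.0, 4.0]
grid = [0.5, 1.5, 2.5, 3.5]
perfect = (np.asarray(event_times)[:, None] > np.asarray(grid)[None, :]).astype(float)
uninformative = np.full((4, 4), 0.5)

best, scores = select_complexity(
    event_times,
    {"candidate_a": perfect, "candidate_b": uninformative},
    grid,
)
print(best, scores)
```

In a random survival forest setting, the dictionary keys would be the candidate numbers of variables tried at each node, and each matrix would hold the corresponding forest's predicted survival probabilities.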

Author information

Corresponding author

Correspondence to Christine Porzelius.

About this article

Cite this article

Porzelius, C., Schumacher, M. & Binder, H. The benefit of data-based model complexity selection via prediction error curves in time-to-event data. Comput Stat 26, 293–302 (2011). https://doi.org/10.1007/s00180-011-0236-6

  • DOI: https://doi.org/10.1007/s00180-011-0236-6
