Abstract
The fitting of predictive survival models usually involves determination of model complexity parameters. Until now, there has been no generally applicable model selection criterion for semi- or non-parametric approaches. The integrated prediction error curve, an estimator of the integrated Brier score, can close this gap: it allows a reasonable, data-based choice of complexity parameters for any kind of model from which risk predictions can be obtained. Random survival forests are used as an example throughout the article. Here, a critical complexity parameter might be the number of candidate variables at each node. Model selection by our integrated prediction error curve criterion is compared to a frequently used rule of thumb to investigate the potential benefit in prediction performance. For this purpose, simulated microarray survival data as well as two real data sets, one of patients with diffuse large-B-cell lymphoma and one of patients with neuroblastoma, are used. It is shown that the optimal parameter value depends on the amount of information in the data and that a data-based selection can therefore be beneficial in several settings.
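The selection idea can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes strong simplifying assumptions: there is no censoring (so no inverse-probability-of-censoring weights are needed), the error is computed on the same toy observations rather than by resampling, and the candidate predictions are hand-made matrices rather than output of a random survival forest. All function names, the time grid, and the candidate parameter values 1 and 10 are hypothetical.

```python
import numpy as np

def brier_curve(event_times, surv_pred, grid):
    """Prediction error curve: Brier score at each grid time.

    Assumes fully observed (uncensored) event times.
    surv_pred[i, j] is the predicted probability that subject i
    is still event-free at time grid[j].
    """
    # Observed status: 1 if subject i is event-free at grid[j], else 0.
    status = (event_times[:, None] > grid[None, :]).astype(float)
    return np.mean((status - surv_pred) ** 2, axis=0)

def integrated_error(event_times, surv_pred, grid):
    """Integrate the prediction error curve over the grid (trapezoidal rule)."""
    bs = brier_curve(event_times, surv_pred, grid)
    return float(np.sum((bs[1:] + bs[:-1]) / 2.0 * np.diff(grid)))

def select_complexity(event_times, preds_by_param, grid):
    """Return the candidate parameter value with the smallest integrated error."""
    scores = {
        param: integrated_error(event_times, pred, grid)
        for param, pred in preds_by_param.items()
    }
    best = min(scores, key=scores.get)
    return best, scores

# Toy data: four subjects with observed event times, four evaluation times.
event_times = np.array([1.0, 2.0, 3.0, 4.0])
grid = np.array([0.5, 1.5, 2.5, 3.5])

# Hypothetical predictions for two candidate complexity values.
preds_by_param = {
    1: np.full((4, 4), 0.5),  # uninformative: 0.5 everywhere
    10: (event_times[:, None] > grid[None, :]).astype(float),  # perfect
}

best, scores = select_complexity(event_times, preds_by_param, grid)
print(best, scores)
```

With right-censored data, the squared errors would have to be reweighted to account for censoring, and the error would normally be estimated on data not used for fitting; the sketch leaves both out to keep the selection step itself visible.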
Cite this article
Porzelius, C., Schumacher, M. & Binder, H. The benefit of data-based model complexity selection via prediction error curves in time-to-event data. Comput Stat 26, 293–302 (2011). https://doi.org/10.1007/s00180-011-0236-6
DOI: https://doi.org/10.1007/s00180-011-0236-6