Abstract
Cross-validation (CV) is widely used for choosing and evaluating statistical models. The main purpose of this study is to explore the behavior of CV in tree-based models. We pursue this goal experimentally, comparing a cross-validated tree classifier with the Bayes classifier, which is ideal for the underlying distribution. The main observation of this study is that the differences between the testing and training errors of a cross-validated tree classifier and of the Bayes classifier empirically follow a linear regression relation. The slope and the coefficient of determination of this regression model can serve as performance measures for a cross-validated tree classifier. Moreover, simulation reveals that the performance of a cross-validated tree classifier depends on the geometry and parameters of the underlying distributions and on the sample size. Our study helps explain, evaluate, and justify the use of CV in tree-based models when the sample size is relatively small.
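The experimental design described above can be sketched in miniature. The following is a hypothetical illustration, not the paper's actual simulation: it assumes a one-dimensional two-Gaussian mixture (class 0 from N(0,1), class 1 from N(2,1), equal priors, so the Bayes rule is "predict 1 iff x > 1"), reduces the tree to a choice between depth 0 (majority vote) and depth 1 (a stump) made by 5-fold CV, and then regresses the CV classifier's test-minus-training error difference on the Bayes classifier's across replications to obtain a slope and coefficient of determination.

```python
import random

def sample(n, rng):
    """Draw n labeled points: class 0 ~ N(0,1), class 1 ~ N(2,1), equal priors."""
    out = []
    for _ in range(n):
        y = rng.randint(0, 1)
        out.append((rng.gauss(2.0 * y, 1.0), y))
    return out

def err(model, data):
    """Misclassification rate of a predictor model(x) -> {0, 1}."""
    return sum(model(x) != y for x, y in data) / len(data)

def fit_majority(data):
    """Depth-0 'tree': always predict the majority class of the sample."""
    label = int(sum(y for _, y in data) * 2 >= len(data))
    return lambda x: label

def fit_stump(data):
    """Depth-1 'tree': threshold minimizing training error."""
    best = min((x for x, _ in data),
               key=lambda t: err(lambda x: int(x > t), data))
    return lambda x: int(x > best)

def fit_cv_tree(train, k=5):
    """k-fold CV chooses between the two depths, then refits on all of train."""
    folds = [train[i::k] for i in range(k)]
    def cv_error(fitter):
        total = 0.0
        for i in range(k):
            rest = [p for j, f in enumerate(folds) if j != i for p in f]
            total += err(fitter(rest), folds[i])
        return total / k
    winner = min((fit_majority, fit_stump), key=cv_error)
    return winner(train)

def run(reps=200, n_train=50, n_test=500, seed=0):
    """Replicate the experiment; return OLS slope and R^2 of the regression
    of (test - train) error of the CV tree on that of the Bayes rule."""
    rng = random.Random(seed)
    bayes = lambda x: int(x > 1.0)  # exact Bayes rule for this mixture
    d_cv, d_bayes = [], []
    for _ in range(reps):
        train, test = sample(n_train, rng), sample(n_test, rng)
        tree = fit_cv_tree(train)
        d_cv.append(err(tree, test) - err(tree, train))
        d_bayes.append(err(bayes, test) - err(bayes, train))
    m = len(d_cv)
    mx, my = sum(d_bayes) / m, sum(d_cv) / m
    sxx = sum((x - mx) ** 2 for x in d_bayes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(d_bayes, d_cv))
    syy = sum((y - my) ** 2 for y in d_cv)
    slope = sxy / sxx
    r2 = sxy * sxy / (sxx * syy) if syy > 0 else 0.0
    return slope, r2

if __name__ == "__main__":
    slope, r2 = run()
    print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

Under this toy setup, a slope near 1 with high R^2 would indicate that the cross-validated classifier's generalization gap tracks the irreducible gap of the Bayes rule; the paper's actual distributions, tree algorithm, and sample sizes differ from those assumed here.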
Acknowledgements
The authors would like to thank the editors and two anonymous reviewers whose comments helped significantly to improve the quality of this paper.
Cite this article
Kim, S.B., Huo, X. & Tsui, KL. A finite-sample simulation study of cross validation in tree-based models. Inf Technol Manag 10, 223–233 (2009). https://doi.org/10.1007/s10799-009-0052-7