
A finite-sample simulation study of cross validation in tree-based models

Published in Information Technology and Management

Abstract

Cross validation (CV) has been widely used for choosing and evaluating statistical models. The main purpose of this study is to explore the behavior of CV in tree-based models. We achieve this goal through an experimental approach that compares a cross-validated tree classifier with the Bayes classifier, which is ideal for the underlying distribution. The main observation of this study is that the differences between the testing and training errors of a cross-validated tree classifier and of the Bayes classifier empirically follow a linear regression relationship. The slope and the coefficient of determination of the regression model can serve as performance measures of a cross-validated tree classifier. Moreover, the simulations reveal that the performance of a cross-validated tree classifier depends on the geometry and parameters of the underlying distributions as well as the sample size. Our study can explain, evaluate, and justify the use of CV in tree-based models when the sample size is relatively small.
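To illustrate the kind of experiment the abstract describes, the minimal sketch below simulates a two-class Gaussian problem, fits a decision tree whose cost-complexity pruning level is chosen by 10-fold cross validation, computes the test-minus-training error gap for both the cross-validated tree and the known Bayes rule, and regresses one gap on the other to obtain a slope and coefficient of determination. The distributions, sample sizes, pruning grid, and number of replications are assumptions made for illustration only; they are not taken from the paper.

```python
# Illustrative sketch only: the paper's exact simulation design is not given in
# the abstract, so the distributions, sample size, CV settings, and number of
# replications below are assumptions.
import numpy as np
from scipy.stats import linregress
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def simulate(n):
    """Two-class Gaussian data with known parameters (assumed setup)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 1.0, scale=1.0, size=(n, 2))
    return X, y

def bayes_predict(X):
    """Bayes rule for the assumed Gaussians: closer class mean wins
    (equal priors, identity covariances)."""
    d0 = np.sum((X - 0.0) ** 2, axis=1)
    d1 = np.sum((X - 1.0) ** 2, axis=1)
    return (d1 < d0).astype(int)

gaps_cv, gaps_bayes = [], []
for _ in range(200):                       # Monte Carlo replications (assumed)
    X, y = simulate(200)                   # finite sample size (assumed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # Cross-validated tree: choose the pruning level (ccp_alpha) by 10-fold CV.
    grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                        {"ccp_alpha": np.linspace(0.0, 0.05, 11)}, cv=10)
    grid.fit(X_tr, y_tr)
    tree = grid.best_estimator_
    gaps_cv.append((1 - tree.score(X_te, y_te)) - (1 - tree.score(X_tr, y_tr)))

    # The Bayes classifier has no training step; its gap reflects sampling noise.
    gaps_bayes.append(np.mean(bayes_predict(X_te) != y_te)
                      - np.mean(bayes_predict(X_tr) != y_tr))

# Regress the tree's test-training error gap on the Bayes classifier's gap;
# the slope and R^2 summarize how closely the CV tree tracks the ideal classifier.
fit = linregress(gaps_bayes, gaps_cv)
print(f"slope = {fit.slope:.3f}, R^2 = {fit.rvalue**2:.3f}")
```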





Acknowledgements

The authors would like to thank the editors and two anonymous reviewers, whose comments significantly improved the quality of this paper.

Author information

Corresponding author

Correspondence to Seoung Bum Kim.


About this article

Cite this article

Kim, S.B., Huo, X. & Tsui, KL. A finite-sample simulation study of cross validation in tree-based models. Inf Technol Manag 10, 223–233 (2009). https://doi.org/10.1007/s10799-009-0052-7
