Abstract
Cross-validation (CV) is widely used for choosing and evaluating statistical models. The main purpose of this study is to explore the behavior of CV in tree-based models. We pursue this goal experimentally, comparing a cross-validated tree classifier with the Bayes classifier, which is ideal for the underlying distribution. The main observation of this study is that the differences between the testing and training errors of a cross-validated tree classifier and of the Bayes classifier empirically follow a linear regression relation. The slope and the coefficient of determination of this regression model can serve as performance measures for a cross-validated tree classifier. Moreover, simulation reveals that the performance of a cross-validated tree classifier depends on the geometry and parameters of the underlying distributions and on the sample size. Our study helps explain, evaluate, and justify the use of CV in tree-based models when the sample size is relatively small.
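The experimental design described above can be sketched in miniature. The following is a hypothetical illustration, not the paper's actual simulation: it assumes a one-dimensional two-Gaussian mixture (class 0 from N(0,1), class 1 from N(2,1), equal priors, so the Bayes rule is "predict 1 iff x > 1"), reduces the tree to a choice between depth 0 (majority vote) and depth 1 (a stump) made by 5-fold CV, and then regresses the CV classifier's test-minus-training error difference on the Bayes classifier's across replications to obtain a slope and coefficient of determination.

```python
import random

def sample(n, rng):
    """Draw n labeled points: class 0 ~ N(0,1), class 1 ~ N(2,1), equal priors."""
    out = []
    for _ in range(n):
        y = rng.randint(0, 1)
        out.append((rng.gauss(2.0 * y, 1.0), y))
    return out

def err(model, data):
    """Misclassification rate of a predictor model(x) -> {0, 1}."""
    return sum(model(x) != y for x, y in data) / len(data)

def fit_majority(data):
    """Depth-0 'tree': always predict the majority class of the sample."""
    label = int(sum(y for _, y in data) * 2 >= len(data))
    return lambda x: label

def fit_stump(data):
    """Depth-1 'tree': threshold minimizing training error."""
    best = min((x for x, _ in data),
               key=lambda t: err(lambda x: int(x > t), data))
    return lambda x: int(x > best)

def fit_cv_tree(train, k=5):
    """k-fold CV chooses between the two depths, then refits on all of train."""
    folds = [train[i::k] for i in range(k)]
    def cv_error(fitter):
        total = 0.0
        for i in range(k):
            rest = [p for j, f in enumerate(folds) if j != i for p in f]
            total += err(fitter(rest), folds[i])
        return total / k
    winner = min((fit_majority, fit_stump), key=cv_error)
    return winner(train)

def run(reps=200, n_train=50, n_test=500, seed=0):
    """Replicate the experiment; return OLS slope and R^2 of the regression
    of (test - train) error of the CV tree on that of the Bayes rule."""
    rng = random.Random(seed)
    bayes = lambda x: int(x > 1.0)  # exact Bayes rule for this mixture
    d_cv, d_bayes = [], []
    for _ in range(reps):
        train, test = sample(n_train, rng), sample(n_test, rng)
        tree = fit_cv_tree(train)
        d_cv.append(err(tree, test) - err(tree, train))
        d_bayes.append(err(bayes, test) - err(bayes, train))
    m = len(d_cv)
    mx, my = sum(d_bayes) / m, sum(d_cv) / m
    sxx = sum((x - mx) ** 2 for x in d_bayes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(d_bayes, d_cv))
    syy = sum((y - my) ** 2 for y in d_cv)
    slope = sxy / sxx
    r2 = sxy * sxy / (sxx * syy) if syy > 0 else 0.0
    return slope, r2

if __name__ == "__main__":
    slope, r2 = run()
    print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

Under this toy setup, a slope near 1 with high R^2 would indicate that the cross-validated classifier's generalization gap tracks the irreducible gap of the Bayes rule; the paper's actual distributions, tree algorithm, and sample sizes differ from those assumed here.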
Acknowledgements
The authors would like to thank the editors and two anonymous reviewers whose comments helped significantly to improve the quality of this paper.
Cite this article
Kim, S.B., Huo, X. & Tsui, KL. A finite-sample simulation study of cross validation in tree-based models. Inf Technol Manag 10, 223–233 (2009). https://doi.org/10.1007/s10799-009-0052-7