Abstract
An ensemble of random decision trees is a popular classification technique, known especially for its ability to scale to large domains. In this paper, we provide an efficient strategy for computing bounds on the moments of the generalization error of this classifier, where the moments are taken over all datasets of a particular size drawn from an underlying distribution. Estimating these moments gives insight into the performance of the model. As the experimental section shows, these bounds tend to be significantly tighter than the state-of-the-art bounds of Breiman, which are based on strength and correlation, and are hence more useful in practice.
Notes
These probabilities and \( P\left[ Y(x)\!\ne \!y \right] \) are conditioned on \(x\). We omit the explicit conditioning, since doing so improves readability and the conditioning is obvious from the context.
This is after splitting the continuous attributes.
Partitioned into three categories: high, medium and low.
References
Anandkumar A, Foster D, Hsu D, Kakade S, Liu Y (2012) A spectral algorithm for latent Dirichlet allocation. In: NIPS. Lake Tahoe, USA, pp 926–934
Boots B, Gordon G (2012) Two manifold problems with applications to nonlinear system identification. In: ICML. Edinburgh, Scotland, UK, p 338
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Bshouty N, Long P (2010) Finding planted partitions in nearly linear time using arrested spectral clustering. In: ICML. Haifa, Israel, pp 135–142
Buttrey S, Kobayashi I (2003) On strength and correlation in random forests. In: Proceedings of the 2003 joint statistical meetings, section on statistical computing
Connor-Linton J (2003) Chi square tutorial. http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html
Dhurandhar A, Dobra A (2008) Probabilistic characterization of random decision trees. J Mach Learn Res 9:2321–2348
Dhurandhar A, Dobra A (2009) Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Trans Knowl Discov Data Min
Dhurandhar A, Dobra A (2012) Distribution free bounds for relational classification. Knowl Inf Syst
Dhurandhar A, Dobra A (2012) Probabilistic characterization of nearest neighbor classifiers. Int J Mach Learn Cybern
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York, p 654
Fan W, Wang H, Yu PS, Ma S (2003) Is random model better? On its accuracy and efficiency. In: ICDM ’03: proceedings of the third IEEE international conference on data mining, IEEE Computer Society, Washington, DC, USA, pp 51–58
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42
Hastie T, Tibshirani R, Friedman J (2001) Elements of statistical learning, 2nd edn. Springer, Berlin
Langford J (2005) Tutorial on practical prediction theory for classification. J Mach Learn Res 6:273–306
Liu F, Ting K, Fan W (2005) Maximizing tree diversity by building complete-random decision trees. In: PAKDD, pp 605–610
McAllester D (1999) PAC-Bayesian model averaging. In: Proceedings of the twelfth annual conference on computational learning theory. ACM Press, pp 164–170
McAllester D (2003) Simplified PAC-Bayesian margin bounds. In: COLT, pp 203–215
Roy S, Bose R (1953) Simultaneous confidence interval estimation. Ann Math Stat 24(3):513–536
Sison C, Glaz J (1995) Simultaneous confidence intervals and sample size determination for multinomial proportions. JASA 90(429):366–369
Tong Y (1980) Probabilistic inequalities for multivariate distributions, 1st edn. Academic Press, Waltham
Zhang K, Fan W (2008) Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowl Inf Syst 14(3):299–326
Zhang X, Yuan Q, Zhao S, Fan W, Zheng W, Wang Z (2010) Multi-label classification without the multi-label cost. In: SDM ’10: proceedings of the SIAM conference on data mining, pp 778–789
Acknowledgments
I would like to thank the editor and the anonymous reviewers for their constructive comments. I would also like to thank Katherine Dhurandhar for proofreading the paper.
About this article
Cite this article
Dhurandhar, A. Bounds on the moments for an ensemble of random decision trees. Knowl Inf Syst 44, 279–298 (2015). https://doi.org/10.1007/s10115-014-0768-5
DOI: https://doi.org/10.1007/s10115-014-0768-5