Bounds on the moments for an ensemble of random decision trees

  • Regular Paper
  • Knowledge and Information Systems

Abstract

An ensemble of random decision trees is a popular classification technique, especially known for its ability to scale to large domains. In this paper, we provide an efficient strategy to compute bounds on the moments of the generalization error of this technique, where the moments are taken over all datasets of a particular size drawn from an underlying distribution. Being able to estimate these moments gives insight into the performance of the model. As the experimental section shows, these bounds tend to be significantly tighter than the state-of-the-art bounds of Breiman, which are based on strength and correlation, and are hence more useful in practice.

Notes

  1. These probabilities and \( P\left[ Y(x)\!\ne \!y \right] \) are conditioned on \(x\). We omit the explicit conditioning because doing so improves readability and the conditioning is obvious from the context.

  2. For further details refer to [3] and [5].

  3. This is after splitting the continuous attributes.

  4. Partitioned into three categories: high, medium and low.

References

  1. Anandkumar A, Foster D, Hsu D, Kakade S, Liu Y (2012) A spectral algorithm for latent Dirichlet allocation. In: NIPS. Lake Tahoe, USA, pp 926–934

  2. Boots B, Gordon G (2012) Two manifold problems with applications to nonlinear system identification. In: ICML. Edinburgh, Scotland, UK, p 338

  3. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  4. Bshouty N, Long P (2010) Finding planted partitions in nearly linear time using arrested spectral clustering. In: ICML. Haifa, Israel, pp 135–142

  5. Buttrey S, Kobayashi I (2003) On strength and correlation in random forests. In: Proceedings of the 2003 joint statistical meetings, section on statistical computing

  6. Connor-Linton J (2003) Chi square tutorial. http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html

  7. Dhurandhar A, Dobra A (2008) Probabilistic characterization of random decision trees. J Mach Learn Res 9:2321–2348

  8. Dhurandhar A, Dobra A (2009) Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Trans Knowl Discov Data Min

  9. Dhurandhar A, Dobra A (2012) Distribution free bounds for relational classification. Knowl Inf Syst

  10. Dhurandhar A, Dobra A (2012) Probabilistic characterization of nearest neighbor classifiers. Int J Mach Learn Cybern

  11. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, New York, p 654

  12. Fan W, Wang H, Yu PS, Ma S (2003) Is random model better? On its accuracy and efficiency. In: ICDM ’03: proceedings of the third IEEE international conference on data mining, IEEE Computer Society, Washington, DC, USA, pp 51–58

  13. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42

  14. Hastie T, Tibshirani R, Friedman J (2001) Elements of statistical learning, 2nd edn. Springer, Berlin

  15. Langford J (2005) Tutorial on practical prediction theory for classification. J Mach Learn Res 6:273–306

  16. Liu F, Ting K, Fan W (2005) Maximizing tree diversity by building complete-random decision trees. In: PAKDD, pp 605–610

  17. McAllester D (1999) PAC-Bayesian model averaging. In: Proceedings of the twelfth annual conference on computational learning theory. ACM Press, pp 164–170

  18. McAllester D (2003) Simplified PAC-Bayesian margin bounds. In: COLT, pp 203–215

  19. Roy S, Bose R (1953) Simultaneous confidence interval estimation. Ann Math Stat 24(3):513–536

  20. Sison C, Glaz J (1995) Simultaneous confidence intervals and sample size determination for multinomial proportions. JASA 90(429):366–369

  21. Tong Y (1980) Probabilistic inequalities for multivariate distributions, 1st edn. Academic Press, Waltham

  22. Zhang K, Fan W (2008) Forecasting skewed biased stochastic ozone days: analyses, solutions and beyond. Knowl Inf Syst 14(3):299–326

  23. Zhang X, Yuan Q, Zhao S, Fan W, Zheng W, Wang Z (2010) Multi-label classification without the multi-label cost. In: SDM ’10: proceedings of the SIAM conference on data mining, pp 778–789

Acknowledgments

I would like to thank the editor and the anonymous reviewers for their constructive comments. I would also like to thank Katherine Dhurandhar for proofreading the paper.

Author information

Corresponding author

Correspondence to Amit Dhurandhar.

About this article

Cite this article

Dhurandhar, A. Bounds on the moments for an ensemble of random decision trees. Knowl Inf Syst 44, 279–298 (2015). https://doi.org/10.1007/s10115-014-0768-5

  • DOI: https://doi.org/10.1007/s10115-014-0768-5
