Skip to main content

On the Good Behaviour of Extremely Randomized Trees in Random Forest-Distance Computation

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14172))

Abstract

Originally introduced in the context of supervised classification, ensembles of Extremely Randomized Trees (ERT) have shown to provide surprisingly effective models also in unsupervised settings, e.g., for anomaly detection (via Isolation Forests) and for distance computation. In this paper, we focus on this latter application of ERT, namely in the context of Random Forest (RF) distance computation. We aim at narrowing the gap between the established empirical evidence of the good behaviour of ERT and the still limited theoretical understanding of their (somehow) surprisingly good performance when compared to more involved methodologies. Our main contribution is the following: we assume the existence of a proper representation on a given domain, i.e., a vectorial representation of the objects which satisfies the Compactness Hypothesis formulated by Arkadev and Braverman in 1967. Under such a hypothesis, given the “true” distance between two objects, we show how to derive a bound on the approximation guaranteed by two main RF-distances obtained by employing ensembles of ERTs, with respect to such “true” distance. In other words, we show that there exists a constant c such that if two objects are \(\epsilon \)-close in the true distance, then with high probability they are \((c \cdot \epsilon )\)-close in the RF-distances computed with ERT forests.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    Please note that to ease the computation this formulation is the squared version of the original formulation of the distance, as given in [23].

  2. 2.

    Also in this case, to simplify the computation, we remove the squared root from the original definition of the distance given in [6].

References

  1. Arkadev, A.G., Braverman, E.M.: Teaching Computers to Recognize Patterns. Academic, Transl. from the Russian by W. Turski and J.D. Cowan (1967)

    Google Scholar 

  2. Aryal, S., Ting, K., Washio, T., Haffari, G.: A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min. Knowl. Discov. 34(1), 124–162 (2020)

    Article  MathSciNet  MATH  Google Scholar 

  3. Aryal, S., Ting, K.M., Haffari, G., Washio, T.: Mp-Dissimilarity: a data dependent dissimilarity measure. In: 2014 IEEE International Conference on Data Mining, pp. 707–712. IEEE (2014)

    Google Scholar 

  4. Aryal, S., Ting, K.M., Washio, T., Haffari, G.: Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl. Inf. Syst. 53(2), 479–506 (2017). https://doi.org/10.1007/s10115-017-1046-0

    Article  Google Scholar 

  5. Bicego, M., Escolano, F.: On learning random forests for random forest-clustering. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 3451–3458. IEEE (2021)

    Google Scholar 

  6. Bicego, M., Cicalese, F., Mensi, A.: RatioRF: a novel measure for random forest clustering based on the Tversky’s ratio model. IEEE Trans. Knowl. Data Eng. 35(1), 830–841 (2023)

    Google Scholar 

  7. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)

    Article  MATH  Google Scholar 

  8. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and regression trees. Wadsworth (1984)

    Google Scholar 

  9. Breiman, L.: Some infinity theory for predictor ensembles. Tech. Rep. CiteSeer (2000)

    Google Scholar 

  10. Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graph. Vis. 7(2–3), 81–227 (2012)

    MATH  Google Scholar 

  11. Davies, A., Ghahramani, Z.: The random forest Kernel and other Kernels for big data from random partitions. arXiv preprint arXiv:1402.4293 (2014)

  12. Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 406–421 (2018)

    Article  MATH  Google Scholar 

  13. Duin, R.P., Pekalska, E.: Dissimilarity representation for pattern recognition. Foundations and applications, vol. 64. World scientific (2005)

    Google Scholar 

  14. Duin, R.: Compactness and complexity of pattern recognition problems. In: Proceedings of the International Symposium on Pattern Recognition “In Memoriam Pierre Devijver”, pp. 124–128. Royal Military Academy (1999)

    Google Scholar 

  15. Emmott, A.F., Das, S., Dietterich, T., Fern, A., Wong, W.K.: Systematic construction of anomaly detection benchmarks from real data. In: Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pp. 16–21 (2013)

    Google Scholar 

  16. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)

    Article  MATH  Google Scholar 

  17. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE (2008)

    Google Scholar 

  18. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1–39 (2012)

    Article  Google Scholar 

  19. Mitzenmacher, M., Upfal, E.: Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press (2005)

    Google Scholar 

  20. Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: Advances in Neural Information Processing Systems 19, pp. 985–992 (2006)

    Google Scholar 

  21. Quinlan, J.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc. (1993)

    Google Scholar 

  22. Scornet, E.: Random forests and Kernel methods. IEEE Trans. Inf. Theory 62(3), 1485–1500 (2016)

    Article  MathSciNet  MATH  Google Scholar 

  23. Shi, T., Horvath, S.: Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 15(1), 118–138 (2006)

    Article  MathSciNet  Google Scholar 

  24. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008) (2008)

    Google Scholar 

  25. Ting, K.M., Wells, J.R., Washio, T.: Isolation Kernel: the X factor in efficient and effective large scale online kernel learning. Data Min. Knowl. Disc. 35(6), 2282–2312 (2021)

    Article  MathSciNet  MATH  Google Scholar 

  26. Ting, K.M., Xu, B.C., Washio, T., Zhou, Z.H.: Isolation distributional Kernel: a new tool for kernel based anomaly detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 198–206 (2020)

    Google Scholar 

  27. Ting, K.M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., Zhou, Z.H.: Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach. Learn. 108, 331–376 (2019). https://doi.org/10.1007/s10994-018-5737-x

    Article  MathSciNet  MATH  Google Scholar 

  28. Ting, K.M., Zhu, Y., Zhou, Z.H.: Isolation Kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2329–2337 (2018)

    Google Scholar 

  29. Ting, K., Zhu, Y., Carman, M., Zhu, Y., Zhou, Z.H.: Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 1205–1214 (2016)

    Google Scholar 

  30. Tversky, A.: Features of similarity. Psychol. Rev. 84(4), 327 (1977)

    Google Scholar 

  31. Wells, J.R., Aryal, S., Ting, K.M.: Simple supervised dissimilarity measure: bolstering iForest-induced similarity with class information without learning. Knowl. Inf. Syst. 62, 3203–3216 (2020)

    Article  Google Scholar 

  32. Zhu, X., Loy, C., Gong, S.: Constructing robust affinity graphs for spectral clustering. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 1450–1457 (2014)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Manuele Bicego .

Editor information

Editors and Affiliations

Ethics declarations

Ethical Statement

We do not see any evident ethical implication of our submission. Our paper is mainly theoretical, not involving the collection and processing of personal data, or the inference of personal information. We do not see any potential use of our work for military applications.

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Bicego, M., Cicalese, F. (2023). On the Good Behaviour of Extremely Randomized Trees in Random Forest-Distance Computation. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_38

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-43421-1_38

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics