Abstract
Originally introduced in the context of supervised classification, ensembles of Extremely Randomized Trees (ERT) have been shown to provide surprisingly effective models also in unsupervised settings, e.g., for anomaly detection (via Isolation Forests) and for distance computation. In this paper, we focus on the latter application of ERT, namely Random Forest (RF) distance computation. We aim to narrow the gap between the established empirical evidence of the good behaviour of ERT and the still limited theoretical understanding of their (somewhat) surprisingly good performance compared to more involved methodologies. Our main contribution is the following: we assume the existence of a proper representation of a given domain, i.e., a vectorial representation of the objects that satisfies the Compactness Hypothesis formulated by Arkadev and Braverman in 1967. Under this hypothesis, given the “true” distance between two objects, we derive a bound on how well two main RF-distances, obtained by employing ensembles of ERTs, approximate such “true” distance. In other words, we show that there exists a constant c such that if two objects are \(\epsilon \)-close in the true distance, then with high probability they are \((c \cdot \epsilon )\)-close in the RF-distances computed with ERT forests.
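To make the object of study concrete, the following is a minimal, illustrative sketch of one common RF-distance built from extremely randomized trees: trees partition the space via random (feature, threshold) splits, and the distance between two objects is the fraction of trees in which they fall into different leaves. All names and parameters here are our own illustrative choices; the paper's formal definitions of the two RF-distances may differ in detail.

```python
import random

def build_tree(points, dims, depth, rng):
    """Recursively partition `points` with a random feature and a random
    threshold drawn uniformly between that feature's min and max (ERT-style)."""
    if depth == 0 or len(points) <= 1:
        return None  # leaf
    f = rng.randrange(dims)
    lo = min(p[f] for p in points)
    hi = max(p[f] for p in points)
    if lo == hi:
        return None  # cannot split on a constant feature
    t = rng.uniform(lo, hi)
    left = [p for p in points if p[f] < t]
    right = [p for p in points if p[f] >= t]
    return (f, t,
            build_tree(left, dims, depth - 1, rng),
            build_tree(right, dims, depth - 1, rng))

def leaf_id(tree, x):
    """Encode the root-to-leaf path of x as a bit string identifying its leaf."""
    path, node = "", tree
    while node is not None:
        f, t, lchild, rchild = node
        if x[f] < t:
            path, node = path + "0", lchild
        else:
            path, node = path + "1", rchild
    return path

def ert_distance(x, y, forest):
    """Shared-leaf RF-distance: fraction of trees separating x from y."""
    return sum(leaf_id(tr, x) != leaf_id(tr, y) for tr in forest) / len(forest)

# Build a small ERT forest on random 2-D data.
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(200)]
forest = [build_tree(data, 2, 6, random.Random(s)) for s in range(100)]
```

Intuitively, points that are close in the underlying geometry are rarely separated by a random split, so they share a leaf in most trees; the paper's bound formalizes this intuition for \(\epsilon \)-close pairs.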
References
Arkadev, A.G., Braverman, E.M.: Teaching Computers to Recognize Patterns. Academic Press, translated from the Russian by W. Turski and J.D. Cowan (1967)
Aryal, S., Ting, K., Washio, T., Haffari, G.: A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min. Knowl. Discov. 34(1), 124–162 (2020)
Aryal, S., Ting, K.M., Haffari, G., Washio, T.: MP-dissimilarity: a data dependent dissimilarity measure. In: 2014 IEEE International Conference on Data Mining, pp. 707–712. IEEE (2014)
Aryal, S., Ting, K.M., Washio, T., Haffari, G.: Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl. Inf. Syst. 53(2), 479–506 (2017). https://doi.org/10.1007/s10115-017-1046-0
Bicego, M., Escolano, F.: On learning random forests for random forest-clustering. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 3451–3458. IEEE (2021)
Bicego, M., Cicalese, F., Mensi, A.: RatioRF: a novel measure for random forest clustering based on the Tversky’s ratio model. IEEE Trans. Knowl. Data Eng. 35(1), 830–841 (2023)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth (1984)
Breiman, L.: Some infinity theory for predictor ensembles. Tech. Rep. CiteSeer (2000)
Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graph. Vis. 7(2–3), 81–227 (2012)
Davies, A., Ghahramani, Z.: The random forest kernel and other kernels for big data from random partitions. arXiv preprint arXiv:1402.4293 (2014)
Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 406–421 (2018)
Duin, R.P., Pekalska, E.: The Dissimilarity Representation for Pattern Recognition: Foundations and Applications, vol. 64. World Scientific (2005)
Duin, R.: Compactness and complexity of pattern recognition problems. In: Proceedings of the International Symposium on Pattern Recognition “In Memoriam Pierre Devijver”, pp. 124–128. Royal Military Academy (1999)
Emmott, A.F., Das, S., Dietterich, T., Fern, A., Wong, W.K.: Systematic construction of anomaly detection benchmarks from real data. In: Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pp. 16–21 (2013)
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE (2008)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1–39 (2012)
Mitzenmacher, M., Upfal, E.: Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press (2005)
Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: Advances in Neural Information Processing Systems 19, pp. 985–992 (2006)
Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc. (1993)
Scornet, E.: Random forests and kernel methods. IEEE Trans. Inf. Theory 62(3), 1485–1500 (2016)
Shi, T., Horvath, S.: Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 15(1), 118–138 (2006)
Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008) (2008)
Ting, K.M., Wells, J.R., Washio, T.: Isolation kernel: the X factor in efficient and effective large scale online kernel learning. Data Min. Knowl. Discov. 35(6), 2282–2312 (2021)
Ting, K.M., Xu, B.C., Washio, T., Zhou, Z.H.: Isolation distributional kernel: a new tool for kernel based anomaly detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 198–206 (2020)
Ting, K.M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., Zhou, Z.H.: Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach. Learn. 108, 331–376 (2019). https://doi.org/10.1007/s10994-018-5737-x
Ting, K.M., Zhu, Y., Zhou, Z.H.: Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2329–2337 (2018)
Ting, K., Zhu, Y., Carman, M., Zhu, Y., Zhou, Z.H.: Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 1205–1214 (2016)
Tversky, A.: Features of similarity. Psychol. Rev. 84(4), 327 (1977)
Wells, J.R., Aryal, S., Ting, K.M.: Simple supervised dissimilarity measure: bolstering iForest-induced similarity with class information without learning. Knowl. Inf. Syst. 62, 3203–3216 (2020)
Zhu, X., Loy, C., Gong, S.: Constructing robust affinity graphs for spectral clustering. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 1450–1457 (2014)
Ethics declarations
Ethical Statement
We do not see any evident ethical implications of our submission. Our paper is mainly theoretical and does not involve the collection or processing of personal data, nor the inference of personal information. We do not see any potential use of our work for military applications.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bicego, M., Cicalese, F. (2023). On the Good Behaviour of Extremely Randomized Trees in Random Forest-Distance Computation. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_38
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1