Abstract
Originally introduced in the context of supervised classification, ensembles of Extremely Randomized Trees (ERT) have been shown to provide surprisingly effective models also in unsupervised settings, e.g., for anomaly detection (via Isolation Forests) and for distance computation. In this paper, we focus on the latter application of ERT, namely Random Forest (RF) distance computation. We aim to narrow the gap between the established empirical evidence of the good behaviour of ERT and the still limited theoretical understanding of their (somewhat) surprisingly good performance compared to more involved methodologies. Our main contribution is the following: we assume the existence of a proper representation of a given domain, i.e., a vectorial representation of the objects that satisfies the Compactness Hypothesis formulated by Arkadev and Braverman in 1967. Under this hypothesis, given the “true” distance between two objects, we derive a bound on how well two main RF-distances, obtained by employing ensembles of ERTs, approximate such “true” distance. In other words, we show that there exists a constant c such that if two objects are \(\epsilon \)-close in the true distance, then with high probability they are \((c \cdot \epsilon )\)-close in the RF-distances computed with ERT forests.
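To make the object of study concrete, the following is a minimal, illustrative sketch of one common RF-distance built from extremely randomized trees: trees partition the space via random (feature, threshold) splits, and the distance between two objects is the fraction of trees in which they fall into different leaves. All names and parameters here are our own illustrative choices; the paper's formal definitions of the two RF-distances may differ in detail.

```python
import random

def build_tree(points, dims, depth, rng):
    """Recursively partition `points` with a random feature and a random
    threshold drawn uniformly between that feature's min and max (ERT-style)."""
    if depth == 0 or len(points) <= 1:
        return None  # leaf
    f = rng.randrange(dims)
    lo = min(p[f] for p in points)
    hi = max(p[f] for p in points)
    if lo == hi:
        return None  # cannot split on a constant feature
    t = rng.uniform(lo, hi)
    left = [p for p in points if p[f] < t]
    right = [p for p in points if p[f] >= t]
    return (f, t,
            build_tree(left, dims, depth - 1, rng),
            build_tree(right, dims, depth - 1, rng))

def leaf_id(tree, x):
    """Encode the root-to-leaf path of x as a bit string identifying its leaf."""
    path, node = "", tree
    while node is not None:
        f, t, lchild, rchild = node
        if x[f] < t:
            path, node = path + "0", lchild
        else:
            path, node = path + "1", rchild
    return path

def ert_distance(x, y, forest):
    """Shared-leaf RF-distance: fraction of trees separating x from y."""
    return sum(leaf_id(tr, x) != leaf_id(tr, y) for tr in forest) / len(forest)

# Build a small ERT forest on random 2-D data.
rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(200)]
forest = [build_tree(data, 2, 6, random.Random(s)) for s in range(100)]
```

Intuitively, points that are close in the underlying geometry are rarely separated by a random split, so they share a leaf in most trees; the paper's bound formalizes this intuition for \(\epsilon \)-close pairs.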
References
Arkadev, A.G., Braverman, E.M.: Teaching Computers to Recognize Patterns. Academic Press, translated from the Russian by W. Turski and J.D. Cowan (1967)
Aryal, S., Ting, K., Washio, T., Haffari, G.: A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min. Knowl. Discov. 34(1), 124–162 (2020)
Aryal, S., Ting, K.M., Haffari, G., Washio, T.: MP-dissimilarity: a data dependent dissimilarity measure. In: 2014 IEEE International Conference on Data Mining, pp. 707–712. IEEE (2014)
Aryal, S., Ting, K.M., Washio, T., Haffari, G.: Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowl. Inf. Syst. 53(2), 479–506 (2017). https://doi.org/10.1007/s10115-017-1046-0
Bicego, M., Escolano, F.: On learning random forests for random forest-clustering. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 3451–3458. IEEE (2021)
Bicego, M., Cicalese, F., Mensi, A.: RatioRF: a novel measure for random forest clustering based on the Tversky’s ratio model. IEEE Trans. Knowl. Data Eng. 35(1), 830–841 (2023)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth (1984)
Breiman, L.: Some infinity theory for predictor ensembles. Tech. Rep. CiteSeer (2000)
Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graph. Vis. 7(2–3), 81–227 (2012)
Davies, A., Ghahramani, Z.: The random forest kernel and other kernels for big data from random partitions. arXiv preprint arXiv:1402.4293 (2014)
Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 406–421 (2018)
Duin, R.P., Pekalska, E.: The Dissimilarity Representation for Pattern Recognition: Foundations and Applications, vol. 64. World Scientific (2005)
Duin, R.: Compactness and complexity of pattern recognition problems. In: Proceedings of the International Symposium on Pattern Recognition “In Memoriam Pierre Devijver”, pp. 124–128. Royal Military Academy (1999)
Emmott, A.F., Das, S., Dietterich, T., Fern, A., Wong, W.K.: Systematic construction of anomaly detection benchmarks from real data. In: Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pp. 16–21 (2013)
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. IEEE (2008)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1–39 (2012)
Mitzenmacher, M., Upfal, E.: Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press (2005)
Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: Advances in Neural Information Processing Systems 19, pp. 985–992 (2006)
Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc. (1993)
Scornet, E.: Random forests and kernel methods. IEEE Trans. Inf. Theory 62(3), 1485–1500 (2016)
Shi, T., Horvath, S.: Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 15(1), 118–138 (2006)
Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008) (2008)
Ting, K.M., Wells, J.R., Washio, T.: Isolation kernel: the X factor in efficient and effective large scale online kernel learning. Data Min. Knowl. Discov. 35(6), 2282–2312 (2021)
Ting, K.M., Xu, B.C., Washio, T., Zhou, Z.H.: Isolation distributional kernel: a new tool for kernel based anomaly detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 198–206 (2020)
Ting, K.M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., Zhou, Z.H.: Lowest probability mass neighbour algorithms: relaxing the metric constraint in distance-based neighbourhood algorithms. Mach. Learn. 108, 331–376 (2019). https://doi.org/10.1007/s10994-018-5737-x
Ting, K.M., Zhu, Y., Zhou, Z.H.: Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2329–2337 (2018)
Ting, K., Zhu, Y., Carman, M., Zhu, Y., Zhou, Z.H.: Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 1205–1214 (2016)
Tversky, A.: Features of similarity. Psychol. Rev. 84(4), 327 (1977)
Wells, J.R., Aryal, S., Ting, K.M.: Simple supervised dissimilarity measure: bolstering iForest-induced similarity with class information without learning. Knowl. Inf. Syst. 62, 3203–3216 (2020)
Zhu, X., Loy, C., Gong, S.: Constructing robust affinity graphs for spectral clustering. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 1450–1457 (2014)
Ethics declarations
Ethical Statement
We do not see any evident ethical implications of our submission. Our paper is mainly theoretical and does not involve the collection or processing of personal data, nor the inference of personal information. We do not see any potential use of our work for military applications.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bicego, M., Cicalese, F. (2023). On the Good Behaviour of Extremely Randomized Trees in Random Forest-Distance Computation. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science(), vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_38
Print ISBN: 978-3-031-43420-4
Online ISBN: 978-3-031-43421-1