Distance-Based Random Forest Clustering with Missing Data

Raniero, Matteo; Bicego, Manuele; Cicalese, Ferdinando

doi:10.1007/978-3-031-06433-3_11

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13233))

Included in the following conference series:

International Conference on Image Analysis and Processing

Abstract

In recent years there has been increased interest in clustering methods based on Random Forests (RF), owing to their flexibility and their capability to describe data. One problem of current RF-clustering approaches is that they cannot deal directly with missing data, a common scenario in many application fields (e.g. Bioinformatics): the usual solution is to pre-impute the incomplete data before running standard clustering methods. In this paper we present the first Random Forest clustering approach able to deal directly with missing data. We start from the very recent RatioRF distance for clustering [3], which has been shown to outperform all other distance-based RF clustering schemes, and extend the framework in two directions that allow missing data mechanisms to be integrated directly inside the clustering pipeline. Experimental results on 6 standard UCI ML datasets are promising, also in comparison with some alternatives from the literature.
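
To make the pre-imputation baseline mentioned above concrete, the following is a minimal sketch of "impute first, then cluster". It is not the method proposed in this paper, and it does not use Random Forests or the RatioRF distance. The function names (`impute_column_means`, `kmeans`), the toy data, and the choice of column-mean imputation followed by plain k-means are all assumptions made for illustration only.

```python
import math

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumption: every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data (hypothetical): None marks a missing feature value.
data = [
    [1.0, 2.0],
    [1.2, None],
    [9.0, 10.0],
    [None, 9.5],
]

completed = impute_column_means(data)  # step 1: pre-impute
labels = kmeans(completed, k=2)        # step 2: standard clustering
```

In this sketch the clustering step only ever sees the imputed values, so any distortion introduced by the imputation is carried into the partition. This is the kind of two-stage pipeline that the approach presented here is designed to avoid.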

Notes

1. Some experiments, not reported here, showed that the empirical results would not change much if we randomly chose one of the two paths.

References

Aryal, S., Ting, K.M., Washio, T., Haffari, G.: A comparative study of data-dependent approaches without learning in measuring similarities of data objects. Data Min. Knowl. Disc. 34(1), 124–162 (2019). https://doi.org/10.1007/s10618-019-00660-0
Bicego, M.: K-random forests: a K-means style algorithm for random forest clustering. In: Proceedings of International Joint Conference on Neural Networks (IJCNN 2019) (2019)
Bicego, M., Cicalese, F., Mensi, A.: RatioRF: a novel measure for random forest clustering based on the Tversky’s ratio model. IEEE Trans. Knowl. Data Eng. (2022, in press). https://doi.org/10.1109/TKDE.2021.3086147
Bicego, M., Escolano, F.: On learning random forests for random forest clustering. In: Proceedings of International Conference on Pattern Recognition, pp. 3451–3458 (2020)
Boluki, S., Dadaneh, S., Qian, X., Dougherty, E.: Optimal clustering with missing values. BMC Bioinform. 20(Suppl. 12), 321 (2019)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Chi, J., Chi, E., Baraniuk, R.: k-POD: a method for k-means clustering of missing data. Am. Stat. 70(1), 91–99 (2016)
Criminisi, A., Shotton, J., Konukoglu, E.: Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graph. Vis. 7(2–3), 81–227 (2012)
Datta, S., Bhattacharjee, S., Das, S.: Clustering with missing features: a penalized dissimilarity measure based approach. Mach. Learn. 107(12), 1987–2025 (2018). https://doi.org/10.1007/s10994-018-5722-4
Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Hathaway, R., Bezdek, J.: Fuzzy c-means clustering of incomplete data. IEEE Trans. Syst. Man Cybern. B (Cybern.) 31(5), 735–744 (2001)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
Jakobsen, J., Gluud, C., Wetterslev, J., Winkel, P.: When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med. Res. Methodol. 17, 162 (2017)
Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: Advances in Neural Information Processing Systems 19, pp. 985–992 (2006)
Perbet, F., Stenger, B., Maki, A.: Random forest clustering and application to video segmentation. In: Proceedings of British Machine Vision Conference, BMVC 2009, pp. 1–10 (2009)
Pigott, T.: A review of methods for missing data. Educ. Res. Eval. 7(4), 353–383 (2001)
Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., Burlington (1993)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Santos, M., Abreu, P., Wilk, S., Santos, J.: How distance metrics influence missing data imputation with k-nearest neighbours. Pattern Recogn. Lett. 136, 111–119 (2020)
Shi, T., Horvath, S.: Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 15(1), 118–138 (2006)
Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR 2008) (2008)
Stekhoven, D., Bühlmann, P.: MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2011)
Sterne, J., et al.: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 338, b2393 (2009)
Ting, K., Zhu, Y., Carman, M., Zhu, Y., Zhou, Z.H.: Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of International Conference on Knowledge Discovery and Data Mining, pp. 1205–1214 (2016)
Troyanskaya, O., et al.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)
Tversky, A.: Features of similarity. Psychol. Rev. 84(4), 327 (1977)
von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Wagstaff, K.: Clustering with missing values: no imputation required. In: Banks, D., McMorris, F.R., Arabie, P., Gaul, W. (eds.) Classification, Clustering, and Data Mining Applications, pp. 649–658. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-642-17103-1_61
Yan, D., Chen, A., Jordan, M.: Cluster forests. Comput. Stat. Data Anal. 66, 178–192 (2013)
Zhu, X., Loy, C., Gong, S.: Constructing robust affinity graphs for spectral clustering. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 1450–1457 (2014)

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.

Author information

Authors and Affiliations

Computer Science Department, University of Verona, Verona, Italy
Matteo Raniero, Manuele Bicego & Ferdinando Cicalese

Corresponding author

Correspondence to Manuele Bicego.

Editor information

Editors and Affiliations

Boston University, Boston, MA, USA
Stan Sclaroff
National Research Council, Lecce, Italy
Cosimo Distante
National Research Council, Lecce, Italy
Marco Leo
University of Catania, Catania, Italy
Giovanni M. Farinella
Technische Universität München, Garching, Germany
Federico Tombari

About this paper

Cite this paper

Raniero, M., Bicego, M., Cicalese, F. (2022). Distance-Based Random Forest Clustering with Missing Data. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13233. Springer, Cham. https://doi.org/10.1007/978-3-031-06433-3_11

DOI: https://doi.org/10.1007/978-3-031-06433-3_11
Published: 15 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06432-6
Online ISBN: 978-3-031-06433-3
eBook Packages: Computer Science, Computer Science (R0)
