Scalable Random Forest with Data-Parallel Computing

Vázquez-Novoa, Fernando; Conejero, Javier; Tatu, Cristian; Badia, Rosa M.

doi:10.1007/978-3-031-39698-4_27

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14100))

Included in the following conference series:

European Conference on Parallel Processing

1470 Accesses
1 Citations

Abstract

In the last years, there has been a significant increment in the quantity of data available and computational resources. This leads scientific and industry communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known algorithm in the ML field due to the good results obtained in a wide range of problems. Our objective is to create a parallel version of the algorithm that can generate a model using data distributed across different processors that computationally scales on available resources. This paper presents two novel proposals for this algorithm with a data-parallel approach. The first version is implemented using the PyCOMPSs framework and its failure management mechanism, while the second variant uses the new PyCOMPSs nesting paradigm where the parallel tasks can generate other tasks within them. Both approaches are compared between them and against MLlib Apache Spark Random Forest with strong and weak scaling tests. Our findings indicate that while the MLlib implementation is faster when executed in a small number of nodes, the scalability of both new variants is superior. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 79.99; Price excludes VAT (USA)

Softcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Azizah, N., Riza, L.S., Wihardi, Y.: Implementation of random forest algorithm with parallel computing in r. J. Phys: Conf. Ser. 1280(2), 022028 (2019). https://doi.org/10.1088/1742-6596/1280/2/022028
Article Google Scholar
Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nature Commun. 5(1), 4308 (2014)
Article Google Scholar
Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11(2) (2010)
Google Scholar
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Cart: Classification and Regression Trees (1984). Wadsworth, Belmont, CA (1993)
Google Scholar
Chen, J., et al.: A parallel random forest algorithm for big data in a spark cloud computing environment. IEEE Trans. Parallel Distrib. Syst. 28(4), 919–933 (2016)
Article Google Scholar
Cid-Fuentes, J.Á., Solà, S., Álvarez, P., Castro-Ginard, A., Badia, R.M.: dislib: Large scale high performance machine learning in python. In: 2019 15th International Conference on eScience (eScience), pp. 96–105. IEEE (2019)
Google Scholar
Ejarque, J., Bertran, M., Cid-Fuentes, J.Á., Conejero, J., Badia, R.M.: Managing failures in task-based parallel workflows in distributed computing environments. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020. LNCS, vol. 12247, pp. 411–425. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57675-2_26
Chapter Google Scholar
Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
Google Scholar
Lordan, F., et al.: ServiceSs: an interoperable programming framework for the cloud. J. Grid Comput. 12(1), 67–91 (2013). https://doi.org/10.1007/s10723-013-9272-5
Article Google Scholar
Lordan, F., Lezzi, D., Badia, R.M.: Colony: parallel functions as a service on the cloud-edge continuum. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 269–284. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_17
Chapter Google Scholar
Meng, X., et al.: Mllib: machine learning in apache spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
MathSciNet MATH Google Scholar
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
MathSciNet MATH Google Scholar
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
Article Google Scholar
Rocklin, M.: Dask: parallel computation with blocked algorithms and task scheduling. In: Proceedings of the 14th Python in Science Conference, no. 130–136. Citeseer (2015)
Google Scholar
Salzberg, S.L.: C4. 5: Programs for machine learning by j. ross quinlan. morgan kaufmann publishers, inc., 1993 (1994)
Google Scholar
Tejedor, E., et al.: Pycompss: parallel computational workflows in python. Int. J. High Perform. Comput. Appl. 31(1), 66–82 (2017)
Article Google Scholar
Van Rossum, G., Drake, F.L.: Python 3 Reference Manual. CreateSpace, Scotts Valley, CA (2009)
Google Scholar
Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664
Article Google Scholar

Download references

Acknowledgements

This work has been supported by the Spanish Government (PID2019-107255GB) and by the MCIN/AEI /10.13039/501100011033 (CEX2021- 001148-S), by the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group MPiEDist (2021 SGR 00412), and by the European Commission’s Horizon 2020 Framework program and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558 and by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (PCI2021-121957, project eFlows4HPC), and by the European Commission through the Horizon Europe Research and Innovation program under Grant Agreement No. 101016577 (AI-Sprint project).

We thank Núria Masclans and Lluís Jofre from the Department of Fluid Mechanics of the Universitat Politècnica de Catalunya for providing the High Pressure Turbulence dataset.

Author information

Authors and Affiliations

Barcelona Supercomputing Center (BSC-CNS), Barcelona, Spain
Fernando Vázquez-Novoa, Javier Conejero, Cristian Tatu & Rosa M. Badia

Authors

Fernando Vázquez-Novoa
View author publications
You can also search for this author in PubMed Google Scholar
Javier Conejero
View author publications
You can also search for this author in PubMed Google Scholar
Cristian Tatu
View author publications
You can also search for this author in PubMed Google Scholar
Rosa M. Badia
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Fernando Vázquez-Novoa .

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, UK
José Cano
University of Cyprus, Nicosia, Cyprus
Marios D. Dikaiakos
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos
Chalmers University of Technology, Gothenburg, Sweden
Miquel Pericàs
University of Manchester, Manchester, UK
Rizos Sakellariou

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Vázquez-Novoa, F., Conejero, J., Tatu, C., Badia, R.M. (2023). Scalable Random Forest with Data-Parallel Computing. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_27

Download citation

DOI: https://doi.org/10.1007/978-3-031-39698-4_27
Published: 24 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Scalable Random Forest with Data-Parallel Computing