Scalable Random Forest with Data-Parallel Computing

  • Conference paper
Euro-Par 2023: Parallel Processing (Euro-Par 2023)

Abstract

In recent years, both the volume of available data and the computational resources to process it have grown significantly, leading the scientific and industrial communities to pursue more accurate and efficient Machine Learning (ML) models. Random Forest is a well-known ML algorithm because it gives good results across a wide range of problems. Our objective is to create a parallel version of the algorithm that can build a model from data distributed across different processors and that scales computationally with the available resources. This paper presents two novel data-parallel proposals for this algorithm. The first is implemented using the PyCOMPSs framework and its failure management mechanism, while the second uses the new PyCOMPSs nesting paradigm, in which parallel tasks can themselves generate other tasks. Both approaches are compared with each other and against the MLlib (Apache Spark) Random Forest in strong and weak scaling tests. Our findings indicate that, while the MLlib implementation is faster on a small number of nodes, both new variants scale better. We conclude that the proposed data-parallel approaches to the Random Forest algorithm can effectively generate accurate and efficient models in a distributed computing environment and offer improved scalability over existing methods.
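For readers unfamiliar with the strong and weak scaling tests mentioned in the abstract, the following is a minimal sketch of how parallel efficiency is commonly computed for each. The function names and the timings are hypothetical, chosen only for illustration; they are not taken from the paper and say nothing about its measured results.

```python
def strong_scaling_efficiency(t_base, t_n, n_base, n):
    """Strong scaling: the total problem size is fixed while nodes increase.

    Ideal behaviour is a run time that shrinks by the factor n / n_base,
    so efficiency is the measured speedup divided by that ideal speedup.
    """
    speedup = t_base / t_n
    ideal_speedup = n / n_base
    return speedup / ideal_speedup


def weak_scaling_efficiency(t_base, t_n):
    """Weak scaling: the problem size grows in proportion to the node count.

    Ideal behaviour is a constant run time, so efficiency is simply the
    ratio of the baseline time to the time measured on more nodes.
    """
    return t_base / t_n


if __name__ == "__main__":
    # Hypothetical timings in seconds, keyed by node count (illustration only).
    strong_times = {1: 800.0, 2: 420.0, 4: 230.0, 8: 140.0}
    weak_times = {1: 100.0, 2: 104.0, 4: 110.0, 8: 125.0}

    for nodes in sorted(strong_times):
        s_eff = strong_scaling_efficiency(strong_times[1], strong_times[nodes], 1, nodes)
        w_eff = weak_scaling_efficiency(weak_times[1], weak_times[nodes])
        print(f"{nodes} node(s): strong efficiency {s_eff:.2f}, weak efficiency {w_eff:.2f}")
```

An efficiency close to 1 indicates near-ideal scaling. An implementation can be faster in absolute time on a few nodes and still show lower efficiency as nodes are added, which is the kind of trade-off the abstract describes between MLlib and the two proposed variants.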

Acknowledgements

This work has been supported by the Spanish Government (PID2019-107255GB) and by MCIN/AEI/10.13039/501100011033 (CEX2021-001148-S); by the Departament de Recerca i Universitats de la Generalitat de Catalunya to the Research Group MPiEDist (2021 SGR 00412); by the European Commission’s Horizon 2020 Framework program and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558, and by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (PCI2021-121957, project eFlows4HPC); and by the European Commission through the Horizon Europe Research and Innovation program under Grant Agreement No. 101016577 (AI-Sprint project).

We thank Núria Masclans and Lluís Jofre from the Department of Fluid Mechanics of the Universitat Politècnica de Catalunya for providing the High Pressure Turbulence dataset.

Author information

Corresponding author

Correspondence to Fernando Vázquez-Novoa.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vázquez-Novoa, F., Conejero, J., Tatu, C., Badia, R.M. (2023). Scalable Random Forest with Data-Parallel Computing. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_27

  • DOI: https://doi.org/10.1007/978-3-031-39698-4_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39697-7

  • Online ISBN: 978-3-031-39698-4

  • eBook Packages: Computer Science, Computer Science (R0)