Mining Big Data with Random Forests

Lulli, Alessandro; Oneto, Luca; Anguita, Davide

doi:10.1007/s12559-018-9615-4

Published: 03 January 2019

Volume 11, pages 294–316 (2019)

Cognitive Computation

Abstract

In the current big data era, naive implementations of well-known learning algorithms cannot efficiently and effectively deal with large datasets. Random forests (RFs) are a popular ensemble-based method for classification. RFs have been shown to be effective in many different real-world classification problems and are commonly considered one of the best learning algorithms in this context. In this paper, we develop an RF implementation called ReForeSt, which, unlike the currently available solutions, can distribute data on available machines in two different ways to optimize the computational and memory requirements of RF with arbitrarily large datasets ranging from millions of samples to millions of features. A recently proposed improved RF formulation called random rotation ensembles can be used in conjunction with model selection to automatically tune the RF hyperparameters. We perform an extensive experimental evaluation on a wide range of large datasets and several environments with different numbers of machines and numbers of cores per machine. Results demonstrate that ReForeSt, in comparison to other state-of-the-art alternatives such as MLlib, is less computationally intensive, more memory efficient, and more effective.

Author information

Authors and Affiliations

DIBRIS Department, University of Genoa, Via Opera Pia 13, I-16145, Genoa, Italy
Alessandro Lulli, Luca Oneto & Davide Anguita

Corresponding author

Correspondence to Luca Oneto.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lulli, A., Oneto, L. & Anguita, D. Mining Big Data with Random Forests. Cogn Comput 11, 294–316 (2019). https://doi.org/10.1007/s12559-018-9615-4

Received: 23 June 2018
Accepted: 12 November 2018
Published: 03 January 2019
Issue Date: 15 April 2019
DOI: https://doi.org/10.1007/s12559-018-9615-4
