Abstract
This paper proposes SRF, a scalable random forest algorithm implemented on MapReduce. The decision trees of the random forest model are grown breadth-first. At each level of the trees, a pair of map and reduce functions splits the nodes. Each mapper is dispatched to a local machine, where it computes the local histograms of the subspace features of the nodes from one data block. The local histograms are submitted to reducers, which compute the global histograms; from these, the best split condition of each node is calculated and sent to the controller on the master machine to update the random forest model. A complete random forest model is thus built by a sequence of map and reduce functions. Experiments on large synthetic data show that SRF scales with both the number of trees and the number of examples. The SRF algorithm builds a random forest of 100 trees in a little more than 1 hour from 110 GB of data with 1000 features and 10 million records.
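One level of the breadth-first procedure described above can be sketched in plain Python, without a real MapReduce runtime. This is only an illustrative sketch, not the authors' implementation: the function names (`route`, `map_block`, `reduce_hists`, `best_splits`), the fixed bin edges shared by all features, the heap-style node numbering, and the use of weighted Gini impurity as the split criterion are all assumptions made for the example.

```python
from collections import Counter

def route(tree, x):
    """Follow the current tree down to the frontier node that x falls into.
    `tree` maps an internal node id to its (feature, threshold) split."""
    node = 0
    while node in tree:
        f, t = tree[node]
        node = 2 * node + 1 if x[f] <= t else 2 * node + 2
    return node

def map_block(block, tree, subspace, edges):
    """Mapper (assumed form): local histogram of one data block.
    Keys are (node, feature, bin, label); values are counts."""
    hist = Counter()
    for x, y in block:
        node = route(tree, x)
        for f in subspace.get(node, ()):
            b = sum(x[f] > e for e in edges)  # index of the bin holding x[f]
            hist[(node, f, b, y)] += 1
    return hist

def reduce_hists(local_hists):
    """Reducer (assumed form): sum local histograms into a global one."""
    total = Counter()
    for h in local_hists:
        total.update(h)
    return total

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_splits(global_hist, edges):
    """Controller step (assumed form): for each frontier node, pick the
    (feature, threshold) with the lowest weighted Gini impurity."""
    by_node_feature = {}
    for (node, f, b, y), c in global_hist.items():
        by_node_feature.setdefault((node, f), {}).setdefault(b, Counter())[y] += c
    best = {}
    for (node, f), bins in by_node_feature.items():
        for i, t in enumerate(edges):
            left, right = Counter(), Counter()
            for b, cnt in bins.items():
                # x <= edges[i] holds exactly when the bin index is <= i
                (left if b <= i else right).update(cnt)
            nl, nr = sum(left.values()), sum(right.values())
            if nl == 0 or nr == 0:
                continue  # degenerate split, nothing separated
            score = (nl * gini(left) + nr * gini(right)) / (nl + nr)
            if node not in best or score < best[node][0]:
                best[node] = (score, f, t)
    return {node: (f, t) for node, (_, f, t) in best.items()}
```

Calling `map_block` once per data block, passing the results to `reduce_hists`, and merging the output of `best_splits` into `tree` corresponds to one map and reduce pair; repeating this level by level grows the tree breadth-first. In this sketch only histograms, never raw records, leave the mapper, which is what keeps the communication cost independent of block size.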
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, B., Chen, X., Li, M.J., Huang, J.Z., Feng, S. (2012). Scalable Random Forests for Massive Data. In: Tan, P.-N., Chawla, S., Ho, C.K., Bailey, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science, vol 7301. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30217-6_12
DOI: https://doi.org/10.1007/978-3-642-30217-6_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30216-9
Online ISBN: 978-3-642-30217-6
eBook Packages: Computer Science (R0)