Abstract
In this paper we introduce PdsCART, a parallel decision tree learning algorithm. Three characteristics make this algorithm particularly interesting. First, the algorithm works with streaming data, i.e. a single pass over the data is sufficient to construct the tree. Second, the algorithm processes data stream records in parallel and can therefore handle very large data sets efficiently. Third, the algorithm can be implemented in the MapReduce framework. Details about the algorithm and some basic performance results are presented.
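The paper itself gives the algorithmic details; the sketch below is only a rough illustration of how such a MapReduce formulation could be organised, not the authors' implementation. Mappers bin each incoming record and emit per-leaf class counts, and reducers merge those counts so that a driver can evaluate candidate splits. The function names, record layout and bin count (`N_BINS`) are assumptions made solely for this illustration.

```python
# Hypothetical sketch, NOT the authors' PdsCART code: a MapReduce-style pass
# that gathers the sufficient statistics a streaming CART learner needs to
# evaluate splits -- class counts per (leaf, attribute, bin) key.
from collections import defaultdict

N_BINS = 10  # assumed number of bins per numerical attribute


def map_record(leaf_id, record, label, attr_ranges):
    """Map phase: emit ((leaf, attribute, bin, class), 1) for one stream record."""
    for attr, value in enumerate(record):
        lo, hi = attr_ranges[attr]
        width = (hi - lo) / N_BINS or 1.0   # guard against a degenerate range
        b = min(int((value - lo) / width), N_BINS - 1)
        yield (leaf_id, attr, b, label), 1


def reduce_counts(pairs):
    """Reduce phase: sum the partial counts emitted by all mappers."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals  # the driver uses these counts to score candidate splits


if __name__ == "__main__":
    # Toy usage: two records routed to the same leaf (id 0).
    ranges = [(0.0, 10.0), (0.0, 1.0)]
    stream = [((3.2, 0.7), "a"), ((8.9, 0.1), "b")]
    emitted = [kv for rec, lab in stream for kv in map_record(0, rec, lab, ranges)]
    print(dict(reduce_counts(emitted)))
```

In an actual deployment the two functions would correspond to mapper and reducer tasks, with the tree grown by the driver once the accumulated statistics satisfy the chosen splitting criterion.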
Notes
- 1. We follow very closely the description of the dsCART algorithm given by Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk and Piotr Duda in [24].
- 2. The standard method of dividing the range of numerical attribute values into bins; a minimal illustration of such a binning step is given after these notes.
- 3. http://archive.ics.uci.edu/ml; for simplicity we considered only the numerical attributes.
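To make note 2 concrete, the fragment below shows one common form of such binning, equal-width discretisation; the default bin count and the use of observed minima and maxima are assumptions made for this sketch only.

```python
# Illustrative equal-width binning of a numerical attribute (assumed 10 bins).
def bin_edges(values, n_bins=10):
    """Boundaries splitting [min(values), max(values)] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0     # avoid zero width if all values coincide
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, edges):
    """Index of the bin that contains value (the last bin is closed on the right)."""
    lo, width = edges[0], edges[1] - edges[0]
    return min(max(int((value - lo) / width), 0), len(edges) - 2)


# Example: values spanning [0, 100] give bins of width 10, so 37.5 falls in bin 3.
print(bin_index(37.5, bin_edges(list(range(101)))))  # -> 3
```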
References
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Chapman & Hall/CRC, New York (1984)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
Shafer, J., Agrawal, R., Mehta, M.: SPRINT: a scalable parallel classifier for data mining. In: Proceedings of the 22nd International Conference on VLDB, pp. 544–555 (1996)
Joshi, M., Karypis, G., Kumar, V.: ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets. In: Proceedings of the 12th International Parallel Processing Symposium, pp. 573–579 (1998)
Sreenivas, M., Alsabti, K., Ranka, S.: Parallel out-of-core divide-and-conquer techniques with applications to classification trees. In: The 10th Symposium on Parallel and Distributed Processing, pp. 555–562 (1999)
Jin, R., Agrawal, G.: Communication and memory efficient parallel decision tree construction. In: Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), pp. 119–129. SIAM (2003)
Ben-Haim, Y., Yom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11, 849–872 (2010)
Srivastava, A., Han, E., Kumar, V., Singh, V.: Parallel formulations of decision-tree classification algorithms. Data Min. Knowl. Discov. 3(3), 237–261 (1999)
Amado, N., Gama, J., Silva, F.: Parallel implementation of decision tree learning algorithms. In: Brazdil, P.B., Jorge, A.M. (eds.) EPIA 2001. LNCS (LNAI), vol. 2258, pp. 6–13. Springer, Heidelberg (2001)
Panda, B., Herbach, J., Basu, S., Bayardo, R.: PLANET: massively parallel learning of tree ensembles with MapReduce. In: Proceedings of VLDB-2009 (2009)
Ye, J., Chow, J.-H., Chen, J., Zheng, Z.: Stochastic gradient boosted distributed decision trees. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 2061–2064 (2009)
Tyree, S., Weinberger, K.Q., Agrawal, K., Paykin, J.: Parallel boosted regression trees for web search ranking. In: Proceedings of the 20th International Conference on World Wide Web, pp. 387–396. ACM (2011)
Li, B., Chen, X., Li, M.J., Huang, J.Z., Feng, S.: Scalable random forests for massive data. In: Tan, P.-N., Chawla, S., Ho, C.K., Bailey, J. (eds.) PAKDD 2012, Part I. LNCS, vol. 7301, pp. 135–146. Springer, Heidelberg (2012)
Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1048–1059 (2014)
Li, X., Barajas, J.M., Ding, Y.: Collaborative filtering on streaming data with interest-drifting. Intell. Data Anal. 11(1), 75–87 (2007)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD Conference, pp. 71–80 (2000)
Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106 (2001)
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavalda, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009)
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: Data Stream Mining: A Practical Approach. University of Waikato, New Zealand (2011)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)
Rutkowski, L., Pietruczuk, L., Duda, P., Jaworski, M.: Decision trees for mining data streams based on the McDiarmid’s bound. IEEE Trans. Knowl. Data Eng. 25, 1272–1279 (2013)
Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: Decision trees for mining data streams based on the Gaussian approximation. IEEE Trans. Knowl. Data Eng. 26, 108–119 (2014)
Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: The CART decision tree for mining data streams. Inf. Sci. 266, 1–15 (2014)
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Calistru, I.T., Cotofrei, P., Stoffel, K. (2015). A Parallel Approach for Decision Trees Learning from Big Data Streams. In: Abramowicz, W. (eds) Business Information Systems. BIS 2015. Lecture Notes in Business Information Processing, vol 208. Springer, Cham. https://doi.org/10.1007/978-3-319-19027-3_1