A Parallel Approach for Decision Trees Learning from Big Data Streams

  • Conference paper
Business Information Systems (BIS 2015)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 208)

Abstract

In this paper we introduce PdsCART, a parallel decision tree learning algorithm. Three characteristics make this algorithm particularly interesting. First, it works with streaming data: one pass over the data is sufficient to construct the tree. Second, it processes a large number of data stream records in parallel and can therefore handle very large data sets efficiently. Third, it can be implemented in the MapReduce framework. We present details of the algorithm and some basic performance results.
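The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the general idea behind the three properties, not PdsCART as published. It assumes that each chunk of the stream is read once (the map step), that the per-chunk class counts can be merged by simple addition (the reduce step), and that a candidate split is then scored with the Gini gain used by CART [2]. All function names, the fixed threshold, and the toy records are hypothetical; the statistical test that dsCART [24] uses to decide when to split is not reproduced here.

```python
from collections import Counter
from functools import reduce


def map_partition(records, threshold):
    """Map step (hypothetical): one pass over a chunk of (value, label) records,
    counting how many records of each class fall on each side of the threshold."""
    counts = Counter()
    for value, label in records:
        side = "left" if value <= threshold else "right"
        counts[(side, label)] += 1
    return counts


def merge_counts(a, b):
    """Reduce step (hypothetical): counts from different chunks simply add up."""
    return a + b


def gini(class_counts):
    """Gini impurity of a class-count table."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts.values())


def gini_gain(counts):
    """Gini gain of the split described by merged (side, label) counts."""
    left, right = Counter(), Counter()
    for (side, label), n in counts.items():
        (left if side == "left" else right)[label] += n
    parent = left + right
    n_total = sum(parent.values())
    if n_total == 0:
        return 0.0
    n_left = sum(left.values())
    n_right = n_total - n_left
    weighted = (n_left * gini(left) + n_right * gini(right)) / n_total
    return gini(parent) - weighted


# Two chunks of a toy stream; each chunk could be handled by a separate worker.
chunks = [
    [(1.0, "a"), (2.0, "a"), (7.0, "b")],
    [(8.0, "b"), (1.5, "a"), (9.0, "b")],
]
partials = [map_partition(chunk, threshold=5.0) for chunk in chunks]
merged = reduce(merge_counts, partials, Counter())
gain = gini_gain(merged)
```

Because the counts are additive, the merged table is the same no matter how the records are distributed across chunks, which is what makes a one-pass, parallel formulation of this kind possible.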

Notes

  1. We follow very closely the description of the dsCART algorithm given by Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk and Piotr Duda in [24].

  2. The standard method of dividing the range of numerical attribute values into bins.

  3. http://archive.ics.uci.edu/ml. For simplicity, we considered only the numerical attributes.

References

  1. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  2. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Chapman & Hall/CRC, New York (1984)

  3. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)

  4. Shafer, J.C., Agrawal, R., Mehta, M.: SPRINT: a scalable parallel classifier for data mining. In: Proceedings of the 22nd International Conference on VLDB, pp. 544–555 (1996)

  5. Joshi, M., Karypis, G., Kumar, V.: ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets. In: Proceedings of the 12th International Parallel Processing Symposium, pp. 573–579 (1998)

  6. Sreenivas, M., Alsabti, K., Ranka, S.: Parallel out-of-core divide-and-conquer techniques with applications to classification trees. In: The 10th Symposium on Parallel and Distributed Processing, pp. 555–562 (1999)

  7. Jin, R., Agrawal, G.: Communication and memory efficient parallel decision tree construction. In: Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), pp. 119–129. SIAM (2003)

  8. Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. J. Mach. Learn. Res. 11, 849–872 (2010)

  9. Srivastava, A., Han, E., Kumar, V., Singh, V.: Parallel formulations of decision-tree classification algorithms. Data Min. Knowl. Discov. 3(3), 237–261 (1999)

  10. Amado, N., Gama, J., Silva, F.: Parallel implementation of decision tree learning algorithms. In: Brazdil, P.B., Jorge, A.M. (eds.) EPIA 2001. LNCS (LNAI), vol. 2258, pp. 6–13. Springer, Heidelberg (2001)

  11. Panda, B., Herbach, J., Basu, S., Bayardo, R.: PLANET: massively parallel learning of tree ensembles with MapReduce. In: Proceedings of VLDB-2009 (2009)

  12. Ye, J., Chow, J.-H., Chen, J., Zheng, Z.: Stochastic gradient boosted distributed decision trees. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 2061–2064 (2009)

  13. Tyree, S., Weinberger, K.Q., Agrawal, K., Paykin, J.: Parallel boosted regression trees for web search ranking. In: Proceedings of the 20th International Conference on World Wide Web, pp. 387–396. ACM (2011)

  14. Li, B., Chen, X., Li, M.J., Huang, J.Z., Feng, S.: Scalable random forests for massive data. In: Tan, P.-N., Chawla, S., Ho, C.K., Bailey, J. (eds.) PAKDD 2012, Part I. LNCS, vol. 7301, pp. 135–146. Springer, Heidelberg (2012)

  15. Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: A new method for data stream mining based on the misclassification error. IEEE Trans. Neural Netw. Learn. Syst. 26(5), 1048–1059 (2014)

  16. Li, X., Barajas, J.M., Ding, Y.: Collaborative filtering on streaming data with interest-drifting. Intell. Data Anal. 11(1), 75–87 (2007)

  17. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD Conference, pp. 71–80 (2000)

  18. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 97–106 (2001)

  19. Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavalda, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009)

  20. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: Data Stream Mining: A Practical Approach. University of Waikato, New Zealand (2011)

  21. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)

  22. Rutkowski, L., Pietruczuk, L., Duda, P., Jaworski, M.: Decision trees for mining data streams based on the McDiarmid’s bound. IEEE Trans. Knowl. Data Eng. 25, 1272–1279 (2013)

  23. Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: Decision trees for mining data streams based on the Gaussian approximation. IEEE Trans. Knowl. Data Eng. 26, 108–119 (2014)

  24. Rutkowski, L., Jaworski, M., Pietruczuk, L., Duda, P.: The CART decision tree for mining data streams. Inf. Sci. 266, 1–15 (2014)

  25. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)

Author information

Corresponding author

Correspondence to Ionel Tudor Calistru.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Calistru, I.T., Cotofrei, P., Stoffel, K. (2015). A Parallel Approach for Decision Trees Learning from Big Data Streams. In: Abramowicz, W. (eds) Business Information Systems. BIS 2015. Lecture Notes in Business Information Processing, vol 208. Springer, Cham. https://doi.org/10.1007/978-3-319-19027-3_1

  • DOI: https://doi.org/10.1007/978-3-319-19027-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19026-6

  • Online ISBN: 978-3-319-19027-3

  • eBook Packages: Computer Science, Computer Science (R0)
