Skip to main content

Mining Decision Trees from Streams

  • Chapter
  • First Online:
Data Stream Management

Part of the book series: Data-Centric Systems and Applications ((DCSA))

Abstract

Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. We present a method that can semi-automatically enhance a wide class of existing learning algorithms so they can learn from such high-speed data streams in real time. The method works by sampling just enough data from the data stream to make each decision required by the learning process. The method is applicable to essentially any induction algorithm based on discrete search. In this chapter, we illustrate the use of our method by applying it to what is perhaps the most widely used form of data mining: decision tree induction.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 54.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 69.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 99.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (Wadsworth, Belmont, 1984)

    MATH  Google Scholar 

  2. J. Catlett, Megainduction: machine learning on very large databases. PhD thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia (1991)

    Google Scholar 

  3. C. Domingo, R. Gavalda, O. Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov. 6, 131–152 (2002)

    Article  MathSciNet  MATH  Google Scholar 

  4. P. Domingos, G. Hulten, A general method for scaling up machine learning algorithms and its application to clustering, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 106–113

    Google Scholar 

  5. P. Domingos, G. Hulten, Learning from infinite data in finite time, in Advances in Neural Information Processing Systems, vol. 14, ed. by T.G. Dietterich, S. Becker, Z. Ghahramani (MIT Press, Cambridge, 2002), pp. 673–680

    Google Scholar 

  6. Y. Freund, Self bounding learning algorithms, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI (Morgan Kaufmann, San Mateo, 1998)

    Google Scholar 

  7. V. Ganti, J. Gehrke, R. Ramakrishnan, DEMON: mining and monitoring evolving data, in Proceedings of the Sixteenth International Conference on Data Engineering, San Diego, CA (2000), pp. 439–448

    Google Scholar 

  8. J. Gehrke, V. Ganti, R. Ramakrishnan, W.-L. Loh, BOAT: optimistic decision tree construction, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA (ACM, New York, 1999), pp. 169–180

    Chapter  Google Scholar 

  9. J. Gratch, Sequential inductive learning, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR (AAAI Press, Menlo Park, 1996), pp. 779–786

    Google Scholar 

  10. R. Greiner, PALO: a probabilistic hill-climbing algorithm. Artif. Intell. 84, 177–208 (1996)

    Article  MathSciNet  Google Scholar 

  11. W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)

    Article  MathSciNet  MATH  Google Scholar 

  12. G. Hulten, P. Domingos, Mining complex models from arbitrarily large databases in constant time, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada (ACM Press, New York, 2002), pp. 525–531

    Chapter  Google Scholar 

  13. G. Hulten, P. Domingos, VFML—a toolkit for mining high-speed time-changing data streams (2003). http://www.cs.washington.edu/dm/vfml/

  14. G. Hulten, P. Domingos, Y. Abe, Mining massive relational databases, in IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, Acapulco, Mexico (2003)

    Google Scholar 

  15. G. Hulten, L. Spencer, P. Domingos, Mining time-changing data streams, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA (ACM, New York, 2001), pp. 97–106

    Chapter  Google Scholar 

  16. O. Maron, A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, in Advances in Neural Information Processing Systems 6, ed. by J.D. Cowan, G. Tesauro, J. Alspector (Morgan Kaufmann, San Mateo, 1994)

    Google Scholar 

  17. M. Mehta, A. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, in Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France (Springer, Berlin, 1996), pp. 18–32

    Google Scholar 

  18. R. Musick, J. Catlett, S. Russell, Decision theoretic subsampling for induction on large databases, in Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA (Morgan Kaufmann, San Mateo, 1993), pp. 212–219

    Google Scholar 

  19. J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, 1993)

    Google Scholar 

  20. T. Scheffer, S. Wrobel, Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 481–488

    Google Scholar 

  21. J.C. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India (Morgan Kaufmann, San Mateo, 1996), pp. 544–555

    Google Scholar 

  22. J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies. Technical report NC-TR-96-053, Department of Computer Science, Royal Holloway, University of London, Egham, UK (1996)

    Google Scholar 

  23. P.E. Utgoff, An improved algorithm for incremental induction of decision trees, in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ (Morgan Kaufmann, San Mateo, 1994), pp. 318–325

    Google Scholar 

  24. A. Wald, Sequential Analysis (Wiley, New York, 1947)

    MATH  Google Scholar 

  25. A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, H. Levy, Organization-based analysis of Web-object sharing and caching, in Proceedings of the Second USENIX Conference on Internet Technologies and Systems, Boulder, CO (1999), pp. 25–36

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Pedro Domingos .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Hulten, G., Domingos, P. (2016). Mining Decision Trees from Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-28608-0_9

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28607-3

  • Online ISBN: 978-3-540-28608-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics