Mining Decision Trees from Streams

Hulten, Geoff; Domingos, Pedro

doi:10.1007/978-3-540-28608-0_9

Geoff Hulten⁶ &
Pedro Domingos⁷

Part of the book series: Data-Centric Systems and Applications ((DCSA))

3457 Accesses
1 Citations

Abstract

Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. We present a method that can semi-automatically enhance a wide class of existing learning algorithms so they can learn from such high-speed data streams in real time. The method works by sampling just enough data from the data stream to make each decision required by the learning process. The method is applicable to essentially any induction algorithm based on discrete search. In this chapter, we illustrate the use of our method by applying it to what is perhaps the most widely used form of data mining: decision tree induction.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 54.99; Price excludes VAT (USA)

Softcover Book: USD 69.99; Price excludes VAT (USA)

Hardcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (Wadsworth, Belmont, 1984)
MATH Google Scholar
J. Catlett, Megainduction: machine learning on very large databases. PhD thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia (1991)
Google Scholar
C. Domingo, R. Gavalda, O. Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov. 6, 131–152 (2002)
Article MathSciNet MATH Google Scholar
P. Domingos, G. Hulten, A general method for scaling up machine learning algorithms and its application to clustering, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 106–113
Google Scholar
P. Domingos, G. Hulten, Learning from infinite data in finite time, in Advances in Neural Information Processing Systems, vol. 14, ed. by T.G. Dietterich, S. Becker, Z. Ghahramani (MIT Press, Cambridge, 2002), pp. 673–680
Google Scholar
Y. Freund, Self bounding learning algorithms, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI (Morgan Kaufmann, San Mateo, 1998)
Google Scholar
V. Ganti, J. Gehrke, R. Ramakrishnan, DEMON: mining and monitoring evolving data, in Proceedings of the Sixteenth International Conference on Data Engineering, San Diego, CA (2000), pp. 439–448
Google Scholar
J. Gehrke, V. Ganti, R. Ramakrishnan, W.-L. Loh, BOAT: optimistic decision tree construction, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA (ACM, New York, 1999), pp. 169–180
Chapter Google Scholar
J. Gratch, Sequential inductive learning, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR (AAAI Press, Menlo Park, 1996), pp. 779–786
Google Scholar
R. Greiner, PALO: a probabilistic hill-climbing algorithm. Artif. Intell. 84, 177–208 (1996)
Article MathSciNet Google Scholar
W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)
Article MathSciNet MATH Google Scholar
G. Hulten, P. Domingos, Mining complex models from arbitrarily large databases in constant time, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada (ACM Press, New York, 2002), pp. 525–531
Chapter Google Scholar
G. Hulten, P. Domingos, VFML—a toolkit for mining high-speed time-changing data streams (2003). http://www.cs.washington.edu/dm/vfml/
G. Hulten, P. Domingos, Y. Abe, Mining massive relational databases, in IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, Acapulco, Mexico (2003)
Google Scholar
G. Hulten, L. Spencer, P. Domingos, Mining time-changing data streams, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA (ACM, New York, 2001), pp. 97–106
Chapter Google Scholar
O. Maron, A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, in Advances in Neural Information Processing Systems 6, ed. by J.D. Cowan, G. Tesauro, J. Alspector (Morgan Kaufmann, San Mateo, 1994)
Google Scholar
M. Mehta, A. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, in Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France (Springer, Berlin, 1996), pp. 18–32
Google Scholar
R. Musick, J. Catlett, S. Russell, Decision theoretic subsampling for induction on large databases, in Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA (Morgan Kaufmann, San Mateo, 1993), pp. 212–219
Google Scholar
J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, 1993)
Google Scholar
T. Scheffer, S. Wrobel, Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 481–488
Google Scholar
J.C. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India (Morgan Kaufmann, San Mateo, 1996), pp. 544–555
Google Scholar
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies. Technical report NC-TR-96-053, Department of Computer Science, Royal Holloway, University of London, Egham, UK (1996)
Google Scholar
P.E. Utgoff, An improved algorithm for incremental induction of decision trees, in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ (Morgan Kaufmann, San Mateo, 1994), pp. 318–325
Google Scholar
A. Wald, Sequential Analysis (Wiley, New York, 1947)
MATH Google Scholar
A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, H. Levy, Organization-based analysis of Web-object sharing and caching, in Proceedings of the Second USENIX Conference on Internet Technologies and Systems, Boulder, CO (1999), pp. 25–36
Google Scholar

Download references

Author information

Authors and Affiliations

Microsoft Research, Redmond, WA, USA
Geoff Hulten
Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
Pedro Domingos

Authors

Geoff Hulten
View author publications
You can also search for this author in PubMed Google Scholar
Pedro Domingos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Pedro Domingos .

Editor information

Editors and Affiliations

University Campus - Kounoupidiana, School of ECE, Techn. Univ. of Crete University Campus - Kounoupidiana, Chania, Greece
Minos Garofalakis
Microsoft Corporation, Redmond, Washington, USA
Johannes Gehrke
Amazon India , Bangalore, India
Rajeev Rastogi

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Hulten, G., Domingos, P. (2016). Mining Decision Trees from Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_9

Download citation

DOI: https://doi.org/10.1007/978-3-540-28608-0_9
Published: 12 July 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics