Skip to main content

A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees

  • Conference paper
Emerging Technologies in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4819))

Included in the following conference series:

Abstract

One of most important algorithms for mining data streams is VFDT. It uses Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. Gama et al. have extended VFDT in two directions. Their system VFDTc can deal with continuous data and use more powerful classification techniques at tree leaves. In this paper, we revisit this problem and implemented a system VFDTt on top of VFDT and VFDTc. We make the following three contributions: 1) we present a threaded binary search trees (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree, and its processing time for values inserting is O(nlogn), while VFDT‘s processing time is O(n 2 ). When a new example arrives, VFDTc need update O(logn) attribute tree nodes, but VFDTt just need update one necessary node.2) we improve the method of getting the best split-test point of a given continuous attribute. Comparing to the method used in VFDTc, it improves fromO(nlogn) to O (n) in processing time. 3) Comparing to VFDTc, VFDTt‘s candidate split-test number decrease fromO(n) to O(logn).Comparing to VFDT, the most relevant property of our system is an average reduction of 25.53% in processing time, while keep the same tree size and accuracy. Overall, the techniques introduced here significantly improve the efficiency of decision tree classification on data streams.

This work was supported by the National Science Foundation of China under Grants No. 60573057, 60473057 and 90604007.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Babcock, B., Babu, S., Datar, M., Motawani, R., Widom, J.: Models and Issues in Data Stream Systems. In: PODS (2002)

    Google Scholar 

  2. Domingos, P., Hulten, G.: Mining High-Speed Data Streams. In: Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, pp. 71–80 (2000)

    Google Scholar 

  3. Mehta, M., Agrawal, A., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Proceedings of The Fifth International Conference on Extending Database Technology, Avignon, France, pp. 18–32 (1996)

    Google Scholar 

  4. Fan, W.: StreamMiner: A Classifier Ensemble-based Engine to Mine Concept Drifting Data Streams. In: VLDB 2004

    Google Scholar 

  5. Gama, J., Rocha, R., Medas, P.: Accurate Decision Trees for Mining High-Speed Data Streams. In: Domingos, P., Faloutsos, C. (eds.) Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining, ACM Press, New York (2003)

    Google Scholar 

  6. Hulten, G., Spencer, L., Domingos, P.: Mining Time-Changing Data Streams. In: ACM SIGKDD (2001)

    Google Scholar 

  7. Jin, R., Agrawal, G.: Efficient Decision Tree Construction on Streaming Data. In: Proceedings of ACM SIGKDD (2003)

    Google Scholar 

  8. Last, M.: Online Classification of Nonstationary Data Streams. Intelligent Data Analysis 6(2), 129–147 (2002)

    MATH  MathSciNet  Google Scholar 

  9. Muthukrishnan, S.: Data streams: Algorithms and Applications. In: Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms (2003)

    Google Scholar 

  10. Wang, H., Fan, W., Yu, P., Han, J.: Mining Concept-Drifting Data Streams using Ensemble Classifiers. In: 9th ACM International Conference on Knowledge Discovery and Data Mining, Washington DC, USA. SIGKDD (2003)

    Google Scholar 

  11. Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J., Widom, J.: STREAM: The Stanford Stream Data Manager Demonstration Description –Short Overview of System Status and Plans. In: Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003) (June 2003)

    Google Scholar 

  12. Aggarwal, C., Han, J., Wang, J., Yu, P.S.: On Demand Classification of Data Streams. In: Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD 2004), Seattle, WA (2004)

    Google Scholar 

  13. Guetova, M., Holldobter, S.H.-P.: Incremental Fuzzy Decision Trees. In: Jarke, M., Koehler, J., Lakemeyer, G. (eds.) KI 2002. LNCS (LNAI), vol. 2479, Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  14. Ben-David, S., Gehrke, J., Kifer, D.: Detecting Change in Data Streams. In: Proceedings of VLDB 2004

    Google Scholar 

  15. Aggarwal, C.: A Framework for Diagnosing Changes in Evolving Data Streams. In: Proceedings of the ACM SIGMOD Conference (2003)

    Google Scholar 

  16. Gaber, M.M., Zaslavskey, A., Krishnaswamy, S.: Mining Data Streams: a Review. SIGMOD Record 34(2) (June 2005)

    Google Scholar 

  17. Cezary, Z.J.: Fuzzy Decision Trees: Issues and Methods. IEEE Transactions on Systems, Man, and Cybernetics 28(1), 1–14 (1998)

    Google Scholar 

  18. Utgoff, P.E.: Incremental Induction of Decision Trees. Machine Learning 4(2), 161–186 (1989)

    Article  Google Scholar 

  19. Xie, Q.H.: An Efficient Approach for Mining Concept-Drifting Data Streams, Master Thesis

    Google Scholar 

  20. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)

    Google Scholar 

  21. Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58, 13–30 (1963)

    Article  MATH  MathSciNet  Google Scholar 

  22. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees, Wadsworth, Belmont, CA (1984)

    Google Scholar 

  23. Maron, O., Moore, A.: Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing System (1994)

    Google Scholar 

  24. Kelly, M.G., Hand, D.J., Adams, N.M.: The Impact of Changing Populations on Classifier Performance. In: Proc. of KDD-99, pp. 367–371 (1999)

    Google Scholar 

  25. Black, M., Hickey, R.J.: Maintaining the Performance of a Learned Classifier under Concept Drift. Intelligent Data Analysis 3, 453–474 (1999)

    Article  Google Scholar 

  26. Maimon, O., Last, M.: Knowledge Discovery and Data Mining,the Info-Fuzzy Network(IFN) Methodology. Kluwer Academic Publishers, Dordrecht (2000)

    MATH  Google Scholar 

  27. Fayyad, U.M, Irani, K.B.: On the Handling of Continuous-valued Attributes in Decision Tree Generation. Machine Learning 8, 87–102 (1992)

    MATH  Google Scholar 

  28. Wang, T., Li, Z., Yan, Y., Chen, H.: An Efficient Classification System Based on Binary Search Trees for Data Streams Mining. In: ICONS (2007)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Takashi Washio Zhi-Hua Zhou Joshua Zhexue Huang Xiaohua Hu Jinyan Li Chao Xie Jieyue He Deqing Zou Kuan-Ching Li Mário M. Freire

Rights and permissions

Reprints and permissions

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, T., Li, Z., Hu, X., Yan, Y., Chen, H. (2007). A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science(), vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-77018-3_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77016-9

  • Online ISBN: 978-3-540-77018-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics