Abstract
VFDT is one of the most important algorithms for mining data streams. It uses the Hoeffding inequality to obtain a probabilistic bound on the accuracy of the constructed tree. Gama et al. extended VFDT in two directions: their system, VFDTc, can handle continuous data and use more powerful classification techniques at the tree leaves. In this paper, we revisit this problem and implement a system, VFDTt, on top of VFDT and VFDTc. We make the following three contributions. 1) We present a threaded binary search tree (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree whose processing time for inserting values is O(n log n), whereas VFDT's processing time is O(n^2). When a new example arrives, VFDTc needs to update O(log n) attribute tree nodes, but VFDTt needs to update only the one necessary node. 2) We improve the method for finding the best split-test point of a given continuous attribute. Compared with the method used in VFDTc, the processing time improves from O(n log n) to O(n). 3) Compared with VFDTc, the number of candidate split tests in VFDTt decreases from O(n) to O(log n). Compared with VFDT, the most relevant property of our system is an average reduction of 25.53% in processing time, while keeping the same tree size and accuracy. Overall, the techniques introduced here significantly improve the efficiency of decision tree classification on data streams.
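The two ideas in contributions 1) and 2) can be illustrated with a minimal Python sketch. It is not the authors' VFDTt implementation. The names `Node`, `ThreadedBST`, `insert`, and `best_split` are hypothetical, and the use of information gain as the split measure and of an unbalanced tree are assumptions made for illustration. Each node stores class counts for one distinct attribute value, so an arriving example changes the counts of exactly one node, and a successor thread lets the split search visit values in sorted order in a single linear pass. The sketch evaluates every distinct value as a candidate, so it does not reproduce the candidate reduction described in contribution 3).

```python
import math
from collections import Counter


class Node:
    """One distinct attribute value with its per-class example counts."""
    __slots__ = ("value", "counts", "left", "right", "succ")

    def __init__(self, value, label):
        self.value = value
        self.counts = Counter({label: 1})
        self.left = None
        self.right = None
        self.succ = None  # thread to the in-order successor


class ThreadedBST:
    def __init__(self):
        self.root = None
        self.first = None       # node holding the smallest value
        self.total = Counter()  # class counts over all examples

    def insert(self, value, label):
        """Descend once; only the matching (or newly created) node changes."""
        self.total[label] += 1
        if self.root is None:
            self.root = self.first = Node(value, label)
            return
        node, pred = self.root, None  # pred: last node where we turned right
        while True:
            if value == node.value:
                node.counts[label] += 1
                return
            if value < node.value:
                if node.left is None:
                    new = Node(value, label)
                    node.left = new
                    new.succ = node
                    if pred is None:
                        self.first = new   # new minimum
                    else:
                        pred.succ = new    # predecessor now points at new
                    return
                node = node.left
            else:
                pred = node
                if node.right is None:
                    new = Node(value, label)
                    node.right = new
                    new.succ = node.succ
                    node.succ = new
                    return
                node = node.right


def entropy(counts, n):
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)


def best_split(tree):
    """One pass along the threads; returns (cut, gain) for the test 'value <= cut'."""
    n = sum(tree.total.values())
    base = entropy(tree.total, n)
    left, n_left = Counter(), 0
    best_cut, best_gain = None, 0.0
    node = tree.first
    while node is not None and node.succ is not None:  # largest value is not a cut
        left.update(node.counts)
        n_left += sum(node.counts.values())
        right, n_right = tree.total - left, n - n_left
        gain = (base
                - n_left / n * entropy(left, n_left)
                - n_right / n * entropy(right, n_right))
        if gain > best_gain:
            best_cut, best_gain = node.value, gain
        node = node.succ
    return best_cut, best_gain
```

For example, after inserting the pairs (5, 'a'), (2, 'a'), (8, 'b'), (7, 'b'), (2, 'a'), (9, 'b'), the threads visit the values in the order 2, 5, 7, 8, 9 and `best_split` selects the cut 5, which separates the two classes. Because this sketch does not rebalance the tree, an insertion costs O(log n) only on average for randomly ordered input. Each step of the split pass also costs time proportional to the number of classes.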
This work was supported by the National Science Foundation of China under Grants No. 60573057, 60473057 and 90604007.
References
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and Issues in Data Stream Systems. In: PODS (2002)
Domingos, P., Hulten, G.: Mining High-Speed Data Streams. In: Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, pp. 71–80 (2000)
Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France, pp. 18–32 (1996)
Fan, W.: StreamMiner: A Classifier Ensemble-based Engine to Mine Concept Drifting Data Streams. In: VLDB (2004)
Gama, J., Rocha, R., Medas, P.: Accurate Decision Trees for Mining High-Speed Data Streams. In: Domingos, P., Faloutsos, C. (eds.) Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining, ACM Press, New York (2003)
Hulten, G., Spencer, L., Domingos, P.: Mining Time-Changing Data Streams. In: ACM SIGKDD (2001)
Jin, R., Agrawal, G.: Efficient Decision Tree Construction on Streaming Data. In: Proceedings of ACM SIGKDD (2003)
Last, M.: Online Classification of Nonstationary Data Streams. Intelligent Data Analysis 6(2), 129–147 (2002)
Muthukrishnan, S.: Data streams: Algorithms and Applications. In: Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms (2003)
Wang, H., Fan, W., Yu, P., Han, J.: Mining Concept-Drifting Data Streams using Ensemble Classifiers. In: 9th ACM International Conference on Knowledge Discovery and Data Mining, Washington DC, USA. SIGKDD (2003)
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J., Widom, J.: STREAM: The Stanford Stream Data Manager Demonstration Description –Short Overview of System Status and Plans. In: Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003) (June 2003)
Aggarwal, C., Han, J., Wang, J., Yu, P.S.: On Demand Classification of Data Streams. In: Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD 2004), Seattle, WA (2004)
Guetova, M., Hölldobler, S., Störr, H.-P.: Incremental Fuzzy Decision Trees. In: Jarke, M., Koehler, J., Lakemeyer, G. (eds.) KI 2002. LNCS (LNAI), vol. 2479, Springer, Heidelberg (2002)
Ben-David, S., Gehrke, J., Kifer, D.: Detecting Change in Data Streams. In: Proceedings of VLDB (2004)
Aggarwal, C.: A Framework for Diagnosing Changes in Evolving Data Streams. In: Proceedings of the ACM SIGMOD Conference (2003)
Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining Data Streams: A Review. SIGMOD Record 34(2) (June 2005)
Janikow, C.Z.: Fuzzy Decision Trees: Issues and Methods. IEEE Transactions on Systems, Man, and Cybernetics 28(1), 1–14 (1998)
Utgoff, P.E.: Incremental Induction of Decision Trees. Machine Learning 4(2), 161–186 (1989)
Xie, Q.H.: An Efficient Approach for Mining Concept-Drifting Data Streams, Master Thesis
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58, 13–30 (1963)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees, Wadsworth, Belmont, CA (1984)
Maron, O., Moore, A.: Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems (1994)
Kelly, M.G., Hand, D.J., Adams, N.M.: The Impact of Changing Populations on Classifier Performance. In: Proc. of KDD-99, pp. 367–371 (1999)
Black, M., Hickey, R.J.: Maintaining the Performance of a Learned Classifier under Concept Drift. Intelligent Data Analysis 3, 453–474 (1999)
Maimon, O., Last, M.: Knowledge Discovery and Data Mining, the Info-Fuzzy Network (IFN) Methodology. Kluwer Academic Publishers, Dordrecht (2000)
Fayyad, U.M., Irani, K.B.: On the Handling of Continuous-valued Attributes in Decision Tree Generation. Machine Learning 8, 87–102 (1992)
Wang, T., Li, Z., Yan, Y., Chen, H.: An Efficient Classification System Based on Binary Search Trees for Data Streams Mining. In: ICONS (2007)
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, T., Li, Z., Hu, X., Yan, Y., Chen, H. (2007). A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_27
DOI: https://doi.org/10.1007/978-3-540-77018-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science, Computer Science (R0)