Clustering feature decision trees for semi-supervised classification from high-speed data streams

Xu, Wen-hua; Qin, Zheng; Chang, Yang

doi:10.1631/jzus.C1000330

Published: 02 August 2011

Volume 12, pages 615–628, (2011)

Journal of Zhejiang University SCIENCE C

Abstract

Most data stream classification algorithms use supervised learning, which requires large amounts of labeled data. Such approaches are often impractical because labeled data are usually hard to obtain. In this paper, we build a clustering feature decision tree model, CFDT, from data streams that contain unlabeled examples together with a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to produce the statistical summaries needed for incremental decision tree induction. Micro-clusters also serve as classifiers in the tree leaves, which improves classification accuracy and reinforces the any-time property. Our experiments on synthetic and real-world datasets show that CFDT scales well to data streams while achieving high classification accuracy at high speed.
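The abstract does not give the details of CFDT, so the following is only a minimal sketch of the general idea it names: a one-pass micro-clustering summary whose clusters double as a simple classifier. It assumes BIRCH-style clustering features (count, linear sum, squared sum), a fixed absorption radius, and a majority-label rule; the class names, the `radius` parameter, and the label-counting scheme are illustrative assumptions, not the authors' method, and the decision tree induction itself is not shown.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MicroCluster:
    """Assumed BIRCH-style clustering feature: count, linear sum, squared sum."""
    n: int = 0
    ls: List[float] = field(default_factory=list)   # per-dimension sum of values
    ss: List[float] = field(default_factory=list)   # per-dimension sum of squares
    label_counts: Dict[str, int] = field(default_factory=dict)

    def add(self, x: List[float], label: Optional[str] = None) -> None:
        """Absorb one example; unlabeled examples only update the statistics."""
        if self.n == 0:
            self.ls = [0.0] * len(x)
            self.ss = [0.0] * len(x)
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.n += 1
        if label is not None:
            self.label_counts[label] = self.label_counts.get(label, 0) + 1

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def majority_label(self) -> Optional[str]:
        if not self.label_counts:
            return None
        return max(self.label_counts, key=self.label_counts.get)


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class MicroClusterSummary:
    """Hypothetical one-pass summary: each example is seen exactly once."""

    def __init__(self, radius: float) -> None:
        self.radius = radius            # assumed absorption threshold
        self.clusters: List[MicroCluster] = []

    def update(self, x: List[float], label: Optional[str] = None) -> None:
        """Absorb x into the nearest cluster, or open a new one if too far."""
        nearest = None
        if self.clusters:
            nearest = min(self.clusters, key=lambda c: distance(c.centroid(), x))
        if nearest is None or distance(nearest.centroid(), x) > self.radius:
            nearest = MicroCluster()
            self.clusters.append(nearest)
        nearest.add(x, label)

    def predict(self, x: List[float]) -> Optional[str]:
        """Return the majority label of the nearest cluster that has any label."""
        labeled = [c for c in self.clusters if c.label_counts]
        if not labeled:
            return None
        best = min(labeled, key=lambda c: distance(c.centroid(), x))
        return best.majority_label()


# A tiny stream mixing labeled and unlabeled examples.
summary = MicroClusterSummary(radius=1.0)
for point, label in [([0.0, 0.0], "a"), ([0.1, 0.1], None),
                     ([5.0, 5.0], "b"), ([5.1, 4.9], None)]:
    summary.update(point, label)
```

Because each cluster stores only sums and counts, an update costs time proportional to the number of clusters rather than the number of examples seen, which is what makes a single scan feasible in this sketch. Unlabeled examples still refine the centroids, so they influence which labeled cluster a later query falls closest to.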

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Wen-hua Xu
School of Software, Tsinghua University, Beijing, 100084, China
Zheng Qin & Yang Chang

Corresponding author

Correspondence to Zheng Qin.

Additional information

Project supported by the National Natural Science Foundation of China (No. 60673024) and the “Eleventh Five” Preliminary Research Project of PLA (No. 102060206)

About this article

Cite this article

Xu, Wh., Qin, Z. & Chang, Y. Clustering feature decision trees for semi-supervised classification from high-speed data streams. J. Zhejiang Univ. - Sci. C 12, 615–628 (2011). https://doi.org/10.1631/jzus.C1000330

Received: 25 September 2010
Accepted: 09 March 2011
Published: 02 August 2011
Issue Date: August 2011
DOI: https://doi.org/10.1631/jzus.C1000330

CLC number

TP391
