Clustering feature decision trees for semi-supervised classification from high-speed data streams

Journal of Zhejiang University SCIENCE C

Abstract

Most stream data classification algorithms follow a supervised learning strategy, which requires large amounts of labeled data. Such approaches are impractical because labeled data are usually hard to obtain in practice. In this paper, we build a clustering feature decision tree model, CFDT, from data streams that contain both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide the statistical summaries of the data needed for incremental decision tree induction. Micro-clusters also serve as classifiers in the tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while achieving high classification accuracy at high speed.
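The abstract does not spell out CFDT's data structures, so the following is only a minimal sketch of the general idea it describes: a one-pass micro-clustering step that keeps statistical summaries and lets labeled examples turn clusters into simple classifiers. It assumes a BIRCH-style clustering feature (count, linear sum, squared sum) and a fixed absorption radius. The names `MicroCluster`, `update`, `classify`, and `max_radius` are hypothetical and not taken from the paper, and the sketch does not build the decision tree itself.

```python
import math


class MicroCluster:
    """Hypothetical BIRCH-style summary: count, linear sum, squared sum, label counts."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # per-dimension linear sum
        self.ss = [0.0] * dim   # per-dimension squared sum
        self.label_counts = {}  # counts of the (few) labeled examples absorbed

    def absorb(self, x, label=None):
        # Update the summary incrementally; the raw example is not stored.
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        if label is not None:
            self.label_counts[label] = self.label_counts.get(label, 0) + 1

    def centroid(self):
        return [s / self.n for s in self.ls]

    def majority_label(self):
        if not self.label_counts:
            return None
        return max(self.label_counts, key=self.label_counts.get)


def update(clusters, x, label=None, max_radius=1.0):
    """Single-scan update: absorb x into the nearest cluster or start a new one.

    max_radius is an assumed threshold, not a parameter from the paper.
    """
    nearest = min(clusters, key=lambda c: math.dist(c.centroid(), x), default=None)
    if nearest is None or math.dist(nearest.centroid(), x) > max_radius:
        nearest = MicroCluster(len(x))
        clusters.append(nearest)
    nearest.absorb(x, label)
    return nearest


def classify(clusters, x):
    """Predict with the majority label of the nearest cluster that has any labels."""
    labeled = [c for c in clusters if c.label_counts]
    if not labeled:
        return None
    return min(labeled, key=lambda c: math.dist(c.centroid(), x)).majority_label()


# A tiny stream in which only some examples carry labels.
clusters = []
stream = [((0.0, 0.0), "a"), ((0.2, 0.1), None), ((5.0, 5.0), "b"), ((5.1, 4.9), None)]
for point, label in stream:
    update(clusters, point, label)
```

Because each cluster stores only sums and counts, every example is processed once and then discarded, and a prediction can be requested at any point in the stream. This is the sense in which such summaries support one-scan, any-time operation.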

Author information

Correspondence to Zheng Qin.

Additional information

Project supported by the National Natural Science Foundation of China (No. 60673024) and the “Eleventh Five” Preliminary Research Project of PLA (No. 102060206).

About this article

Cite this article

Xu, Wh., Qin, Z. & Chang, Y. Clustering feature decision trees for semi-supervised classification from high-speed data streams. J. Zhejiang Univ. - Sci. C 12, 615–628 (2011). https://doi.org/10.1631/jzus.C1000330

  • DOI: https://doi.org/10.1631/jzus.C1000330
