Abstract
Most stream data classification algorithms rely on supervised learning, which requires large amounts of labeled data. Such approaches are often impractical because labeled data are usually hard to obtain in practice. In this paper, we build a clustering feature decision tree model, CFDT, from data streams that contain unlabeled examples together with a small number of labeled ones. CFDT applies a micro-clustering algorithm that scans the data only once and provides the statistical summaries needed for incremental decision tree induction. Micro-clusters also serve as classifiers in the tree leaves, which improves classification accuracy and reinforces the anytime property. Our experiments on synthetic and real-world datasets show that CFDT scales well to data streams while achieving high classification accuracy at high speed.
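To make the idea of a one-pass statistical summary concrete, the following is a minimal sketch of a micro-cluster, assuming the classic clustering feature triple (count, linear sum, squared sum) suggested by the keyword "Clustering feature vector". The abstract does not specify CFDT's actual data structure, so the class name `MicroCluster`, the per-cluster label counter, and the helper `nearest` are all hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


class MicroCluster:
    """Hypothetical micro-cluster: a clustering feature (n, ls, ss) plus label counts."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0                    # number of absorbed examples
        self.ls = [0.0] * dim         # per-attribute linear sum
        self.ss = [0.0] * dim         # per-attribute sum of squares
        self.labels = Counter()       # counts of the few labeled examples seen

    def add(self, x, label=None):
        """Absorb one example in O(dim) time; the raw example is not stored."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        if label is not None:         # unlabeled examples update only the summary
            self.labels[label] += 1

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the absorbed examples from the centroid."""
        c = self.centroid()
        var = sum(self.ss[i] / self.n - c[i] ** 2 for i in range(self.dim))
        return math.sqrt(max(var, 0.0))   # guard against tiny negative rounding

    def majority_label(self):
        """Label used when this micro-cluster acts as a classifier (assumption)."""
        return self.labels.most_common(1)[0][0] if self.labels else None


def nearest(clusters, x):
    """Return the micro-cluster whose centroid is closest to x."""
    return min(clusters, key=lambda c: math.dist(c.centroid(), x))
```

Under these assumptions, an unseen example could be classified at a leaf by calling `nearest(clusters, x).majority_label()`. Because each update touches only the three summary fields, the stream is scanned once and memory use does not grow with the number of examples.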
Additional information
Project supported by the National Natural Science Foundation of China (No. 60673024) and the “Eleventh Five” Preliminary Research Project of PLA (No. 102060206)
About this article
Cite this article
Xu, Wh., Qin, Z. & Chang, Y. Clustering feature decision trees for semi-supervised classification from high-speed data streams. J. Zhejiang Univ. - Sci. C 12, 615–628 (2011). https://doi.org/10.1631/jzus.C1000330
DOI: https://doi.org/10.1631/jzus.C1000330
Key words
- Clustering feature vector
- Decision tree
- Semi-supervised learning
- Stream data classification
- Very fast decision tree