
Mining Recurring Concept Drifts with Limited Labeled Streaming Data

Published: 01 February 2012

Abstract

Tracking recurring concept drifts is a significant issue in machine learning and data mining that frequently arises in real-world stream classification problems. Learning recurring concepts from a data stream that is largely unlabeled is a challenge for many streaming classification algorithms, and one that has received little attention from the research community. Motivated by this challenge, this article addresses the problem of recurring concepts in streaming environments with limited labeled data. We propose a semi-supervised classification algorithm for data streams with REcurring concept Drifts and Limited LAbeled data, called REDLLA, which adopts a decision tree as its classification model. As the tree grows, a clustering algorithm based on k-means is applied at the leaves to produce concept clusters, and unlabeled data at the leaves are labeled using a majority-class strategy. By measuring deviations between historical and new concept clusters, potential concept drifts are detected and recurring concepts are maintained. Extensive studies on both synthetic and real-world data confirm the advantages of our REDLLA algorithm over three state-of-the-art online classification algorithms (CVFDT, DWCDS, and CDRDT) and several known online semi-supervised algorithms, even when more than 90% of the data are unlabeled.
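The leaf-level step described in the abstract, clustering the instances buffered at a decision-tree leaf and labeling unlabeled instances by the majority class of their cluster, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the helper names (`kmeans`, `pseudo_label_leaf`) and the use of plain Lloyd's k-means are assumptions, and REDLLA's actual procedure operates inside a growing decision tree.

```python
# Hypothetical sketch of majority-class pseudo-labeling at a tree leaf.
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """A tiny Lloyd's k-means: returns (centroids, cluster assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assign

def pseudo_label_leaf(labeled, unlabeled, k=2):
    """labeled: list of (point, class); unlabeled: list of points.
    Returns a pseudo-label for each unlabeled point: the majority class
    of the labeled instances that fall in the same cluster."""
    points = [p for p, _ in labeled] + list(unlabeled)
    _, assign = kmeans(points, k)
    majority = {}
    for c in range(k):
        votes = Counter(y for i, (_, y) in enumerate(labeled) if assign[i] == c)
        majority[c] = votes.most_common(1)[0][0] if votes else None
    n = len(labeled)
    return [majority[assign[n + j]] for j in range(len(unlabeled))]

labeled = [((0.0, 0.0), "neg"), ((0.2, 0.1), "neg"), ((5.0, 5.0), "pos")]
unlabeled = [(0.1, 0.2), (4.8, 5.1)]
print(pseudo_label_leaf(labeled, unlabeled))  # → ['neg', 'pos']
```

In the full algorithm, the cluster centroids produced at a leaf would also be compared against stored historical centroids: a new cluster whose centroid deviates beyond a threshold signals a potential concept drift, while a close match to an old centroid indicates a recurring concept.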




Published In

ACM Transactions on Intelligent Systems and Technology  Volume 3, Issue 2
February 2012
455 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/2089094

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2012
Accepted: 01 July 2011
Revised: 01 May 2011
Received: 01 September 2010
Published in TIST Volume 3, Issue 2


Author Tags

  1. Data stream
  2. clustering
  3. concept drift
  4. decision tree

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) "Addressing Intermediate Verification Latency in Online Learning Through Immediate Pseudo-labeling and Oriented Synthetic Correction." 2024 International Joint Conference on Neural Networks (IJCNN), 1-10. DOI: 10.1109/IJCNN60899.2024.10650783
  • (2023) "Classification in Dynamic Data Streams With a Scarcity of Labels." IEEE Transactions on Knowledge and Data Engineering 35(4), 3512-3524. DOI: 10.1109/TKDE.2021.3135755
  • (2023) "Experimenting with Supervised Drift Detectors in Semi-supervised Learning." 2023 IEEE Symposium Series on Computational Intelligence (SSCI), 730-735. DOI: 10.1109/SSCI52147.2023.10371933
  • (2021) "Probabilistic exact adaptive random forest for recurrent concepts in data streams." International Journal of Data Science and Analytics. DOI: 10.1007/s41060-021-00273-1
  • (2020) "GCI: A GPU Based Transfer Learning Approach for Detecting Cheats of Computer Game." IEEE Transactions on Dependable and Secure Computing. DOI: 10.1109/TDSC.2020.3013817
  • (2020) "A Survey on Multi-Label Data Stream Classification." IEEE Access 8, 1249-1275. DOI: 10.1109/ACCESS.2019.2962059
  • (2020) "Online Reliable Semi-supervised Learning on Evolving Data Streams." Information Sciences. DOI: 10.1016/j.ins.2020.03.052
  • (2020) "Semi-Supervised Classification of Data Streams by BIRCH Ensemble and Local Structure Mapping." Journal of Computer Science and Technology 35(2), 295-304. DOI: 10.1007/s11390-020-9999-y
  • (2019) "Incremental Market Behavior Classification in Presence of Recurring Concepts." Entropy 21(1), 25. DOI: 10.3390/e21010025
  • (2019) "Hierarchical Cluster-Based Adaptive Model for Semi-Supervised Classification of Data Stream with Concept Drift." Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science, 41-49. DOI: 10.1145/3349341.3349366
