
Mining Recurring Concept Drifts with Limited Labeled Streaming Data

Published: 01 February 2012

Abstract

Tracking recurring concept drifts is a significant issue in machine learning and data mining that frequently arises in real-world stream classification problems. Learning recurring concepts from a data stream that is largely unlabeled is a challenge for many streaming classification algorithms, and one that has received little attention from the research community. Motivated by this challenge, this article addresses the problem of recurring concepts in streaming environments with limited labeled data. We propose a semi-supervised classification algorithm for data streams with REcurring concept Drifts and Limited LAbeled data, called REDLLA, which adopts a decision tree as its classification model. As the tree grows, a clustering algorithm based on k-means is applied at the leaves to produce concept clusters, and unlabeled data at the leaves are labeled using a majority-class strategy. By measuring deviations between historical and new concept clusters, potential concept drifts are detected and recurring concepts are maintained. Extensive studies on both synthetic and real-world data confirm the advantages of our REDLLA algorithm over three state-of-the-art online classification algorithms (CVFDT, DWCDS, and CDRDT) and several known online semi-supervised algorithms, even when more than 90% of the data are unlabeled.
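The leaf-level step described in the abstract, clustering the instances buffered at a decision-tree leaf and labeling unlabeled instances by the majority class of their cluster, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the helper names (`kmeans`, `pseudo_label_leaf`) and the use of plain Lloyd's k-means are assumptions, and REDLLA's actual procedure operates inside a growing decision tree.

```python
# Hypothetical sketch of majority-class pseudo-labeling at a tree leaf.
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """A tiny Lloyd's k-means: returns (centroids, cluster assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assign

def pseudo_label_leaf(labeled, unlabeled, k=2):
    """labeled: list of (point, class); unlabeled: list of points.
    Returns a pseudo-label for each unlabeled point: the majority class
    of the labeled instances that fall in the same cluster."""
    points = [p for p, _ in labeled] + list(unlabeled)
    _, assign = kmeans(points, k)
    majority = {}
    for c in range(k):
        votes = Counter(y for i, (_, y) in enumerate(labeled) if assign[i] == c)
        majority[c] = votes.most_common(1)[0][0] if votes else None
    n = len(labeled)
    return [majority[assign[n + j]] for j in range(len(unlabeled))]

labeled = [((0.0, 0.0), "neg"), ((0.2, 0.1), "neg"), ((5.0, 5.0), "pos")]
unlabeled = [(0.1, 0.2), (4.8, 5.1)]
print(pseudo_label_leaf(labeled, unlabeled))  # → ['neg', 'pos']
```

In the full algorithm, the cluster centroids produced at a leaf would also be compared against stored historical centroids: a new cluster whose centroid deviates beyond a threshold signals a potential concept drift, while a close match to an old centroid indicates a recurring concept.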




Published In

ACM Transactions on Intelligent Systems and Technology  Volume 3, Issue 2
February 2012
455 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/2089094

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2012
Accepted: 01 July 2011
Revised: 01 May 2011
Received: 01 September 2010
Published in TIST Volume 3, Issue 2


Author Tags

  1. Data stream
  2. clustering
  3. concept drift
  4. decision tree

Qualifiers

  • Research-article
  • Research
  • Refereed


Cited By

  • (2024) "Addressing Intermediate Verification Latency in Online Learning Through Immediate Pseudo-labeling and Oriented Synthetic Correction." 2024 International Joint Conference on Neural Networks (IJCNN), 1-10. DOI: 10.1109/IJCNN60899.2024.10650783
  • (2023) "Classification in Dynamic Data Streams With a Scarcity of Labels." IEEE Transactions on Knowledge and Data Engineering 35(4), 3512-3524. DOI: 10.1109/TKDE.2021.3135755
  • (2023) "Experimenting with Supervised Drift Detectors in Semi-supervised Learning." 2023 IEEE Symposium Series on Computational Intelligence (SSCI), 730-735. DOI: 10.1109/SSCI52147.2023.10371933
  • (2021) "Probabilistic exact adaptive random forest for recurrent concepts in data streams." International Journal of Data Science and Analytics. DOI: 10.1007/s41060-021-00273-1
  • (2020) "GCI: A GPU Based Transfer Learning Approach for Detecting Cheats of Computer Game." IEEE Transactions on Dependable and Secure Computing. DOI: 10.1109/TDSC.2020.3013817
  • (2020) "A Survey on Multi-Label Data Stream Classification." IEEE Access 8, 1249-1275. DOI: 10.1109/ACCESS.2019.2962059
  • (2020) "Online Reliable Semi-supervised Learning on Evolving Data Streams." Information Sciences. DOI: 10.1016/j.ins.2020.03.052
  • (2020) "Semi-Supervised Classification of Data Streams by BIRCH Ensemble and Local Structure Mapping." Journal of Computer Science and Technology 35(2), 295-304. DOI: 10.1007/s11390-020-9999-y
  • (2019) "Incremental Market Behavior Classification in Presence of Recurring Concepts." Entropy 21(1), 25. DOI: 10.3390/e21010025
  • (2019) "Hierarchical Cluster-Based Adaptive Model for Semi-Supervised Classification of Data Stream with Concept Drift." Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science, 41-49. DOI: 10.1145/3349341.3349366
