STDS: self-training data streams for mining limited labeled data in non-stationary environment

Khezri, Shirin; Tanha, Jafar; Ahmadi, Ali; Sharifi, Arash

doi:10.1007/s10489-019-01585-3

Published: 21 January 2020

Applied Intelligence, Volume 50, pages 1448–1467 (2020)

Shirin Khezri¹,
Jafar Tanha (ORCID: orcid.org/0000-0002-0779-6027)²,
Ali Ahmadi³,⁴ &
Arash Sharifi¹

Abstract

In this article, we focus on semi-supervised classification in non-stationary environments. Semi-supervised learning is the task of learning from both labeled and unlabeled data points. Several approaches to semi-supervised learning exist for stationary environments, but they are not directly applicable to data streams. We propose a novel semi-supervised learning algorithm, named STDS. The proposed approach uses labeled and unlabeled data and includes a mechanism for handling concept drift in data streams. The main challenge in semi-supervised self-training for data streams is to find a proper selection metric for identifying a set of high-confidence predictions, together with a proper underlying base learner. We therefore propose an ensemble approach that identifies a set of high-confidence predictions based on clustering algorithms and classifier predictions. We then employ the Kullback-Leibler (KL) divergence to measure the distribution difference between sequential chunks in order to detect concept drift. When drift is detected, a new classifier is built from the new set of labeled data in the current chunk; otherwise, a percentage of the high-confidence newly labeled data in the current chunk, chosen by the proposed selection metric, is added to the labeled data in the next chunk to update the incremental classifier. The results of our experiments on a number of classification benchmark datasets show that STDS outperforms the supervised methods and most of the other semi-supervised learning methods.
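The two mechanisms named in the abstract, KL-based drift detection between sequential chunks and selection of high-confidence predictions, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the density estimator, the direction of the divergence, the threshold, or the exact selection metric, so the histogram estimate, KL(current || previous), the `threshold`, `min_confidence` and `fraction` parameters, and the function names `drift_detected` and `select_confident` are all assumptions made for illustration on a single numeric feature.

```python
import math


def histogram(values, lo, hi, bins, eps=1e-6):
    """Smoothed, normalized histogram of 1-D values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    total = len(values) + eps * bins
    # eps keeps every bin strictly positive so the log ratio is defined.
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_detected(prev_chunk, curr_chunk, bins=10, threshold=0.5):
    """Flag drift when KL(current || previous) exceeds a threshold.

    Assumption: both the divergence direction and the threshold value
    are illustrative choices, not taken from the paper.
    """
    lo = min(min(prev_chunk), min(curr_chunk))
    hi = max(max(prev_chunk), max(curr_chunk))
    if hi == lo:  # all values identical, nothing to compare
        return False
    p = histogram(prev_chunk, lo, hi, bins)
    q = histogram(curr_chunk, lo, hi, bins)
    return kl_divergence(q, p) > threshold


def select_confident(probabilities, cluster_labels,
                     min_confidence=0.9, fraction=0.5):
    """Pick pseudo-labels on which classifier and clustering agree.

    probabilities: one dict {label: probability} per unlabeled point.
    cluster_labels: the label suggested by each point's cluster.
    Returns (index, label) pairs for the top `fraction` of agreeing,
    confident predictions, most confident first.
    """
    candidates = []
    for i, (probs, cluster_label) in enumerate(zip(probabilities, cluster_labels)):
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if label == cluster_label and conf >= min_confidence:
            candidates.append((conf, i, label))
    candidates.sort(reverse=True)
    keep = math.ceil(fraction * len(candidates))
    return [(i, label) for _, i, label in candidates[:keep]]
```

In a chunk-by-chunk loop, a caller would first test `drift_detected(previous, current)`. On drift it would train a fresh classifier on the labeled part of the current chunk; otherwise it would carry the pairs returned by `select_confident` into the labeled set of the next chunk. A real stream has many features, so a multivariate density estimate would be needed in place of the one-dimensional histogram used here.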

Author information

Authors and Affiliations

Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Shirin Khezri & Arash Sharifi
Electrical and Computer Engineering Department, University of Tabriz, Tabriz, Iran
Jafar Tanha
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Ali Ahmadi
Faculty of Computer Engineering, K.N.Toosi University of Technology, Tehran, Iran
Ali Ahmadi

Corresponding author

Correspondence to Jafar Tanha.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khezri, S., Tanha, J., Ahmadi, A. et al. STDS: self-training data streams for mining limited labeled data in non-stationary environment. Appl Intell 50, 1448–1467 (2020). https://doi.org/10.1007/s10489-019-01585-3

Published: 21 January 2020
Issue Date: May 2020
DOI: https://doi.org/10.1007/s10489-019-01585-3
