
STDS: self-training data streams for mining limited labeled data in non-stationary environment

Applied Intelligence

Abstract

In this article, we focus on semi-supervised classification in non-stationary environments. Semi-supervised learning is the task of learning from both labeled and unlabeled data points. Several approaches exist for semi-supervised learning in stationary environments, but they are not directly applicable to data streams. We propose a novel semi-supervised learning algorithm, named STDS, which uses labeled and unlabeled data and includes a mechanism for handling concept drift in data streams. The main challenge in semi-supervised self-training for data streams is to find a proper selection metric for identifying a set of high-confidence predictions, together with a proper underlying base learner. We therefore propose an ensemble approach that identifies high-confidence predictions based on clustering algorithms and classifier predictions. We then use the Kullback-Leibler (KL) divergence to measure the difference in distribution between sequential chunks in order to detect concept drift. When drift is detected, a new classifier is trained on the labeled data in the current chunk; otherwise, a percentage of the high-confidence newly labeled data in the current chunk, chosen by the proposed selection metric, is added to the labeled data in the next chunk to update the incremental classifier. Experiments on a number of classification benchmark datasets show that STDS outperforms the supervised method and most of the other semi-supervised learning methods.
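The drift check described in the abstract, comparing the distributions of sequential chunks with the KL divergence, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the chunk distributions are estimated or how the drift threshold is chosen, so the shared-bin histogram over a single feature, the smoothing constant `eps`, and the names `kl_divergence`, `drift_detected`, `n_bins`, and `threshold` are all assumptions made for illustration.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions given as counts or probabilities.

    eps is an assumed smoothing constant that keeps empty bins from
    producing log(0) or division by zero.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def drift_detected(prev_chunk, curr_chunk, n_bins=10, threshold=0.1):
    """Flag drift when the KL divergence between two chunks exceeds a threshold.

    Both chunks are 1-D arrays holding one feature (an assumption of this
    sketch). They are binned on shared edges so the histograms are comparable.
    The threshold value is hypothetical and would need tuning per stream.
    """
    prev_chunk = np.asarray(prev_chunk, dtype=float)
    curr_chunk = np.asarray(curr_chunk, dtype=float)
    lo = min(prev_chunk.min(), curr_chunk.min())
    hi = max(prev_chunk.max(), curr_chunk.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(prev_chunk, bins=edges)
    q, _ = np.histogram(curr_chunk, bins=edges)
    return kl_divergence(p, q) > threshold


if __name__ == "__main__":
    stable = np.linspace(0.0, 1.0, 1000)
    shifted = np.linspace(2.0, 3.0, 1000)
    print(drift_detected(stable, stable))   # identical chunks
    print(drift_detected(stable, shifted))  # chunk moved to a new range
```

In a chunk-based loop, a positive result from `drift_detected` would correspond to the branch where the classifier is rebuilt from the current chunk's labeled data, and a negative result to the branch where high-confidence pseudo-labeled points are carried into the next chunk.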



Author information

Corresponding author

Correspondence to Jafar Tanha.

About this article

Cite this article

Khezri, S., Tanha, J., Ahmadi, A. et al. STDS: self-training data streams for mining limited labeled data in non-stationary environment. Appl Intell 50, 1448–1467 (2020). https://doi.org/10.1007/s10489-019-01585-3

  • DOI: https://doi.org/10.1007/s10489-019-01585-3
