Efficient unsupervised drift detector for fast and high-dimensional data streams

  • Regular Paper
Knowledge and Information Systems

Abstract

Stream mining considers the online arrival of examples at high speed and the possibility of changes in their descriptive features or class definitions compared with past knowledge (i.e., concept drifts). The fast detection of drifts is essential to keep the predictive model updated and stable in changing environments. For many applications, such as those related to smart sensors, the high number of features is an additional challenge in terms of memory and time for stream processing. This paper presents an unsupervised and model-independent concept drift detector suitable for high-speed and high-dimensional data streams. We propose a straightforward two-dimensional data representation that allows faster processing of datasets with a large number of examples and dimensions. We developed an adaptive drift detector on this visual representation that is efficient for fast streams with thousands of features and is as accurate as existing costly methods that perform various statistical tests on each feature individually. Our method achieves better performance, measured by execution time and classification accuracy, for different types of drifts. The experimental evaluation on synthetic and real data demonstrates the method’s versatility in several domains, including entomology, medicine, and transportation systems.
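To make the idea of detecting drift on a two-dimensional view of the stream more concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that a window of examples is stored as a matrix (rows are examples, columns are features), that two windows are compared with a mean squared difference, and that drift is flagged when the current distance leaves a band of `k` standard deviations around the recent distances. The names `window_distance`, `WindowDriftSketch`, `k`, and `warmup` are hypothetical and chosen only for illustration.

```python
import numpy as np


def window_distance(ref, cur):
    """Mean squared difference between two equally shaped windows.

    Each window is a 2D array: rows are examples, columns are features.
    (Assumed distance; the abstract does not name the one used in the paper.)
    """
    return float(np.mean((ref - cur) ** 2))


class WindowDriftSketch:
    """Illustrative unsupervised detector over whole windows (hypothetical)."""

    def __init__(self, reference, k=2.0, warmup=5):
        self.reference = np.asarray(reference, dtype=float)
        self.k = k            # band width in standard deviations (assumption)
        self.warmup = warmup  # distances collected before any test is made
        self.history = []     # recent distances to the reference window

    def update(self, window):
        """Return True if the new window is flagged as drifted."""
        window = np.asarray(window, dtype=float)
        d = window_distance(self.reference, window)
        drift = False
        if len(self.history) >= self.warmup:
            mean = np.mean(self.history)
            std = np.std(self.history)
            # Adaptive band: it follows the distances observed so far.
            drift = bool(abs(d - mean) > self.k * std)
        if drift:
            # Restart from the drifted window as the new reference.
            self.reference = window
            self.history = []
        else:
            self.history.append(d)
        return drift
```

Because a single distance is computed per window, the cost of each update grows linearly with the number of matrix entries and no per-feature statistical test is needed. In this sketch no label or classifier output is consulted, which is what makes it unsupervised and model-independent in the sense used above.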

Notes

  1. Supporting website: https://sites.google.com/view/ibdd-paper.

  2. Python implementation provided by the authors.

  3. https://nlp.stanford.edu/projects/glove/.

  4. https://bit.ly/39cy83O.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Award #OIA-1757207 and the Brazilian National Council for Scientific and Technological Development under Grant No. 142050/2019-9.

Author information

Correspondence to Vinicius M. A. Souza.

Cite this article

Souza, V.M.A., Parmezan, A.R.S., Chowdhury, F.A. et al. Efficient unsupervised drift detector for fast and high-dimensional data streams. Knowl Inf Syst 63, 1497–1527 (2021). https://doi.org/10.1007/s10115-021-01564-6

  • DOI: https://doi.org/10.1007/s10115-021-01564-6
