Streaming Data Analytics for Feature Importance Measures in Concept Drift Detection and Adaptation

  • Conference paper
Database and Expert Systems Applications (DEXA 2023)

Abstract

Many applications must detect and adapt to concept drift in streaming data on the fly, a task made harder by limited computational resources and limited access to archival storage. In this paper, we study features that capture the evolving relationship between raw data features and target labels, and techniques to extract those features. In particular, we focus on the relationship between feature importance measures in streaming data and the predictive performance of the main classifier. For this, we consider two groups of feature importance measures, impurity-based and permutation-based, both of which are computed over an auxiliary ensemble of online gradient-boosted decision trees that runs in parallel with the main classifier and processes the same data stream. We found strong evidence that feature importance measures follow the long-term trend of the performance metrics, even when the data streams are non-stationary or the measures deviate from the performance metrics in the short term. Our study also shows that classification models processing data with a constant or monotonic rate of drift are robust with respect to the stationarity of the feature importance measures and the learner’s predictive performance. Moreover, we found evidence that permutation feature importance measurements are more consistent and reliable than impurity-based ones when the data exhibit periodic or non-monotonic rates of drift, or when the drift behavior is not known a priori. Our study and results indicate that the feature importance measures considered are viable sources of information for concept drift detection and adaptation problems. We established this through a solution to these problems that we developed based on vector error-correction analysis.
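To make the permutation-based measure concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes a fixed stand-in model in place of the auxiliary online gradient-boosted ensemble, a synthetic two-feature stream with one abrupt drift, and non-overlapping windows; the names `permutation_importance`, `windowed_importances`, and `stand_in_model` are hypothetical. The importance of a feature is taken as the mean drop in window accuracy when that feature's column is shuffled.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Row = Sequence[float]
Example = Tuple[Row, int]


def accuracy(predict: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of rows in the window that `predict` labels correctly."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[Row], int], X: List[Row],
                           y: List[int], n_repeats: int = 10,
                           seed: int = 0) -> List[float]:
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the window with only feature j permuted across rows.
            X_perm = [list(row[:j]) + [v] + list(row[j + 1:])
                      for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def windowed_importances(predict: Callable[[Row], int], stream: List[Example],
                         window_size: int) -> Iterator[List[float]]:
    """Yield one importance vector per non-overlapping window of the stream."""
    for start in range(0, len(stream) - window_size + 1, window_size):
        chunk = stream[start:start + window_size]
        X = [row for row, _ in chunk]
        y = [label for _, label in chunk]
        yield permutation_importance(predict, X, y)


# Synthetic stream (assumption): labels depend on feature 0 for the first
# 200 examples, then abruptly switch to depending on feature 1.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(400)]
stream = [(r, int(r[0] > 0.5) if i < 200 else int(r[1] > 0.5))
          for i, r in enumerate(rows)]


def stand_in_model(row: Row) -> int:
    """Fixed rule that only looks at feature 0 (stand-in for a trained learner)."""
    return int(row[0] > 0.5)


importances_per_window = list(windowed_importances(stand_in_model, stream, 100))
for k, imp in enumerate(importances_per_window):
    print(f"window {k}: " + ", ".join(f"{v:+.3f}" for v in imp))
```

In this toy setting the importance of feature 0 is large before the drift and falls toward zero afterwards, because the stand-in model no longer explains the labels. The resulting per-window series is the kind of signal that could be compared against a performance metric over time; an adaptive learner, as in the paper, would additionally update itself between windows.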

This work was partially supported by Concordia University.

Author information

Corresponding author

Correspondence to Ali Alizadeh Mansouri.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Alizadeh Mansouri, A., Javadtalab, A., Shiri, N. (2023). Streaming Data Analytics for Feature Importance Measures in Concept Drift Detection and Adaptation. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_8

  • DOI: https://doi.org/10.1007/978-3-031-39847-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39846-9

  • Online ISBN: 978-3-031-39847-6

  • eBook Packages: Computer Science, Computer Science (R0)
