Abstract
Numerous applications require the ability to detect and adapt to concept drift in streaming data on the fly. This is made challenging by limited computational resources and limited access to archival storage. In this paper, we study features that capture the evolving relationship between raw data features and target labels, and techniques to extract those features. In particular, we focus on the relationship between feature importance measures in streaming data and the predictive performance of the main classifier. To this end, we consider two groups of feature importance measures, impurity-based and permutation-based, both computed over an auxiliary online gradient boosted decision tree ensemble that runs in parallel with the main classifier and processes the same data stream. We found strong evidence that feature importance measures follow the long-term trend of the performance metrics, even when the data streams are non-stationary or the measures deviate from the performance metrics in the short term. Our study also shows that classification models processing data with a constant or monotonic rate of drift are robust in terms of the stationarity of the feature importance measures and the learner's predictive performance. Moreover, we found evidence that permutation feature importance measurements are more consistent and reliable than impurity-based ones when the data exhibits periodic or non-monotonic rates of drift, or when the nature of the drift is not known a priori. Our results indicate that the feature importance measures considered are viable sources of information for concept drift detection and adaptation. We established this through a solution to these problems that we developed based on vector error-correction analysis.
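To make the idea of a permutation-based feature importance series concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a synthetic two-feature stream with one abrupt drift, and it swaps the auxiliary online gradient boosted decision tree ensemble for a simple nearest-centroid classifier that is refit on each window. All names (`make_stream`, `fit_centroids`, `permutation_importance`, the window size of 250) are hypothetical choices made for illustration.

```python
import random

def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((v - m) ** 2 for v, m in zip(row, centroids[c])))

def accuracy(centroids, X, y):
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, rng, n_repeats=5):
    """Mean drop in accuracy when one feature column is shuffled."""
    base = accuracy(centroids, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            drops.append(base - accuracy(centroids, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def make_stream(n, drift_at, rng):
    """Synthetic stream (assumption): the label depends on feature 0 before
    the drift point and on feature 1 afterwards."""
    for t in range(n):
        x = [rng.random(), rng.random()]
        relevant = 0 if t < drift_at else 1
        yield x, int(x[relevant] > 0.5)

rng = random.Random(0)
stream = list(make_stream(1000, 500, rng))

window = 250  # hypothetical window size; the drift falls on a window boundary
importance_series = []
for start in range(0, len(stream), window):
    chunk = stream[start:start + window]
    X = [x for x, _ in chunk]
    y = [label for _, label in chunk]
    model = fit_centroids(X, y)  # stand-in for the auxiliary learner
    importance_series.append(permutation_importance(model, X, y, rng))

for i, imp in enumerate(importance_series):
    print(f"window {i}: " + ", ".join(f"f{j}={v:.2f}" for j, v in enumerate(imp)))
```

In this sketch the importance of feature 0 dominates in the windows before the drift and the importance of feature 1 dominates afterwards, which is the kind of shift in the importance series that a drift detector could monitor. For brevity, importance is measured on the same window used for fitting; a prequential (test-then-train) evaluation would be closer to a real streaming setting.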
This work was partially supported by Concordia University.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Alizadeh Mansouri, A., Javadtalab, A., Shiri, N. (2023). Streaming Data Analytics for Feature Importance Measures in Concept Drift Detection and Adaptation. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_8
DOI: https://doi.org/10.1007/978-3-031-39847-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39846-9
Online ISBN: 978-3-031-39847-6
eBook Packages: Computer Science, Computer Science (R0)