Abstract
Collections of text documents such as product reviews and microblogs often evolve over time. In practice, however, classifiers trained on them are updated infrequently, so their performance degrades. While approaches for automatic drift detection have been proposed, they were often designed for low-dimensional sensor data, and it is unclear how well they perform for state-of-the-art text classifiers based on high-dimensional document embeddings. In this paper, we empirically compare drift detectors on document embeddings using two benchmark datasets with varying amounts of drift. Our results show that multivariate drift detectors based on the Kernel Two-Sample Test and Least-Squares Density Difference outperform univariate drift detectors based on the Kolmogorov-Smirnov Test. Moreover, our experiments show that current drift detectors perform better on lower-dimensional embeddings.
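To make the contrast between the two detector families concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a univariate detector that runs a Kolmogorov-Smirnov test per embedding dimension and combines the results with a Bonferroni correction, and a multivariate detector that runs a kernel two-sample test (biased MMD estimate, RBF kernel with a median-heuristic bandwidth, permutation p-value). The function names, the asymptotic p-value approximation, the significance level, and the synthetic Gaussian vectors standing in for document embeddings are all assumptions made for illustration.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_p_value(a, b):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    d = ks_statistic(a, b)
    en = math.sqrt(len(a) * len(b) / (len(a) + len(b)))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def ks_drift(x, y, alpha=0.05):
    """Univariate detector: one KS test per dimension, Bonferroni-corrected."""
    dims = len(x[0])
    p_values = [ks_p_value([r[d] for r in x], [r[d] for r in y])
                for d in range(dims)]
    return min(p_values) < alpha / dims

def mmd_permutation_test(x, y, n_perm=100, seed=0):
    """Multivariate detector: permutation p-value of the biased MMD^2 estimate."""
    rng = random.Random(seed)
    pooled = x + y
    n, total = len(x), len(x) + len(y)
    # Pairwise squared distances, computed once on the pooled sample.
    sq = [[sum((p - q) ** 2 for p, q in zip(pooled[i], pooled[j]))
           for j in range(total)] for i in range(total)]
    # Median heuristic for the RBF bandwidth (assumed, not from the paper).
    off_diag = sorted(sq[i][j] for i in range(total) for j in range(i + 1, total))
    median_sq = off_diag[len(off_diag) // 2] or 1.0
    k = [[math.exp(-v / (2.0 * median_sq)) for v in row] for row in sq]

    def mmd2(idx):
        xs, ys = idx[:n], idx[n:]
        kxx = sum(k[i][j] for i in xs for j in xs) / len(xs) ** 2
        kyy = sum(k[i][j] for i in ys for j in ys) / len(ys) ** 2
        kxy = sum(k[i][j] for i in xs for j in ys) / (len(xs) * len(ys))
        return kxx + kyy - 2.0 * kxy

    idx = list(range(total))
    observed = mmd2(idx)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # reassign points to the two groups at random
        if mmd2(idx) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Synthetic stand-ins for document embeddings (hypothetical data).
rng = random.Random(42)
reference = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(40)]
drifted = [[rng.gauss(2.0, 1.0) for _ in range(5)] for _ in range(40)]

print("KS + Bonferroni flags drift:", ks_drift(reference, drifted))
print("MMD permutation p-value:", mmd_permutation_test(reference, drifted))
```

The univariate detector examines each dimension in isolation, so it can only react to changes in the marginal distributions, whereas the kernel test compares the joint distribution of the whole embedding vector. The mean shift used here is large and visible in every dimension, so it illustrates the mechanics only and says nothing about the relative performance reported in the paper.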
Acknowledgments
This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080 A and B.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Feldhans, R. et al. (2021). Drift Detection in Text Data with Document Embeddings. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2021. IDEAL 2021. Lecture Notes in Computer Science, vol 13113. Springer, Cham. https://doi.org/10.1007/978-3-030-91608-4_11
DOI: https://doi.org/10.1007/978-3-030-91608-4_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91607-7
Online ISBN: 978-3-030-91608-4
eBook Packages: Computer Science, Computer Science (R0)