Drift Detection in Text Data with Document Embeddings

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2021 (IDEAL 2021)

Abstract

Collections of text documents such as product reviews and microblogs often evolve over time. In practice, however, classifiers trained on them are updated infrequently, so their performance degrades. While approaches for automatic drift detection have been proposed, they were often designed for low-dimensional sensor data, and it is unclear how well they perform for state-of-the-art text classifiers based on high-dimensional document embeddings. In this paper, we empirically compare drift detectors on document embeddings using two benchmark datasets with varying amounts of drift. Our results show that multivariate drift detectors based on the Kernel Two-Sample Test and Least-Squares Density Difference outperform univariate drift detectors based on the Kolmogorov-Smirnov Test. Moreover, our experiments show that current drift detectors perform better on lower-dimensional embeddings.
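
To make the idea of a multivariate detector concrete, the following is a minimal sketch of a kernel two-sample test (squared maximum mean discrepancy with a permutation test) applied to two batches of embedding vectors. It is not the implementation evaluated in the paper. The function names, the median-heuristic bandwidth, the number of permutations, and the synthetic Gaussian vectors standing in for document embeddings are all assumptions made for illustration.

```python
import numpy as np


def squared_distances(x):
    """Pairwise squared Euclidean distances between the rows of x."""
    norms = np.sum(x ** 2, axis=1)
    d = norms[:, None] + norms[None, :] - 2.0 * x @ x.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding


def mmd2(kernel, n):
    """Biased estimate of squared MMD (diagonal terms included).

    The first n rows/columns of `kernel` belong to the reference sample,
    the remaining ones to the window under test.
    """
    k_xx = kernel[:n, :n]
    k_yy = kernel[n:, n:]
    k_xy = kernel[:n, n:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


def mmd_permutation_test(reference, window, n_permutations=200, seed=0):
    """Return (squared MMD, permutation p-value) for two embedding batches.

    Hypothetical helper: uses a Gaussian RBF kernel whose bandwidth is set
    by the median heuristic (an assumption, not taken from the paper).
    Assumes the pooled sample contains at least two distinct points.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([reference, window])
    n = len(reference)

    sq = squared_distances(pooled)
    median_sq = np.median(sq[np.triu_indices_from(sq, k=1)])
    kernel = np.exp(-sq / median_sq)

    observed = mmd2(kernel, n)

    # Null distribution: shuffle which points count as "reference".
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        if mmd2(kernel[np.ix_(perm, perm)], n) >= observed:
            exceed += 1

    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-ins for document embeddings (16 dimensions).
    reference = rng.normal(0.0, 1.0, size=(60, 16))
    no_drift = rng.normal(0.0, 1.0, size=(60, 16))
    drifted = rng.normal(1.0, 1.0, size=(60, 16))  # mean shift in every dimension

    for name, batch in [("no drift", no_drift), ("drifted", drifted)]:
        stat, p = mmd_permutation_test(reference, batch)
        print(f"{name}: MMD^2 = {stat:.4f}, p = {p:.4f}")
```

A small p-value suggests that the window was drawn from a different distribution than the reference batch. A univariate alternative would instead test each embedding dimension separately, for example with a Kolmogorov-Smirnov test, and correct for multiple comparisons.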

Notes

  1. https://github.com/EML4U/Drift-detector-comparison

  2. https://github.com/SeldonIO/alibi-detect

  3. https://github.com/emanuele/kernel_two_sample_test

  4. https://snap.stanford.edu/data/web-Movies.html

  5. https://radimrehurek.com/gensim/

  6. https://www.kaggle.com/manchunhui/us-election-2020-tweets

  7. https://pypi.org/project/langdetect/

  8. http://code.google.com/p/language-detection/

  9. https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon

Acknowledgments

This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080 A and B.

Author information

Corresponding author

Correspondence to Robert Feldhans.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Feldhans, R. et al. (2021). Drift Detection in Text Data with Document Embeddings. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2021. IDEAL 2021. Lecture Notes in Computer Science, vol 13113. Springer, Cham. https://doi.org/10.1007/978-3-030-91608-4_11

  • DOI: https://doi.org/10.1007/978-3-030-91608-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91607-7

  • Online ISBN: 978-3-030-91608-4

  • eBook Packages: Computer Science, Computer Science (R0)
