Survey of distance measures for quantifying concept drift and shift in numeric data

  • Survey Paper
  • Knowledge and Information Systems

Abstract

Deployed machine learning systems are necessarily learned from historical data and are often applied to current data. When the world changes, the learned models can lose fidelity. Such changes to the statistical properties of data over time are known as concept drift. Similarly, models are often learned in one context, but need to be applied in another. This is called concept shift. Quantifying the magnitude of drift or shift, especially in the context of covariate drift or shift, or unsupervised learning, requires use of measures of distance between distributions. In this paper, we survey such distance measures with respect to their suitability for estimating drift and shift magnitude between samples of numeric data.
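
As a rough illustration of what measuring drift magnitude between two numeric samples can look like, the sketch below compares a historical sample with two current samples. The choice of measures (the two-sample Kolmogorov-Smirnov statistic and a histogram estimate of total variation distance), the function names, the bin count, and the synthetic Gaussian data are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(np.asarray(a, dtype=float)), np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    # Fraction of each sample that is <= each pooled observation.
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def total_variation(a, b, bins=20):
    """Histogram estimate of total variation distance on shared bin edges."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Shared edges over the pooled range, so both histograms are comparable.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    p = np.histogram(a, bins=edges)[0] / a.size
    q = np.histogram(b, bins=edges)[0] / b.size
    return float(0.5 * np.sum(np.abs(p - q)))


# Synthetic data (assumed): the "drifted" sample has its mean moved by 1.
rng = np.random.default_rng(0)
historical = rng.normal(0.0, 1.0, size=2000)
unchanged = rng.normal(0.0, 1.0, size=2000)
drifted = rng.normal(1.0, 1.0, size=2000)

for name, current in [("unchanged", unchanged), ("drifted", drifted)]:
    print(name, ks_statistic(historical, current), total_variation(historical, current))
```

Both measures lie in [0, 1], with 0 for identical samples and 1 for samples that do not overlap at all. The histogram estimate depends on the bin count, which is one reason the suitability of each measure for sample data needs careful comparison.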

Notes

  1. Hereafter, for ease of exposition, we refer to both concept drift and concept shift as “concept drift”.

  2. A sample implementation in R (using the R functions KernSmooth::bkde, stats::approx and sfsmisc::integrate.xy) can be found in the file ‘Distances.R’.
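
The three R functions named in note 2 correspond to three generic steps: kernel density estimation, linear interpolation onto a common grid, and numerical integration over (x, y) points. The note does not say which distance they are combined to compute, so the sketch below assumes the Hellinger distance and uses Python with NumPy rather than R. It is an analogue written for illustration, not the contents of Distances.R; the function names, the Silverman bandwidth rule, and the grid sizes are all assumptions.

```python
import numpy as np


def gaussian_kde_on_grid(sample, grid):
    """Gaussian kernel density estimate of `sample`, evaluated at `grid`."""
    sample = np.asarray(sample, dtype=float)
    # Silverman's rule-of-thumb bandwidth (an assumed choice).
    h = 1.06 * sample.std(ddof=1) * sample.size ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))


def hellinger_kde(a, b, coarse=128, fine=1024):
    """Hellinger distance between two samples via density estimates."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.concatenate([a, b])
    pad = 3 * pooled.std(ddof=1)
    lo, hi = pooled.min() - pad, pooled.max() + pad
    # Step 1: smooth each sample into a density on a coarse grid.
    xs = np.linspace(lo, hi, coarse)
    p, q = gaussian_kde_on_grid(a, xs), gaussian_kde_on_grid(b, xs)
    # Step 2: linearly interpolate both densities onto one common fine grid.
    xf = np.linspace(lo, hi, fine)
    pf, qf = np.interp(xf, xs, p), np.interp(xf, xs, q)
    # Step 3: trapezoidal integration of the squared gap between root densities.
    g = (np.sqrt(pf) - np.sqrt(qf)) ** 2
    integral = np.sum((g[1:] + g[:-1]) * np.diff(xf)) / 2
    return float(np.sqrt(0.5 * integral))


# Synthetic samples (assumed): same distribution versus a mean moved by 1.
rng = np.random.default_rng(1)
base = rng.normal(0.0, 1.0, size=1000)
same = rng.normal(0.0, 1.0, size=1000)
moved = rng.normal(1.0, 1.0, size=1000)
print(hellinger_kde(base, same), hellinger_kde(base, moved))
```

The result depends on the bandwidth and on how far the grid extends past the data, so values from this sketch are only comparable with each other when those settings are held fixed.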

Author information

Correspondence to Igor Goldenberg.

Cite this article

Goldenberg, I., Webb, G.I. Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl Inf Syst 60, 591–615 (2019). https://doi.org/10.1007/s10115-018-1257-z

  • DOI: https://doi.org/10.1007/s10115-018-1257-z
