Abstract
Deployed machine learning systems are necessarily learned from historical data and are often applied to current data. When the world changes, the learned models can lose fidelity. Such changes to the statistical properties of data over time are known as concept drift. Similarly, models are often learned in one context, but need to be applied in another. This is called concept shift. Quantifying the magnitude of drift or shift, especially in the context of covariate drift or shift, or unsupervised learning, requires use of measures of distance between distributions. In this paper, we survey such distance measures with respect to their suitability for estimating drift and shift magnitude between samples of numeric data.
Notes
Hereafter, for ease of exposition, we refer to both concept drift and concept shift as “concept drift”.
A sample implementation in R (using the R functions KernSmooth::bkde, stats::approx and sfsmisc::integrate.xy) can be found in the file ‘Distances.R’.
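The pipeline named in the note above (kernel density estimation, interpolation onto a shared grid, numerical integration) can be sketched roughly as follows. This is an illustrative Python analogue, not the authors' Distances.R. The note does not say which distance the R code computes, so the total variation distance is assumed here purely as an example. The function names, the plain (unbinned) Gaussian kernel, the Silverman bandwidth and the grid sizes are likewise assumptions.

```python
import numpy as np


def kde_on_grid(sample, n_points=401):
    """Gaussian KDE of a 1-D sample, evaluated on its own evenly spaced grid.

    Stands in for KernSmooth::bkde; this version is not binned.
    """
    sample = np.asarray(sample, dtype=float)
    # Silverman's rule-of-thumb bandwidth (assumed; the note names no bandwidth).
    h = 1.06 * sample.std(ddof=1) * sample.size ** (-0.2)
    grid = np.linspace(sample.min() - 4 * h, sample.max() + 4 * h, n_points)
    z = (grid[:, None] - sample[None, :]) / h
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (sample.size * h * np.sqrt(2 * np.pi))
    return grid, density


def trapezoid(y, x):
    """Trapezoidal rule over (x, y) points; stands in for sfsmisc::integrate.xy."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def total_variation_distance(sample_a, sample_b, n_points=2001):
    """Estimate 0.5 * integral of |p_a(x) - p_b(x)| dx from two samples."""
    xa, fa = kde_on_grid(sample_a)
    xb, fb = kde_on_grid(sample_b)
    # Shared grid spanning both supports.
    common = np.linspace(min(xa[0], xb[0]), max(xa[-1], xb[-1]), n_points)
    # Linear interpolation (stands in for stats::approx); zero outside each grid.
    pa = np.interp(common, xa, fa, left=0.0, right=0.0)
    pb = np.interp(common, xb, fb, left=0.0, right=0.0)
    return 0.5 * trapezoid(np.abs(pa - pb), common)


# Synthetic samples (not data from the paper): a mean shift of three standard deviations.
rng = np.random.default_rng(0)
before = rng.normal(0.0, 1.0, size=1000)
after = rng.normal(3.0, 1.0, size=1000)

same = total_variation_distance(before, before)
drift = total_variation_distance(before, after)
```

Comparing a sample with itself gives zero, while the shifted sample gives a value close to one, the upper bound of the total variation distance. Kernel smoothing tends to pull the estimate slightly below the true distance between the underlying distributions.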
Cite this article
Goldenberg, I., Webb, G.I. Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl Inf Syst 60, 591–615 (2019). https://doi.org/10.1007/s10115-018-1257-z
DOI: https://doi.org/10.1007/s10115-018-1257-z