Survey of distance measures for quantifying concept drift and shift in numeric data

Goldenberg, Igor; Webb, Geoffrey I.

doi:10.1007/s10115-018-1257-z

Survey Paper
Published: 08 September 2018

Volume 60, pages 591–615, (2019)

Knowledge and Information Systems

Abstract

Deployed machine learning systems are necessarily learned from historical data and are often applied to current data. When the world changes, the learned models can lose fidelity. Such changes to the statistical properties of data over time are known as concept drift. Similarly, models are often learned in one context, but need to be applied in another. This is called concept shift. Quantifying the magnitude of drift or shift, especially in the context of covariate drift or shift, or unsupervised learning, requires use of measures of distance between distributions. In this paper, we survey such distance measures with respect to their suitability for estimating drift and shift magnitude between samples of numeric data.

Notes

Hereafter, for ease of exposition, we refer to both concept drift and concept shift as “concept drift”.
A sample implementation in R (using the R functions KernSmooth::bkde, stats::approx and sfsmisc::integrate.xy) can be found in the file ‘Distances.R’.
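
The file named in the note above is not reproduced on this page. As a rough illustration of the kind of pipeline those three R functions suggest (kernel density estimation, evaluation of both densities on a shared grid, numerical integration), the following Python sketch estimates the Hellinger distance between two univariate numeric samples. The Hellinger distance is chosen here only for illustration. All function names, the rule-of-thumb bandwidth, the grid size and the padding are assumptions made for this sketch; it is not the authors' implementation and does not use binned estimation as KernSmooth::bkde does.

```python
import math
import random


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth 1.06 * sd * n^(-1/5) (assumed; needs a non-constant sample)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate at x."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)

    return density


def trapezoid(xs, ys):
    """Integrate the points (xs, ys) with the trapezoidal rule."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def hellinger_distance(sample_p, sample_q, grid_size=401):
    """Estimate the Hellinger distance between the densities of two samples.

    Uses H = sqrt(1 - BC), where BC is the integral of sqrt(p(x) * q(x)).
    """
    h_p = rule_of_thumb_bandwidth(sample_p)
    h_q = rule_of_thumb_bandwidth(sample_q)
    p = gaussian_kde(sample_p, h_p)
    q = gaussian_kde(sample_q, h_q)

    # Shared evaluation grid, padded so the kernel tails are mostly covered.
    pad = 4.0 * max(h_p, h_q)
    lo = min(min(sample_p), min(sample_q)) - pad
    hi = max(max(sample_p), max(sample_q)) + pad
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]

    bc = trapezoid(xs, [math.sqrt(p(x) * q(x)) for x in xs])
    # Clamp to [0, 1] to absorb small numerical integration error.
    return math.sqrt(max(0.0, 1.0 - min(1.0, bc)))


if __name__ == "__main__":
    random.seed(1)
    before = [random.gauss(0.0, 1.0) for _ in range(200)]
    after = [random.gauss(1.0, 1.0) for _ in range(200)]
    print("estimated Hellinger distance:", hellinger_distance(before, after))
```

The estimate is bounded between 0 and 1 and depends on the bandwidth and on sample size, so values computed from small samples should be read as approximate.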

Author information

Authors and Affiliations

Faculty of Information Technology, Monash University, Clayton, 3800, Australia
Igor Goldenberg & Geoffrey I. Webb

Corresponding author

Correspondence to Igor Goldenberg.

About this article

Cite this article

Goldenberg, I., Webb, G.I. Survey of distance measures for quantifying concept drift and shift in numeric data. Knowl Inf Syst 60, 591–615 (2019). https://doi.org/10.1007/s10115-018-1257-z

Received: 05 July 2017
Accepted: 24 August 2018
Published: 08 September 2018
Issue Date: 01 August 2019
DOI: https://doi.org/10.1007/s10115-018-1257-z
