
Analyzing concept drift and shift from sample data

Data Mining and Knowledge Discovery

Abstract

Concept drift and shift are major issues that greatly affect the accuracy and reliability of many real-world applications of machine learning. We propose a new data mining task, concept drift mapping—the description and analysis of instances of concept drift or shift. We argue that concept drift mapping is an essential prerequisite for tackling concept drift and shift. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift and shift in marginal distributions. We present quantitative concept drift mapping techniques, along with methods for visualizing their results. We illustrate their effectiveness for real-world applications across energy-pricing, vegetation monitoring and airline scheduling.



Acknowledgements

This work was supported by the Australian Research Council under Awards DP140100087 and DE170100037. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under Award Number FA2386-17-1-4033. The authors would like to thank the colleagues from CESBIO (Jordi Inglada, Arthur Vincent, Marcela Arias, Benjamin Tardy, David Morin and Isabel Rodes) for providing the Satellite dataset (data and labels).

Author information


Corresponding author

Correspondence to Geoffrey I. Webb.

Additional information

Responsible editor: Jesse Davis, Elisa Fromont, Derek Greene, and Bjorn Bringmann.

Proof that drift magnitude is monotone under increasing dimensionality


Here we prove that total variation distance is monotone under increasing dimensionality: adding further covariates to the set over which the distance is computed can never decrease it. The proof generalizes trivially to Hellinger distance. Note that where one set of variables is conditioned on another, this monotone increase in distance applies to the dimensionality of the conditioned variables, not to that of the conditioning variables.

Let X and Z be sets of covariates.

$$\begin{aligned}
\sigma_{t,u}(X) &\le \sigma_{t,u}(X,Z) \\
&\Updownarrow \\
\frac{1}{2}\sum_{\bar{x}\in\mathrm{dom}(X)}\left|P_t(\bar{x})-P_u(\bar{x})\right| &\le \frac{1}{2}\sum_{\substack{\bar{x}\in\mathrm{dom}(X)\\ \bar{z}\in\mathrm{dom}(Z)}}\left|P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\right| \\
&\Updownarrow \\
\sum_{\bar{x}\in\mathrm{dom}(X)}\left|\sum_{\bar{z}\in\mathrm{dom}(Z)}P_t(\bar{x},\bar{z})-\sum_{\bar{z}\in\mathrm{dom}(Z)}P_u(\bar{x},\bar{z})\right| &\le \sum_{\bar{x}\in\mathrm{dom}(X)}\sum_{\bar{z}\in\mathrm{dom}(Z)}\left|P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\right| \\
&\Updownarrow \\
\sum_{\bar{x}\in\mathrm{dom}(X)}\left|\sum_{\bar{z}\in\mathrm{dom}(Z)}\bigl(P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\bigr)\right| &\le \sum_{\bar{x}\in\mathrm{dom}(X)}\sum_{\bar{z}\in\mathrm{dom}(Z)}\left|P_t(\bar{x},\bar{z})-P_u(\bar{x},\bar{z})\right|
\end{aligned}$$

The final inequality follows from the triangle inequality, applied to the inner sum over dom(Z) for each fixed value in dom(X).
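The inequality above can be checked numerically. The following is a minimal sketch, not the authors' code: it draws random joint distributions over two discrete covariates, marginalizes out Z, and confirms that the distance over X alone never exceeds the distance over (X, Z). The function names, the table shape and the use of NumPy are all assumptions made for illustration, and the Hellinger distance is taken in its form normalized to the range [0, 1], which may differ from the article's normalization by a constant factor (this does not affect monotonicity).

```python
import numpy as np


def total_variation(p, q):
    """Total variation distance between two discrete distributions (same shape)."""
    return 0.5 * np.abs(p - q).sum()


def hellinger(p, q):
    """Hellinger distance, normalized to [0, 1] (assumed normalization)."""
    return np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum())


def marginalize_z(joint):
    """Sum a joint table P(x, z) over z (axis 1) to obtain the marginal P(x)."""
    return joint.sum(axis=1)


def random_joint(rng, shape):
    """Draw a random joint probability table of the given shape."""
    weights = rng.random(shape)
    return weights / weights.sum()


def check_monotone(trials=1000, shape=(4, 3), seed=0, tol=1e-12):
    """Return True if, in every trial, distance over X <= distance over (X, Z)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        p_t = random_joint(rng, shape)  # distribution at time t
        p_u = random_joint(rng, shape)  # distribution at time u
        m_t, m_u = marginalize_z(p_t), marginalize_z(p_u)
        if total_variation(m_t, m_u) > total_variation(p_t, p_u) + tol:
            return False
        if hellinger(m_t, m_u) > hellinger(p_t, p_u) + tol:
            return False
    return True


if __name__ == "__main__":
    print("monotone in all trials:", check_monotone())
```

A random check of this kind does not replace the proof. It only illustrates that marginalizing out covariates can hide drift that is visible in the joint distribution, and cannot create drift that is absent from it.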


Cite this article

Webb, G.I., Lee, L.K., Goethals, B. et al. Analyzing concept drift and shift from sample data. Data Min Knowl Disc 32, 1179–1199 (2018). https://doi.org/10.1007/s10618-018-0554-1


  • DOI: https://doi.org/10.1007/s10618-018-0554-1
