Pattern Recognition

Volume 45, Issue 1, January 2012, Pages 521-530

A unifying view on dataset shift in classification

https://doi.org/10.1016/j.patcog.2011.06.019

Abstract

The field of dataset shift has received growing interest in the last few years. The fact that most real-world applications have to cope with some form of shift makes its study highly relevant. The literature on the topic is mostly scattered, and different authors use different names to refer to the same concepts, or use the same name for different concepts. With this work, we attempt to present a unifying framework through the review and comparison of some of the most important works in the literature.

Highlights

  • Presentation of a unifying framework for the field of dataset shift, focusing on classification.

  • Analysis of the terminology used in the most relevant works of the field.

  • Formal definitions for each of the concepts appearing in the study of dataset shift, including sample selection bias.

Introduction

The machine learning community has analyzed data quality in classification problems from different perspectives, including data complexity [29], [7], missing values [19], [21], [39], noise [11], [64], [58], [38], imbalance [52], [27], [53] and, as is the case with this paper, dataset shift [4], [44], [14]. Dataset shift occurs when the testing (unseen) data experience a phenomenon that leads to a change in the distribution of a single feature, a combination of features, or the class boundaries. As a result, the common assumption that the training and testing data follow the same distributions is often violated in real-world applications and scenarios.

While the research area of dataset shift has received significant attention in recent years (most of the work has been published in the last eight years), the field suffers from a lack of standard terminology. Independent authors working under different conditions use different terms, making it difficult to find and compare proposals and studies in the field.

Contributions. The main goal of this work is to provide a unifying framework through the review and analysis of some of the most important publications in the field, comparing the terminology used in each of them and the exact definitions that were given. We present a framework that can be useful in future research and, at the same time, give researchers unfamiliar with the topic a brief introduction to it. Our goal is not only to unify the different methods and terminologies under a taxonomical structure, but also to provide a guide for both researchers and practitioners in machine learning and pattern recognition. We use the notation in [44] as the base for the comparisons. We also present a brief summary of solutions proposed in the literature.

The remainder of this paper is organized as follows: Some basic notation is introduced in Section 2. In Section 3, an analysis of the name given to the field of study is presented. Section 4 details the terminology used for the different types of dataset shift that can appear. Section 5 presents examples demonstrating the effect of these shifts on classifier performance. An analysis of some common causes of dataset shift is presented in Section 6. A brief summary of the solutions proposed in the literature is shown in Section 7. Finally, some conclusions are presented in Section 8.


Notation

In this work, we focus on the analysis of dataset shift in classification problems. A classification problem is defined by:

  • A set of features or covariates x.

  • A target variable y (the class variable).

  • A joint distribution P(y,x).
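
The types of shift discussed later are stated in terms of the marginal and conditional distributions arising from the two standard factorizations of this joint distribution:

P(y, x) = P(y|x) P(x) = P(x|y) P(y)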

When analyzing dataset shift, the relationships between the covariates and the class label are particularly relevant. Fawcett and Flach [20] proposed a taxonomy that classifies problems according to an intrinsic property of the data generation process: the causal relationship between the covariates and the class label. In X→Y problems the class value is causally determined by the values of the covariates, while in Y→X problems the values of the covariates are causally determined by the class.

Dataset shift

The term “dataset shift” was first used in the book by Quiñonero-Candela et al. [44], the first compilation in the field, where it was defined as “cases where the joint distribution of inputs and outputs differs between training and test stage” [49].

One of the main problems in the field is the lack of visibility from which most works suffer, since there is not even a standard term for the phenomenon. So far, each author has chosen a different name to refer to the same basic idea. As an example, the following

Types of dataset shift

In this section, we present an analysis of the different kinds of shift that can appear in a classification problem. Section 4.1 deals with covariate shift, while Sections 4.2 and 4.3 explain prior probability shift and concept shift, respectively. A graphical example is introduced to illustrate each of these cases. The section closes with Section 4.4, where other potential types of shift are explained.
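
For reference, the characterizations developed in those subsections can be condensed as follows, writing P_tr and P_tst for the training and test distributions and using the X→Y/Y→X distinction introduced in Section 2:

  • Covariate shift (X→Y problems): P_tr(y|x) = P_tst(y|x) but P_tr(x) ≠ P_tst(x).

  • Prior probability shift (Y→X problems): P_tr(x|y) = P_tst(x|y) but P_tr(y) ≠ P_tst(y).

  • Concept shift: P_tr(y|x) ≠ P_tst(y|x) in X→Y problems, or P_tr(x|y) ≠ P_tst(x|y) in Y→X problems.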

Examples of the relevance of dataset shift

The examples presented in Sections 4.1 and 4.2 were designed to showcase as clearly as possible what covariate shift and prior probability shift mean. However, they do not show why the study of these shifts is important: the negative effect dataset shift often has on classifier performance.

This section presents new examples for both covariate shift and prior probability shift, where the shifts actually produce a change in the Bayes error boundary.

Fig. 4 depicts a case of covariate shift.
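
While Fig. 4 is not reproduced here, the performance effect it illustrates is easy to reproduce numerically. The following minimal sketch (a hypothetical illustration, not taken from the paper; the distributions, the quadratic concept, and the choice of logistic regression are all assumptions) keeps P(y|x) fixed while shifting P(x) between training and test, and shows the resulting accuracy drop:

    # Hypothetical illustration: covariate shift with a fixed concept P(y|x).
    # A linear model fit on a narrow region of the covariate space fails when
    # the test covariates shift to a region where the fixed (quadratic)
    # boundary looks different. Requires numpy and scikit-learn.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def label(X):
        # Fixed concept: class 1 above a quadratic boundary (never changes).
        return (X[:, 1] > X[:, 0] ** 2).astype(int)

    # Training covariates centered at x0 = 0; test covariates shifted to x0 = 2.
    X_tr = rng.normal([0.0, 0.5], 0.5, size=(2000, 2))
    X_te = rng.normal([2.0, 0.5], 0.5, size=(2000, 2))

    clf = LogisticRegression().fit(X_tr, label(X_tr))
    print("train accuracy:", clf.score(X_tr, label(X_tr)))  # high
    print("test accuracy: ", clf.score(X_te, label(X_te)))  # degrades under shift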

Causes of dataset shift

In this section we comment on some of the most common causes of dataset shift. These concepts have at times created confusion, so it is important to remark that they are factors that can lead to the appearance of some of the shifts explained in Section 4, but they do not constitute dataset shift themselves.

There are several possible causes of dataset shift, of which this section discusses the two we deem most important: sample selection bias and non-stationary environments. In the
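
The mechanism behind sample selection bias is simple to state: training examples are drawn not from P(y, x) itself, but from that distribution filtered through a selection process. The sketch below (hypothetical; the sigmoid selection rule is an assumption chosen purely for illustration) shows how selection that depends on x distorts the training distribution of a covariate even though the population itself never changes:

    # Hypothetical sketch of sample selection bias: each example enters the
    # training set with a probability that depends on x, so P_train(x) != P(x)
    # although the population distribution is stationary. Requires numpy.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 100_000)                 # population covariate
    p_select = 1.0 / (1.0 + np.exp(-2.0 * x))         # selection favors large x
    x_train = x[rng.random(x.size) < p_select]        # biased training sample

    print("population mean of x:", round(float(x.mean()), 3))        # ~0.0
    print("training mean of x:  ", round(float(x_train.mean()), 3))  # shifted upward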

Proposals in the literature for the analysis of dataset shift

In this section we give a brief overview of the different proposals that have appeared in the literature to work under the different types of dataset shift.

Covariate shift has been extensively studied in the literature, and a number of proposals to work under it have been published. Some of the most important ones include weighting the log-likelihood function [47], importance-weighted cross-validation [51], asymptotic Bayesian generalization error [59], discriminative learning [9], and kernel mean matching.
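
To make the flavor of these approaches concrete, the following sketch illustrates the importance-weighting idea underlying the log-likelihood weighting of [47]: each training example is weighted by the density ratio p_tst(x)/p_tr(x), so the weighted training loss approximates the loss under the test distribution. This is a hypothetical illustration, not the method of [47] itself: both densities are taken as known Gaussians, whereas in practice the ratio must be estimated.

    # Importance-weighting sketch: reweight training examples by
    # w(x) = p_test(x) / p_train(x) before fitting. The Gaussian densities
    # below are assumed known purely for illustration.
    # Requires numpy, scipy and scikit-learn.
    import numpy as np
    from scipy.stats import norm
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X_tr = rng.normal(0.0, 1.0, size=(5000, 1))
    y_tr = (X_tr[:, 0] + rng.normal(0.0, 0.5, 5000) > 1.0).astype(int)

    # Density ratio of an assumed test density N(2, 1) to the train density N(0, 1).
    w = norm.pdf(X_tr[:, 0], loc=2.0) / norm.pdf(X_tr[:, 0], loc=0.0)

    plain = LogisticRegression().fit(X_tr, y_tr)
    reweighted = LogisticRegression().fit(X_tr, y_tr, sample_weight=w)
    # 'reweighted' concentrates its fit where the test distribution has mass.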

Concluding remarks

In many practical applications of machine learning, the data available for model building (training data) are not strictly representative of the data on which the classifier will ultimately be deployed (test data). This problem, which we call dataset shift in accordance with [44], generalizes a wide variety of research efforts that are scattered throughout the machine learning literature. The purpose of this paper is to survey and unify this research in order to better inform future endeavors in the field.

Acknowledgments

Jose García Moreno-Torres is currently supported by an FPU Grant from the Ministerio de Educación y Ciencia of the Spanish Government. This work was supported in part by the Spanish Government's KEEL project (TIN2008-06681-C06-01). This work was also supported in part by the National Science Foundation (NSF) Grant ECCS-0926170. Lastly, the work was also partially supported by the Spanish projects DPI2009-08424 and TEC2008-01348/TEC.


References (62)

  • R. Alaiz-Rodríguez, A. Guerrero-Curieses, J. Cid-Sueiro, Classification under changes in class and within-class...

  • R. Alaiz-Rodríguez, N. Japkowicz, Assessing the impact of changing environments on classifier performance, in:...

  • J. Banasik et al., Sample selection bias in credit scoring models, Journal of the Operational Research Society (2003).

  • M. Barreno et al., The security of machine learning, Machine Learning (2010).

  • M. Basu et al., Data Complexity in Pattern Recognition (2006).

  • S. Bickel, M. Brückner, T. Scheffer, Discriminative learning for differing training and test distributions, in:...

  • S. Bickel et al., Discriminative learning under covariate shift, Journal of Machine Learning Research (2009).

  • B. Biggio et al., Multiple classifier systems for robust classifier design in adversarial environments, International Journal of Machine Learning and Cybernetics (2010).

  • C.E. Brodley et al., Identifying mislabeled training data, Journal of Artificial Intelligence Research (1999).

  • N. Chawla et al., Learning from labeled and unlabeled data: an empirical study across techniques and domains, Journal of Artificial Intelligence Research (2005).

  • D.A. Cieslak et al., A framework for monitoring classifiers' performance: when and why failure occurs?, Knowledge and Information Systems (2009).

  • N. Dalvi, P. Domingos, Mausam, S. Sanghai, D. Verma, Adversarial classification, in: Proceedings of the 10th ACM SIGKDD...

  • T.G. Dietterich et al., Special issue on context sensitivity and concept drift, Machine Learning (1998).

  • C. Drummond, R.C. Holte, Explicitly representing expected cost: an alternative to ROC representation, in: Proceedings...

  • T. Fawcett et al., A response to Webb and Ting's ‘on the application of ROC analysis to predict classification performance under varying class distributions’, Machine Learning (2005).

  • A. Globerson et al., An adversarial view of covariate shift and a minimax approach.

  • A. Gretton et al., Covariate shift by kernel mean matching.

  • D. Hand, Reject inference in credit operations, in: Credit Risk Modeling: Design and Application (1998).

  • D. Hand et al., Statistical classification methods in consumer credit scoring: a review, Journal of the Royal Statistical Society: Series A (1997).

  • D.J. Hand, Rejoinder: classifier technology and the illusion of progress, Statistical Science (2006).

  • H. He et al., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering (2009).

Jose G. Moreno-Torres received the M.Sc. degree in Computer Science in 2008 from the University of Granada, Spain. After spending a year as a fellow of an international “la Caixa” scholarship, during which he did research at the IlliGAL laboratory under the supervision of Prof. David E. Goldberg, he is currently a Ph.D. candidate under the supervision of Prof. Francisco Herrera, working with the Soft Computing and Intelligent Information Systems Group in the Department of Computer Science and Artificial Intelligence at the University of Granada. His current research interests include dataset shift, imbalanced classification, bibliometrics and multi-instance learning.

Troy Raeder is a Ph.D. student in Computer Science at the University of Notre Dame in South Bend, IN, USA. His research interests include scenario analysis in machine learning, evaluation methodologies in machine learning, and robust models for changing data distributions. He received B.S. and M.S. degrees in Computer Science from Notre Dame in 2005 and 2009, respectively.

Rocío Alaiz-Rodríguez received the B.S. degree in Electrical Engineering from the University of Valladolid, Spain, in 1999 and the Ph.D. degree from Carlos III University of Madrid, Spain. She is currently an Associate Professor at the Department of Electrical and Systems Engineering, University of Leon, Spain. Her research interests include learning theory, statistical pattern recognition, neural networks and their applications to image processing and quality assessment (in particular, food and frozen–thawed animal semen).

Nitesh V. Chawla is an Associate Professor in the Department of Computer Science and Engineering at the University of Notre Dame. He directs the Data Inference Analysis and Learning Lab (DIAL) and is also the co-director of the Interdisciplinary Center of the Network Science and Applications (iCenSA) at Notre Dame. His research is supported with research grants from organizations such as the National Science Foundation, the National Institute of Justice, the Army Research Labs, and Industry Sponsors. His research group has received numerous honors, including best papers, outstanding dissertation, and a variety of fellowships. He has also been noted for his teaching accomplishments, receiving the National Academy of Engineers CASEE New Faculty Fellowship, and the Outstanding Undergraduate Teacher Award in 2008 and 2011. He is an Associate Editor for IEEE Transactions on Systems, Man, and Cybernetics, Part B and Pattern Recognition Letters. More information is available at http://www.nd.edu/∼nchawla.

Francisco Herrera received his M.Sc. degree in Mathematics in 1988 and Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has had more than 200 papers published in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001). He currently acts as Editor in Chief of the international journal “Progress in Artificial Intelligence” (Springer) and serves as Area Editor of the Journal Soft Computing (area of evolutionary and bioinspired algorithms) and International Journal of Computational Intelligence Systems (area of information systems). He acts as Associate Editor of the journals: IEEE Transactions on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”, and International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010). His current research interests include computing with words and decision-making, data mining, bibliometrics, data preparation, instance selection, fuzzy rule-based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.
