A unifying view on dataset shift in classification
Highlights
- Presentation of a unifying framework for the field of dataset shift, focusing on classification.
- Analysis of the terminology used in the most relevant works of the field.
- Formal definitions for each of the concepts appearing in the study of dataset shift, including sample selection bias.
Introduction
The machine learning community has analyzed data quality in classification problems from different perspectives, including data complexity [29], [7], missing values [19], [21], [39], noise [11], [64], [58], [38], imbalance [52], [27], [53] and, as is the case with this paper, dataset shift [4], [44], [14]. Dataset shift occurs when the testing (unseen) data experience a phenomenon that leads to a change in the distribution of a single feature, a combination of features, or the class boundaries. As a result, the common assumption that the training and testing data follow the same distributions is often violated in real-world applications and scenarios.
While the research area of dataset shift has received significant attention in recent years (most of the work has been published in the last eight years), the field suffers from a lack of standard terminology: independent authors working under different conditions use different terms, making it difficult to find and compare proposals and studies in the field.
Contributions. The main goal of this work is to provide a unifying framework through the review and analysis of some of the most important publications in the field, comparing the terminology used in each of them and the exact definitions that were given. We present a framework that can be useful in future research and, at the same time, give researchers unfamiliar with the topic a brief introduction to it. Our aim is not only to unify the different methods and terminologies under a taxonomical structure, but also to provide a guide for both researchers and practitioners in machine learning and pattern recognition. We use the notation in [44] as the base for the comparisons. We also present a brief summary of solutions proposed in the literature.
The remainder of this paper is organized as follows: Some basic notation is introduced in Section 2. In Section 3, an analysis of the name given to the field of study is presented. Section 4 details the terminology used for the different types of dataset shift that can appear. Section 5 presents examples demonstrating the effect of these shifts on classifier performance. An analysis of some common causes of dataset shift is presented in Section 6. A brief summary of the solutions proposed in the literature is shown in Section 7. Finally, some conclusions are presented in Section 8.
Section snippets
Notation
In this work, we focus on the analysis of dataset shift in classification problems. A classification problem is defined by:
- A set of features or covariates x.
- A target variable y (the class variable).
- A joint distribution P(y, x).
When analyzing dataset shift, the relationships between the covariates and the class label are particularly relevant. Fawcett and Flach [20] proposed a taxonomy to classify problems according to an intrinsic property of the data generation process: the causal relationship
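The two families in this taxonomy (problems where the covariates determine the class label, and problems where the class label determines the covariate values) line up with the two product-rule factorizations of the joint distribution, which can be sketched as:

```latex
% X -> Y problems: predict the effect y from its causes x.
P(y, x) = P(y \mid x)\,P(x)

% Y -> X problems: the class determines the covariate values,
% e.g. a disease (y) producing symptoms (x).
P(y, x) = P(x \mid y)\,P(y)
```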
Dataset shift
The term “dataset shift” was first used in the book by Quiñonero-Candela et al. [44], the first compilation of work in the field, where it was defined as “cases where the joint distribution of inputs and outputs differs between training and test stage” [49].
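In symbols (writing P_tra and P_tst for the training- and test-stage distributions, a shorthand adopted here), that definition reads:

```latex
% Dataset shift: the joint distribution of inputs and outputs
% differs between the training and test stages.
P_{tra}(y, x) \neq P_{tst}(y, x)
```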
One of the main problems in the field is the lack of visibility most works suffer from, since there is not even a standard term for the phenomenon. So far, each author has chosen a different name to refer to the same basic idea. As an example, the following
Types of dataset shift
In this section, we present an analysis of the different kinds of shift that can appear in a classification problem. Section 4.1 deals with covariate shift, while Sections 4.2 and 4.3 explain prior probability shift and concept shift, respectively. A graphical example is introduced to illustrate each of these cases. The section is closed with Section 4.4, where other potential types of shift are explained.
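As a compact summary in the notation of Section 2 (the precise definitions, and the role of the X → Y versus Y → X problem distinction, are developed in the section itself):

```latex
% Covariate shift (X -> Y problems): the input distribution changes.
P_{tra}(x) \neq P_{tst}(x), \qquad P_{tra}(y \mid x) = P_{tst}(y \mid x)

% Prior probability shift (Y -> X problems): the class priors change.
P_{tra}(y) \neq P_{tst}(y), \qquad P_{tra}(x \mid y) = P_{tst}(x \mid y)

% Concept shift: the relationship between inputs and labels changes,
% e.g. (in X -> Y problems):
P_{tra}(y \mid x) \neq P_{tst}(y \mid x)
```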
Examples of the relevance of dataset shift
The examples presented in Sections 4.1 and 4.2 were designed to showcase as clearly as possible what covariate shift and prior probability shift mean. However, they do not show why their study is important: the negative effect dataset shift often has on classifier performance.
This section presents new examples for both covariate shift and prior probability shift, where these shifts actually produce a change in the Bayes error boundary.
Fig. 4 depicts a case of covariate
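A minimal numeric sketch of this kind of effect (not the paper's Fig. 4; the sinusoidal concept and the deliberately misspecified threshold model are choices of ours): a fixed P(y|x) combined with a shifted P(x) degrades a model that fit the training region well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed concept: y = 1 exactly when sin(x) > 0, so P(y|x) never changes;
# only the input distribution P(x) differs between training and test.
label = lambda x: (np.sin(x) > 0).astype(int)

x_train = rng.uniform(0.0, 4.0, 5000)   # training inputs
x_test  = rng.uniform(5.0, 8.0, 5000)   # shifted test inputs (covariate shift)
y_train, y_test = label(x_train), label(x_test)

# Deliberately simple, misspecified model: predict class 1 iff x < t.
ts = np.linspace(0.0, 8.0, 801)
train_accs = np.array([np.mean((x_train < t) == y_train) for t in ts])
t_best = ts[train_accs.argmax()]        # lands near pi on the training data

acc_train = np.mean((x_train < t_best) == y_train)
acc_test  = np.mean((x_test  < t_best) == y_test)
print(f"t = {t_best:.2f}: train accuracy {acc_train:.2f}, test accuracy {acc_test:.2f}")
```

The threshold model is near-perfect on the training region but close to chance on the shifted test region, even though the underlying concept never changed.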
Causes of dataset shift
In this section we comment on some of the most common causes of dataset shift. These concepts have at times created confusion, so it is important to remark that they are factors that can lead to the appearance of some of the shifts explained in Section 4, but they do not constitute dataset shift themselves.
There are several possible causes for dataset shift, out of which this section mentions the two we deem most important: Sample selection bias and non-stationary environments. In the
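The sample-selection-bias mechanism can be sketched in a few lines (the sigmoid selection probability below is a hypothetical choice): when the probability of entering the training sample depends on x itself, the covariate distribution observed at training time no longer matches the population's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population covariate: x ~ N(0, 1).
x = rng.normal(0.0, 1.0, 100_000)

# Hypothetical selection mechanism: the chance that an example ends up
# in the training sample grows with x (a sigmoid in x).
p_select = 1.0 / (1.0 + np.exp(-2.0 * x))
selected = rng.random(x.size) < p_select
x_train = x[selected]

print(f"population mean {x.mean():+.3f} vs selected-sample mean {x_train.mean():+.3f}")
```

The selected sample is visibly shifted toward large x, so a classifier trained on it faces covariate shift at deployment even though the population itself is stationary.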
Proposals in the literature for the analysis of dataset shift
In this section we give a brief overview of the different proposals that have appeared in the literature to work under the different types of dataset shift.
Covariate shift has been extensively studied in the literature, and a number of proposals to work under it have been published. Some of the most important ones include weighting the log-likelihood function [47], importance-weighted cross-validation [51], asymptotic Bayesian generalization error [59], discriminative learning [9], kernel mean
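The first of these ideas, weighting by the density ratio w(x) = P_tst(x)/P_tra(x) as in [47], can be sketched for a least-squares fit (under Gaussian noise the weighted log-likelihood reduces to weighted least squares; both densities are assumed known here, whereas real methods must estimate the ratio):

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Hypothetical setup: training inputs uniform on [-1, 3], test inputs
# N(2, 0.5), true regression function sin(x).
x_tr = rng.uniform(-1.0, 3.0, 1000)
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=x_tr.size)
w = gauss(x_tr, 2.0, 0.5) / 0.25        # w(x) = p_test(x) / p_train(x); p_train = 1/4

# np.polyfit minimizes sum(w_i**2 * residual_i**2), so pass sqrt(w).
plain    = np.polyfit(x_tr, y_tr, 1)
weighted = np.polyfit(x_tr, y_tr, 1, w=np.sqrt(w))

x_te = rng.normal(2.0, 0.5, 1000)
mse = lambda c: np.mean((np.polyval(c, x_te) - np.sin(x_te)) ** 2)
print(f"test MSE: unweighted {mse(plain):.3f}, weighted {mse(weighted):.3f}")
```

The weighted fit concentrates on the region the test distribution cares about, which is exactly the effect the weighted log-likelihood of [47] aims for with a misspecified model.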
Concluding remarks
In many practical applications of machine learning, the data available for model-building (training data) are not strictly representative of the data on which the classifier will ultimately be deployed (test data). This problem, which we call dataset shift in accordance with [44], generalizes a wide variety of research efforts that are scattered throughout the machine learning literature. The purpose of this paper is to survey and unify this research in order to better inform future endeavors in the
Acknowledgments
Jose García Moreno-Torres is currently supported by an FPU Grant from the Ministerio de Educación y Ciencia of the Spanish Government. This work was supported in part by the Spanish Government's KEEL project (TIN2008-06681-C06-01). This work was also supported in part by the National Science Foundation (NSF) Grant ECCS-0926170. Lastly, the work was also partially supported by the Spanish projects DPI2009-08424 and TEC2008-01348/TEC.
References (62)
- et al., Comparing classifiers when the misallocation costs are uncertain, Pattern Recognition (1999)
- et al., Does reject inference really improve the performance of application scoring models?, Journal of Banking & Finance (2004)
- et al., Impact of imputation of missing values on classification error for discrete data, Pattern Recognition (2008)
- et al., Selection–fusion approach for classification of datasets with missing values, Pattern Recognition (2010)
- et al., Multiple ellipses detection in noisy environments: a hierarchical approach, Pattern Recognition (2009)
- et al., A study on the use of imputation methods for experimentation with Radial Basis Function Network classifiers handling missing attribute values: the good synergy between RBFNs and the EventCovering method, Neural Networks (2010)
- Improving predictive inference under covariate shift by weighting the log-likelihood function, Journal of Statistical Planning and Inference (2000)
- et al., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition (2007)
- et al., Conceptual equivalence for contrast mining in classification learning, Data & Knowledge Engineering (2008)
- et al., Minimax regret classifier for imprecise class distributions, Journal of Machine Learning Research (2007)
- Sample selection bias in credit scoring models, Journal of the Operational Research Society
- The security of machine learning, Machine Learning
- Data Complexity in Pattern Recognition
- Discriminative learning under covariate shift, Journal of Machine Learning Research
- Multiple classifier systems for robust classifier design in adversarial environments, International Journal of Machine Learning and Cybernetics
- Identifying mislabeled training data, Journal of Artificial Intelligence Research
- Learning from labeled and unlabeled data: an empirical study across techniques and domains, Journal of Artificial Intelligence Research
- A framework for monitoring classifiers' performance: when and why failure occurs?, Knowledge and Information Systems
- Special issue on context sensitivity and concept drift, Machine Learning
- A response to Webb and Ting's 'On the application of ROC analysis to predict classification performance under varying class distributions', Machine Learning
- An adversarial view of covariate shift and a minimax approach
- Covariate shift by kernel mean matching
- Reject inference in credit operations, in: Credit Risk Modeling: Design and Application
- Statistical classification methods in consumer credit scoring: a review, Journal of the Royal Statistical Society: Series A
- Rejoinder: classifier technology and the illusion of progress, Statistical Science
- Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering
Cited by (827)
- A review of machine learning for modeling air quality: Overlooked but important issues, Atmospheric Research, 2024
- Neural network informed photon filtering reduces fluorescence correlation spectroscopy artifacts, Biophysical Journal, 2024
- Why do probabilistic clinical models fail to transport between sites, npj Digital Medicine, 2024
- Subdomain adaptation via correlation alignment with entropy minimization for unsupervised domain adaptation, Pattern Analysis and Applications, 2024
Jose G. Moreno-Torres received the M.Sc. degree in Computer Science in 2008 from the University of Granada, Spain. After spending a year as a fellow of an international “la Caixa” scholarship, during which he did research at the IlliGAL laboratory under the supervision of Prof. David E. Goldberg, he is currently a Ph.D. candidate under the supervision of Prof. Francisco Herrera, working with the Soft Computing and Intelligent Information Systems Group in the Department of Computer Science and Artificial Intelligence at the University of Granada. His current research interests include dataset shift, imbalanced classification, bibliometrics and multi-instance learning.
Troy Raeder is a Ph.D. student in Computer Science at the University of Notre Dame in South Bend, IN, USA. His research interests include scenario analysis in machine learning, evaluation methodologies in machine learning, and robust models for changing data distributions. He received B.S. and M.S. degrees in Computer Science from Notre Dame in 2005 and 2009, respectively.
Rocío Alaiz-Rodríguez received the B.S. degree in Electrical Engineering from the University of Valladolid, Spain, in 1999 and the Ph.D. degree from Carlos III University of Madrid, Spain. She is currently an Associate Professor at the Department of Electrical and Systems Engineering, University of Leon, Spain. Her research interests include learning theory, statistical pattern recognition, neural networks and their applications to image processing and quality assessment (in particular, food and frozen–thawed animal semen).
Nitesh V. Chawla is an Associate Professor in the Department of Computer Science and Engineering at the University of Notre Dame. He directs the Data Inference Analysis and Learning Lab (DIAL) and is also the co-director of the Interdisciplinary Center of the Network Science and Applications (iCenSA) at Notre Dame. His research is supported with research grants from organizations such as the National Science Foundation, the National Institute of Justice, the Army Research Labs, and Industry Sponsors. His research group has received numerous honors, including best papers, outstanding dissertation, and a variety of fellowships. He has also been noted for his teaching accomplishments, receiving the National Academy of Engineers CASEE New Faculty Fellowship, and the Outstanding Undergraduate Teacher Award in 2008 and 2011. He is an Associated Editor for IEEE Transactions of Systems, Man and Cybernetics Part B and Pattern Recognition Letters. More information is available at http://www.nd.edu/∼nchawla.
Francisco Herrera received his M.Sc. degree in Mathematics in 1988 and Ph.D. degree in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has had more than 200 papers published in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001). He currently acts as Editor in Chief of the international journal “Progress in Artificial Intelligence” (Springer) and serves as Area Editor of the Journal Soft Computing (area of evolutionary and bioinspired algorithms) and International Journal of Computational Intelligence Systems (area of information systems). He acts as Associated Editor of the journals: IEEE Transactions on Fuzzy Systems, Information Sciences, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Knowledge and Information Systems, Information Fusion, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, Swarm and Evolutionary Computation. He received the following honors and awards: ECCAI Fellow 2009, 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”, and International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010). His current research interests include computing with words and decision-making, data mining, bibliometrics, data preparation, instance selection, fuzzy rule-based systems, genetic fuzzy systems, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms.