Information Sciences

Volume 218, 1 January 2013, Pages 146-164

Class distribution estimation based on the Hellinger distance

https://doi.org/10.1016/j.ins.2012.05.028

Abstract

Class distribution estimation (quantification) plays an important role in many practical classification problems. Firstly, it is needed in order to adapt the classifier to the operational conditions when these differ from those assumed in learning. In addition, there are real domains where the quantification task is valuable in itself due to the high variability of the class prior probabilities. Our novel quantification approach for two-class problems is based on distributional divergence measures. The mismatch between the test data distribution and validation distributions generated in a fully controlled way is measured by the Hellinger distance in order to estimate the prior probability that minimizes this divergence. Experimental results on several binary classification problems show the benefits of this approach when compared with approaches such as counting the predicted class labels and other methods based on the classifier confusion matrix or on posterior probability estimates. We also illustrate these techniques, as well as their robustness to the performance of the base classifier (a neural network), in a boar semen quality control setting. Empirical results show that quantification can be conducted with a mean absolute error lower than 0.008, which is very promising in this field.

Introduction

Supervised learning aims at computing a classifier with good prediction ability on future unseen data. A set of labeled instances is required to train the classifier. Once the classifier is designed, it is assumed to be applied as-is to new data in order to predict the class to which each individual belongs.

Most work assumes that training and future (test) data follow the same, although unknown, distribution [1]. In particular, class prior probabilities estimated from the training data set are considered to truly reflect the class distribution of the operational environment. However, class stationarity over time or space cannot be assumed in many practical fields [2], [3]. For example, if a system for word sense disambiguation is trained using words from a certain domain (e.g. sports news) but is then used with instances from a different domain (e.g. political news), where the sense priors are different, the classifier will be affected [4]. Remote sensing applications also suffer from this problem, since a dataset collected in a given season from a region with different terrains (industrial, hay, wheat, corn, grass, etc.) is usually employed to train the classifier. However, when that classifier is deployed, mismatches in terrain distribution may appear simply because of seasonal or location changes [5]. Another illustrative example is direct mail marketing, as the target proportion or customer profile may vary from one area to another.

It is well known that a mismatch between the actual class prior probabilities and those for which the classifier has been optimized leads to suboptimal solutions [1]. Whenever there is such a change, some authors rely on eventual perfect knowledge of the new conditions by the end user [6], but when this is not possible, estimating the new class proportions is important in adapting the classifier to the new context [7], [8], [9]. Adapting the classifier to the new operating conditions, based on an unlabeled data set, is a problem that has lately received a lot of attention from several perspectives [7], [8], [9], [10], [11], [12], [13], [14], with the ultimate goal of improving individual classification performance. Some techniques include those described in [8], [11], [15], [16], [17]. Wang et al. [18] proposed a method for video annotation (formulated as a classification task) when there is a large variation in the training data (i.e. the assumed model may change). This method uses an iterative process to update class densities and posterior probabilities, similar to the approach of Saerens et al. [8], which is based on Bayes' rule.

In other applications where the class proportions are subject to high variability, their estimation is valuable in itself, in particular when the classes are imbalanced [19]. For instance, artificial insemination techniques in the veterinary field should guarantee that semen samples are optimal for fertilization. There is a direct relationship between sperm fertility and the state of the acrosome: a sample containing a high percentage of spermatozoa with a damaged acrosome will not be useful for fertilizing purposes [20]. In this case, the class prior probabilities estimated from the labeled training data cannot be considered representative of future samples, since they are subject to variation due to factors like animal/farm variability or the manipulation and conservation conditions. Quantifying the proportion of damaged cells is traditionally carried out manually, using stains, which makes this process time-consuming, costly and, more importantly, subjective. In this field, then, the aim is to estimate the proportion of damaged cells with no concern about the individual classification of each one [21].

To the best of our knowledge, only a few works directly address the problem of estimating the class distribution (also known as quantification) in real domains. Quantification has been applied to domains such as quality control [22], [23], [24], news categorization [25], [26], analysis of technical-support call logs [27] and word sense disambiguation [4].

To sum up, estimating the class prior probabilities of an unlabeled dataset plays an important role in supervised learning: it allows changes in classifier performance due to shifts in the class prior probabilities to be detected (assuming that the class-conditional densities are fixed), and it allows the classifier to be adapted to the new operating conditions whenever this is possible. It is also important in applications where the class distribution shows high variability and its estimation is of practical interest.

The quantification techniques proposed in the literature are based either on the classifier confusion matrix [7], [4], [25], [26], [9] or on the posterior probability estimates provided by the classifier [4], [23], [22], [24], [28]. Forman has also explored a method, the Mixture Model, based on the estimation of the class-conditional probability densities [26], but when he evaluated it on text classification data sets he found that it was outperformed by simple methods that rely on the confusion matrix. There is also preliminary work based on assessing mismatches between data distributions [24].
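
To make the confusion-matrix family concrete: Forman's Adjusted Count (AC) corrects the raw fraction of positive predictions using the classifier's true and false positive rates, typically estimated by cross-validation on the training data. In the usual notation (with the result clipped to $[0, 1]$),

\[
\widehat{P}(d_1) = \frac{\hat{p}_{CC} - \widehat{fpr}}{\widehat{tpr} - \widehat{fpr}},
\]

where $\hat{p}_{CC}$ is the proportion of test instances that the classifier labels as positive.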

Our proposal for estimating the class distribution is based on measuring distributional divergences. We focus on problems where the class-conditional densities are assumed to be fixed but the class prior probabilities may vary. It is well known that a shift in class prior probabilities between the training and test sets changes the data distributions as well as the classifier output distribution. Basically, our approach assesses the similarity between distributions, comparing the test data distribution with validation data distributions generated in a fully controlled way from the training data set. Finding the class distribution that achieves the maximum similarity (a simple for-loop can be used in binary problems, as in this work) provides the estimated value. A distributional divergence metric, the Hellinger Distance (HD) [29], may be applied at different stages of the classification process: (i) between the data distributions themselves for each input feature (referred to as HDx) and (ii) between the classifier output distributions (referred to as HDy). The HDy proposal is similar to the Mixture Model of [30], but we use the HD to measure the goodness of fit instead of the PP-Area metric developed in [30] to compare two cumulative distribution functions.
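
For concreteness, when distributions are represented by $b$-bin histograms $P$ and $Q$, one common (unnormalized) form of the Hellinger distance is

\[
HD(P,Q) = \sqrt{\sum_{j=1}^{b} \left(\sqrt{P_j} - \sqrt{Q_j}\right)^2},
\]

and, since the class-conditional densities are assumed fixed, the estimate is the mixing proportion whose validation mixture best matches the test distribution:

\[
\widehat{P}(d_1) = \operatorname*{arg\,min}_{p \in [0,1]} \; HD\big(D_{test},\; p\,D^{1}_{val} + (1-p)\,D^{0}_{val}\big),
\]

where $D^{1}_{val}$ and $D^{0}_{val}$ denote validation distributions built from the positive and negative training samples, respectively. Some texts include a normalization factor of $1/\sqrt{2}$ in $HD$; the exact convention does not affect the minimizer.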

The goals of this paper are: (a) to explore an information-theoretic approach to automatically quantify the class distribution of an unlabeled dataset, (b) to compare it with previously proposed approaches (on 15 applications from the UCI Machine Learning Repository, using a neural-network-based classifier, Naive Bayes and logistic regression) and (c) to evaluate these quantification methods and check whether or not reliable estimates can be achieved in a real application of boar semen analysis. Note that, unlike most prior work, which focuses on text classification tasks, here we apply our algorithms to a variety of domains collected in the UCI repository and to a real computer vision application.

The rest of this paper is organized as follows: Section 2 briefly describes previously proposed approaches to this problem, and Section 3 presents the theoretical approach and the algorithms of the estimation method based on the Hellinger distance proposed in this paper. The empirical evaluation methodology is presented in Section 4, the experimental results are shown in Section 5 and, finally, Section 6 summarizes the main conclusions.

Section snippets

Quantification: the problem of class distribution estimation

Consider a classification problem with a labeled training data set $S_t = \{(\mathbf{x}_k, d_k),\ k = 1, \ldots, K\}$, where $\mathbf{x}_k$ is the feature vector of the $k$th element and $d_k$ is its class label, which takes its value in $\Omega = \{d_0, d_1, \ldots, d_{M-1}\}$.

Let us consider that all the samples $\mathbf{x}_k \in S_t$ have been independently recorded according to the class probability density function $p(\mathbf{x} \mid d_i)$, and that the a priori probability of the class $d_i$ in the training data set $S_t$ is denoted by $P_t(d_i)$. Note that hereafter the subscript $t$ will be used for…

Quantification based on the Hellinger distance

In this section we present two quantification techniques (HDx and HDy) based on assessing the Hellinger Distance (HD) between the test data distribution and a validation data distribution. The HDx approach works directly with the feature vectors $\mathbf{x}$ and therefore does not require any classification model. The HDy proposal works with the outputs $\hat{y}$ that a classifier (calibrated with instances from the training set $S_t$) generates for the samples $\mathbf{x}$. Note that the data HDx works with has a…
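
A minimal sketch of the HDy idea in Python follows, assuming the classifier outputs a score in [0, 1] and that distributions are compared through fixed-width histograms; the bin count, the search grid and all function names are illustrative choices, not the paper's exact settings.

```python
# Sketch of HDy-style quantification: find the mixing proportion of the
# per-class validation score distributions that best matches the test
# score distribution under the Hellinger distance. Illustrative only.
import numpy as np

def hellinger(p, q):
    """Hellinger distance (unnormalized) between two histogram distributions."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdy_quantify(scores_pos, scores_neg, scores_test, n_bins=10, grid=101):
    """Estimate the positive-class prior of an unlabeled test set.

    scores_pos / scores_neg: classifier outputs on validation samples of
    each class; scores_test: outputs on the unlabeled test set.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Normalized histograms of the classifier output per class and for test.
    h_pos = np.histogram(scores_pos, bins=bins)[0] / len(scores_pos)
    h_neg = np.histogram(scores_neg, bins=bins)[0] / len(scores_neg)
    h_test = np.histogram(scores_test, bins=bins)[0] / len(scores_test)

    # Simple for-loop over candidate priors, as suggested for binary
    # problems: pick the mixture closest to the test distribution.
    candidates = np.linspace(0.0, 1.0, grid)
    distances = [hellinger(p * h_pos + (1 - p) * h_neg, h_test)
                 for p in candidates]
    return candidates[int(np.argmin(distances))]
```

HDx follows the same template, but builds the histograms directly from each input feature (so no classifier is needed) and aggregates the per-feature distances, e.g. by averaging, before searching for the minimizing prior.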

Datasets

In this paper several public real-world datasets have been used to assess the performance of the quantification methods, which have also been tested in a real computer-vision-based quality control application of boar sperm.

Experiments and results

In this section, the quantification techniques based on the Hellinger Distance (HDx and HDy) proposed in this work, as well as other previous approaches, are evaluated on the 15 datasets presented in Section 4.1.1, and then in the context of the real boar semen quality control application described in Section 4.1.2. Comparisons are carried out with the baseline approach Classify and Count (CC), the method Adjusted Count (AC) by Forman, based on the classifier confusion matrix, and the Median…

Conclusions and future work

In this work we have addressed the problem of automatically estimating the class distribution of an unlabeled dataset (also known as quantification).

Our proposal is based on the Hellinger distance, used to measure distributional divergences between a new dataset and validation sets with known class proportions and to find the most similar one (quantification method HDx). Alternatively, a classifier can be applied and its outputs for the data examples used instead (quantification method HDy).

These…

Acknowledgments

This work has been partially supported by the research projects DPI2009-08424 and TEC2011-22480 from the Spanish Ministry of Education and Science.

The authors thank CENTROTEC for providing the semen samples and for their collaboration in the acquisition of the sperm images.

References (44)

  • S. Gu, Y. Tan, X. He, Recentness biased learning for time series forecasting, Information Sciences, …
  • Y.S. Chan, H.T. Ng, Estimating class priors in domain adaptation for word sense disambiguation, in: ACL-44: Proc. of …
  • A. Guerrero-Curieses et al., Cost-sensitive and modular land-cover classification based on posterior probability estimates, International Journal of Remote Sensing (2009)
  • C. Drummond et al., Cost curves: an improved method for visualizing classifier performance, Machine Learning (2006)
  • S. Vucetic, Z. Obradovic, Classification on data with biased class distribution, in: Proceedings of the 12th European …
  • M. Saerens et al., Adjusting a classifier for new a priori probabilities: a simple procedure, Neural Computation (2002)
  • J.C. Xue, G.M. Weiss, Quantification and semi-supervised classification methods for handling changes in class …
  • R. Alaiz-Rodríguez, N. Japkowicz, Assessing the impact of changing environments on classifier performance, in: …
  • C. Seiffert, T.M. Khoshgoftaar, J.V. Hulse, A. Folleco, An empirical study of the classification performance of …
  • F. Provost et al., Robust classification systems for imprecise environments, Machine Learning (2001)
  • H. He et al., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering (2009)
  • R. Yanagimachi (1994)