Information Sciences

Volume 218, 1 January 2013, Pages 146-164

Class distribution estimation based on the Hellinger distance

https://doi.org/10.1016/j.ins.2012.05.028

Abstract

Class distribution estimation (quantification) plays an important role in many practical classification problems. Firstly, it is needed in order to adapt the classifier to the operational conditions when these differ from those assumed in learning. In addition, there are real domains where the quantification task is valuable in itself due to the high variability of the class prior probabilities. Our novel quantification approach for two-class problems is based on distributional divergence measures. The mismatch between the test data distribution and validation distributions generated in a fully controlled way is measured by the Hellinger distance in order to estimate the prior probability that minimizes this divergence. Experimental results on several binary classification problems show the benefits of this approach when compared with approaches such as counting the predicted class labels and other methods based on the classifier confusion matrix or on posterior probability estimates. We also illustrate these techniques, as well as their robustness to the performance of the base classifier (a neural network), in a boar semen quality control setting. Empirical results show that quantification can be conducted with a mean absolute error lower than 0.008, which is very promising in this field.

Introduction

Supervised learning aims at computing a classifier with good prediction ability on future unseen data. A set of labeled instances is required to train the classifier. Once the classifier is designed, it is assumed to be applied as-is to new data in order to predict the class to which each individual belongs.

Most work assumes that training and future (test) data follow the same, although unknown, distribution [1]. In particular, class prior probabilities estimated from the training data set are considered to truly reflect the class distribution of the operational environment. However, class stationarity over time or space cannot be assumed in many practical fields [2], [3]. For example, if a system for word sense disambiguation is trained using words from a certain domain (e.g. sports news) but is then used with instances from a different domain (e.g. political news), where the sense priors are different, the classifier will be affected [4]. Remote sensing applications also suffer from this problem, since a dataset collected in a given season from a region with different terrains (industrial, hay, wheat, corn, grass, etc.) is usually employed to train the classifier. However, when that classifier is deployed, mismatches in terrain distribution may appear simply because of seasonal or location changes [5]. Another illustrative example is direct mail marketing, as the target proportion or customer profile may vary from one area to another.

It is well known that a mismatch between the actual class prior probabilities and those for which the classifier has been optimized leads to suboptimal solutions [1]. Whenever there is such a change, some authors rely on eventual perfect knowledge of the new conditions by the end user [6], but when this is not possible, estimating the new class proportions is important in adapting the classifier to the new context [7], [8], [9]. Adapting the classifier to the new operating conditions, based on an unlabeled data set, is a problem that has lately received a lot of attention from several perspectives [7], [8], [9], [10], [11], [12], [13], [14], with the ultimate goal of improving individual classification performance. Some techniques include those described in [8], [11], [15], [16], [17]. Wang et al. [18] proposed a method for video annotation (formulated as a classification task) when there is a large variation in the training data (i.e. the assumed model may change). This method uses an iterative process to update class densities and posterior probabilities, similar to the approach of Saerens et al. [8], which is based on Bayes' rule.

In other applications where the class proportions are subject to high variability, their estimation is valuable in itself, in particular when the classes are imbalanced [19]. For instance, artificial insemination techniques in the veterinary field should guarantee that semen samples are optimal for fertilization. There is a direct relationship between sperm fertility and the state of the acrosome: a sample containing a high percentage of spermatozoa with a damaged acrosome will not be useful for fertilizing purposes [20]. In this case, the class prior probabilities estimated from the labeled training data cannot be considered representative of future samples, since they are subject to variation due to factors like animal/farm variability or the manipulation and conservation conditions. Quantifying the proportion of damaged cells is traditionally carried out manually, using stains, which makes this process time-consuming, costly and, more importantly, subjective. In this field, then, the aim is to estimate the proportion of damaged cells with no concern about the individual classification of each one [21].

To the best of our knowledge, only a few works directly address the problem of estimating the class distribution (also known as quantification) in real domains. Quantification has been applied to domains such as quality control [22], [23], [24], news categorization [25], [26], analysis of technical-support call logs [27] and word sense disambiguation [4].

To sum up, estimating the class prior probabilities of an unlabeled dataset plays an important role in supervised learning: it allows changes in classifier performance due to shifts in the class prior probabilities to be detected (assuming that the class-conditional densities are fixed), and it allows the classifier to be adapted to the new operating conditions whenever this is possible. It is also important in applications where the class distribution shows high variability and its estimation is of practical interest.

The quantification techniques proposed in the literature are based either on the classifier confusion matrix [7], [4], [25], [26], [9] or on the posterior probability estimates provided by the classifier [4], [23], [22], [24], [28]. Forman has also explored a method, the Mixture Model, based on the estimation of the class-conditional probability densities [26], but when he evaluated it on text classification data sets he found that it was outperformed by simple methods that rely on the confusion matrix. There is also preliminary work based on assessing mismatches between data distributions [24].
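
To make the confusion-matrix family concrete: Forman's Adjusted Count (AC) corrects the raw fraction of positive predictions using the classifier's true and false positive rates, typically estimated by cross-validation on the training data. In the usual notation (with the result clipped to $[0, 1]$),

\[
\widehat{P}(d_1) = \frac{\hat{p}_{CC} - \widehat{fpr}}{\widehat{tpr} - \widehat{fpr}},
\]

where $\hat{p}_{CC}$ is the proportion of test instances that the classifier labels as positive.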

Our proposal for estimating the class distribution is based on measuring distributional divergences. We focus on problems where the class-conditional densities are assumed to be fixed but the class prior probabilities may vary. It is well known that a shift in class prior probabilities between the training and test sets changes the data distributions as well as the classifier output distribution. Basically, our approach assesses the similarity between distributions, comparing the test data distribution with validation data distributions generated in a fully controlled way from the training data set. Finding the class distribution that achieves the maximum similarity (a simple for-loop can be used in binary problems, as in this work) provides the estimated value. A distributional divergence metric, the Hellinger Distance (HD) [29], may be applied at different stages of the classification process: (i) between the data distributions themselves for each input feature (referred to as HDx) and (ii) between the classifier output distributions (referred to as HDy). The HDy proposal is similar to the Mixture Model of [30], but we use the HD to measure the goodness of fit instead of the PP-Area metric developed in [30] to compare two cumulative distribution functions.
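
For concreteness, when distributions are represented by $b$-bin histograms $P$ and $Q$, one common (unnormalized) form of the Hellinger distance is

\[
HD(P,Q) = \sqrt{\sum_{j=1}^{b} \left(\sqrt{P_j} - \sqrt{Q_j}\right)^2},
\]

and, since the class-conditional densities are assumed fixed, the estimate is the mixing proportion whose validation mixture best matches the test distribution:

\[
\widehat{P}(d_1) = \operatorname*{arg\,min}_{p \in [0,1]} \; HD\big(D_{test},\; p\,D^{1}_{val} + (1-p)\,D^{0}_{val}\big),
\]

where $D^{1}_{val}$ and $D^{0}_{val}$ denote validation distributions built from the positive and negative training samples, respectively. Some texts include a normalization factor of $1/\sqrt{2}$ in $HD$; the exact convention does not affect the minimizer.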

The goals of this paper are: (a) to explore an information-theoretic approach to automatically quantify the class distribution of an unlabeled dataset, (b) to compare it with previously proposed approaches (on 15 applications from the UCI Machine Learning Repository, using a neural-network-based classifier, Naive Bayes and logistic regression) and (c) to evaluate these quantification methods and check whether or not reliable estimates can be achieved in a real application of boar semen analysis. Note that, unlike most prior work, which focuses on text classification tasks, here we apply our algorithms to a variety of domains collected in the UCI repository and to a real computer vision application.

The rest of this paper is organized as follows: Section 2 briefly describes previously proposed approaches to this problem, and Section 3 presents the theoretical approach and the algorithms of the estimation method based on the Hellinger distance proposed in this paper. The empirical evaluation methodology is presented in Section 4, the experimental results are shown in Section 5 and, finally, Section 6 summarizes the main conclusions.

Section snippets

Quantification: the problem of class distribution estimation

Consider a classification problem with a labeled training data set $S_t = \{(\mathbf{x}_k, d_k),\ k = 1, \ldots, K\}$, where $\mathbf{x}_k$ is the feature vector of the $k$th element and $d_k$ is its class label, which takes its value in $\Omega = \{d_0, d_1, \ldots, d_{M-1}\}$.

Let us consider that all the samples $\mathbf{x}_k \in S_t$ have been independently recorded according to the class probability density function $p(\mathbf{x} \mid d_i)$, and that the a priori probability of the class $d_i$ in the training data set $S_t$ is denoted by $P_t(d_i)$. Note that hereafter the subscript $t$ will be used for…

Quantification based on the Hellinger distance

In this section we present two quantification techniques (HDx and HDy) based on assessing the Hellinger Distance (HD) between the test data distribution and a validation data distribution. The HDx approach works directly with the feature vectors $\mathbf{x}$ and therefore does not require any classification model. The HDy proposal works with the outputs $\hat{y}$ that a classifier (calibrated with instances from the training set $S_t$) generates for the samples $\mathbf{x}$. Note that the data HDx works with has a…
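
A minimal sketch of the HDy idea in Python follows, assuming the classifier outputs a score in [0, 1] and that distributions are compared through fixed-width histograms; the bin count, the search grid and all function names are illustrative choices, not the paper's exact settings.

```python
# Sketch of HDy-style quantification: find the mixing proportion of the
# per-class validation score distributions that best matches the test
# score distribution under the Hellinger distance. Illustrative only.
import numpy as np

def hellinger(p, q):
    """Hellinger distance (unnormalized) between two histogram distributions."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdy_quantify(scores_pos, scores_neg, scores_test, n_bins=10, grid=101):
    """Estimate the positive-class prior of an unlabeled test set.

    scores_pos / scores_neg: classifier outputs on validation samples of
    each class; scores_test: outputs on the unlabeled test set.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Normalized histograms of the classifier output per class and for test.
    h_pos = np.histogram(scores_pos, bins=bins)[0] / len(scores_pos)
    h_neg = np.histogram(scores_neg, bins=bins)[0] / len(scores_neg)
    h_test = np.histogram(scores_test, bins=bins)[0] / len(scores_test)

    # Simple for-loop over candidate priors, as suggested for binary
    # problems: pick the mixture closest to the test distribution.
    candidates = np.linspace(0.0, 1.0, grid)
    distances = [hellinger(p * h_pos + (1 - p) * h_neg, h_test)
                 for p in candidates]
    return candidates[int(np.argmin(distances))]
```

HDx follows the same template, but builds the histograms directly from each input feature (so no classifier is needed) and aggregates the per-feature distances, e.g. by averaging, before searching for the minimizing prior.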

Datasets

In this paper several public real-world datasets have been used to assess the performance of the quantification methods, which have also been tested in a real computer-vision-based quality control application of boar sperm.

Experiments and results

In this section, the quantification techniques based on the Hellinger Distance (HDx and HDy) proposed in this work, as well as other previous approaches, are evaluated on the 15 datasets presented in Section 4.1.1, and then in the context of the real boar semen quality control application described in Section 4.1.2. Comparisons are carried out with the baseline approach Classify and Count (CC), the method Adjusted Count (AC) by Forman, based on the classifier confusion matrix, and the Median…

Conclusions and future work

In this work we have addressed the problem of automatically estimating the class distribution of an unlabeled dataset (also known as quantification).

Our proposal is based on the Hellinger distance, used to measure distributional divergences between a new dataset and validation sets with known class proportions and to find the most similar one (quantification method HDx). Alternatively, a classifier can be applied and its outputs for the data examples used instead (quantification method HDy).

These…

Acknowledgments

This work has been partially supported by the research projects DPI2009-08424 and TEC2011-22480 from the Spanish Ministry of Education and Science.

The authors thank CENTROTEC for providing the semen samples and for their collaboration in the acquisition of the sperm images.

References (44)

  • S. Gu, Y. Tan, X. He, Recentness biased learning for time series forecasting, Information Sciences, …
  • Y.S. Chan, H.T. Ng, Estimating class priors in domain adaptation for word sense disambiguation, in: ACL-44: Proc. of …
  • A. Guerrero-Curieses et al., Cost-sensitive and modular land-cover classification based on posterior probability estimates, International Journal of Remote Sensing (2009)
  • C. Drummond et al., Cost curves: an improved method for visualizing classifier performance, Machine Learning (2006)
  • S. Vucetic, Z. Obradovic, Classification on data with biased class distribution, in: Proceedings of the 12th European …
  • M. Saerens et al., Adjusting a classifier for new a priori probabilities: a simple procedure, Neural Computation (2002)
  • J.C. Xue, G.M. Weiss, Quantification and semi-supervised classification methods for handling changes in class …
  • R. Alaiz-Rodríguez, N. Japkowicz, Assessing the impact of changing environments on classifier performance, in: …
  • C. Seiffert, T.M. Khoshgoftaar, J.V. Hulse, A. Folleco, An empirical study of the classification performance of …
  • F. Provost et al., Robust classification systems for imprecise environments, Machine Learning (2001)
  • H. He et al., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering (2009)
  • R. Yanagimachi (1994)