Exact correlation between actual and estimated errors in discrete classification

https://doi.org/10.1016/j.patrec.2009.10.017

Abstract

Discrete classification problems are important in pattern recognition applications. The most commonly used discrete classification rule is the discrete histogram rule. In this letter we provide exact expressions for the correlation coefficient between the actual error and the resubstitution and leave-one-out cross-validation error estimators for the discrete histogram rule. We show with an example that correlations between actual and estimated errors are generally poor, and that leave-one-out cross-validation can in fact display negative correlation when sample sizes are small and classifier complexity is large. We observe that correlation decreases with increasing classifier complexity and that increasing the sample size does not necessarily increase the correlation. The exact expressions given here can be computed reasonably fast for given sample size, dimensionality, and model parameters, which is useful because, as also illustrated in this letter, Monte Carlo approximations of the correlation coefficient are generally poor, even with a large number of simulated data sets.

Introduction

A full probabilistic understanding of the relationship between an error estimator and the actual error of a sample-designed classifier rests with the joint distribution of the actual and estimated errors relative to the sampling distribution for the underlying feature-label distribution. While knowing the distribution of the error estimator is very important, it alone does not give a complete description of the interaction between the error estimator and the actual error. Of particular importance is the correlation between the actual and estimated errors, which we denote by $\varepsilon_n$ and $\hat{\varepsilon}_n$, respectively, where $n$ is the size of the sample. Since $\hat{\varepsilon}_n$ is used in place of $\varepsilon_n$ in classifier application, ideally we would like $\varepsilon_n$ and $\hat{\varepsilon}_n$ to be perfectly correlated. In fact, as investigated via simulations in (Hanczar et al., 2007), they are often very poorly correlated. The effect of this lack of correlation can be seen by considering the variance of the deviation $\hat{\varepsilon}_n - \varepsilon_n$, which is given by
$$\mathrm{Var}(\hat{\varepsilon}_n - \varepsilon_n) = \mathrm{Var}(\hat{\varepsilon}_n) + \mathrm{Var}(\varepsilon_n) - 2\,\rho(\hat{\varepsilon}_n, \varepsilon_n)\,\sqrt{\mathrm{Var}(\hat{\varepsilon}_n)\,\mathrm{Var}(\varepsilon_n)},$$
where $\rho$ is the correlation coefficient between $\varepsilon_n$ and $\hat{\varepsilon}_n$. A smaller correlation between the error estimator and the actual error leads to a larger deviation variance, and vice versa. The larger the deviation variance, the larger the root-mean-square (RMS) error between $\varepsilon_n$ and $\hat{\varepsilon}_n$. If the sample is very large, then the variances of $\varepsilon_n$ and $\hat{\varepsilon}_n$ tend to be small, so the deviation variance is small; when the sample is small, however, these variances tend to be large, so that strong correlation is needed to offset them. Thus, the correlation between the actual and estimated errors plays a vital role in assessing the goodness of an error estimator.
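As a simple numerical illustration, the sketch below (Python; the standard deviations, bias, and correlation values are purely hypothetical and do not come from this letter) evaluates the deviation variance above, together with the corresponding RMS, for a few values of the correlation coefficient.

    import numpy as np

    # Hypothetical small-sample figures (illustrative only):
    sd_est, sd_true = 0.08, 0.06   # standard deviations of estimated and actual errors
    bias = 0.02                    # E[deviation] = E[estimated error - actual error]

    for rho in (-0.2, 0.0, 0.5, 1.0):
        # Var(deviation) = Var(est) + Var(true) - 2 * rho * SD(est) * SD(true)
        var_dev = sd_est**2 + sd_true**2 - 2 * rho * sd_est * sd_true
        rms = np.sqrt(var_dev + bias**2)   # RMS^2 = Var(deviation) + bias^2
        print(f"rho = {rho:+.1f}   Var(deviation) = {var_dev:.4f}   RMS = {rms:.4f}")

For fixed marginal variances, moving from zero to strong positive correlation substantially reduces the deviation variance, which is why correlation matters most precisely when the marginal variances are large, i.e., for small samples.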

In this letter we provide an exact representation for the correlation coefficient between the actual error and the resubstitution and leave-one-out cross-validation error estimators for the discrete histogram rule, also called multinomial discrimination (Devroye et al., 1996), which is important in many practical applications, particularly in medicine, economics, psychology, and social science (Goldstein and Dillon, 1978). While other discrete classification rules of practical significance exist (e.g., see Asparoukhov and Krzanowski, 2001, Celeux and Mkhadri, 1992), the discrete histogram rule is simple enough to allow exact analytical study of its properties, while still illuminating issues related to classification in general. The classical references on classification error for the discrete histogram rule concern only the actual classification error or the bias of the apparent (resubstitution) error (Hills, 1966, Hills, 1967, Hughes, 1968, Hughes, 1969, Glick, 1973).

In (Braga-Neto and Dougherty, 2005), the authors found analytical expressions for the exact calculation of the bias, variance, and RMS not only of resubstitution, but also of the leave-one-out cross-validation error estimator, for the discrete histogram rule. The authors also described in (Braga-Neto and Dougherty, 2005) a complete enumeration method to compute the marginal and joint sampling distributions of resubstitution and leave-one-out cross-validation with respect to the actual classification error. Complete enumeration methods, which have been used extensively for discrete data analysis in statistics (Agresti, 1992, Verbeek, 1985, Klotz, 1966, Hirji et al., 1987), rely on intensive computation to list all possible configurations of the data together with their probabilities, and from these to derive exact statistical properties of the methods of interest. Efficient computer algorithms to implement the proposed complete enumeration methods were discussed in (Braga-Neto and Dougherty, 2005). In (Xu et al., 2006), these results were extended to the exact computation of confidence intervals and conditional bias.
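To give an informal flavor of the complete enumeration idea (this is not the algorithm of Braga-Neto and Dougherty, 2005), the sketch below enumerates every bin-count configuration for a toy model with fixed class sample sizes (unlike the letter's sampling model, in which the class sizes are themselves random) and computes exact moments, and hence the exact correlation, between the actual and resubstitution errors of the discrete histogram rule; leave-one-out is omitted for brevity.

    import numpy as np
    from math import comb, prod

    def compositions(n, b):
        # All ways to distribute n indistinguishable sample points over b ordered bins.
        if b == 1:
            yield (n,)
            return
        for k in range(n + 1):
            for rest in compositions(n - k, b - 1):
                yield (k,) + rest

    def multinomial_pmf(counts, probs):
        # Probability of an observed bin-count vector under a multinomial model.
        coef, rem = 1, sum(counts)
        for k in counts:
            coef *= comb(rem, k)
            rem -= k
        return coef * prod(pr ** k for pr, k in zip(probs, counts))

    def exact_moments(p, q, n0, n1, c0=0.5, c1=0.5):
        # Exact E[actual error], E[resubstitution error], and their correlation for the
        # discrete histogram rule (ties assigned to class 0), by complete enumeration
        # of all bin-count configurations (U, V) for fixed class sample sizes n0, n1.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        b = len(p)
        m_e = m_r = m_ee = m_rr = m_er = 0.0
        for U in compositions(n0, b):
            wU = multinomial_pmf(U, p)
            U = np.array(U)
            for V in compositions(n1, b):
                w = wU * multinomial_pmf(V, q)
                V = np.array(V)
                label1 = U < V                                   # bins assigned to class 1
                eps = np.sum(np.where(label1, c0 * p, c1 * q))   # actual error
                eps_r = (U[label1].sum() + V[~label1].sum()) / (n0 + n1)  # resubstitution
                m_e += w * eps; m_r += w * eps_r
                m_ee += w * eps ** 2; m_rr += w * eps_r ** 2; m_er += w * eps * eps_r
        rho = (m_er - m_e * m_r) / np.sqrt((m_ee - m_e ** 2) * (m_rr - m_r ** 2))
        return m_e, m_r, rho

    # Tiny example: b = 2 bins, 3 points per class (hypothetical model parameters).
    print(exact_moments(p=(0.7, 0.3), q=(0.2, 0.8), n0=3, n1=3))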

In (Braga-Neto and Dougherty, 2005), we did not consider the problem of computing the correlation between the resubstitution or leave-one-out cross-validation errors and the actual classification error. We do so in the present letter by providing exact expressions for the correlation coefficient, which are faster to compute than complete enumeration. Being exact, they also have an advantage over Monte Carlo approximations, which are quite inaccurate for the correlation coefficient. Not only will we see that the resubstitution and leave-one-out cross-validation error estimators are generally poorly correlated with the actual error, but also that leave-one-out cross-validation can display negative correlation when sample sizes are small and classifier complexity is large, exactly the situation in which strong correlation is needed to obtain useful estimates. In general, we will see that the correlation decreases with increasing classifier complexity and that increasing the sample size does not produce a corresponding increase in correlation between the actual and estimated errors.

Section snippets

Discrete classification

Let $X_1, X_2, \ldots, X_d$ be a set of quantized predictor random variables, such that each $X_i$ is quantized into a finite number $b_i$ of values, and let $Y$ be a target random variable taking values in $\{0, 1, \ldots, c-1\}$ (for simplicity, we assume $c = 2$). Since the predictors as a group take on values in a finite space of $b = \prod_{i=1}^{d} b_i$ possible states, and a bijection can be established between this finite state space and the sequence of integers $1, \ldots, b$, one can alternatively and equivalently assume, without loss of …
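Although the snippet above is truncated, the state-space bijection it refers to can be illustrated concretely. One natural choice is a mixed-radix encoding of the quantized feature vector, sketched below (0-based indexing here, whereas the text uses states $1, \ldots, b$; the helper name is hypothetical).

    from itertools import product

    def state_index(x, bins):
        # Map a quantized feature vector x, with x[i] in {0, ..., bins[i]-1}, to a single
        # state index in {0, ..., b-1}, where b = prod(bins): a mixed-radix encoding.
        idx = 0
        for xi, bi in zip(x, bins):
            idx = idx * bi + xi
        return idx

    bins = (2, 3, 2)    # d = 3 quantized predictors, so b = 2 * 3 * 2 = 12 states
    all_states = [state_index(x, bins) for x in product(*(range(bi) for bi in bins))]
    assert sorted(all_states) == list(range(12))   # the map is a bijection onto {0, ..., 11}
    print(state_index((1, 2, 0), bins))            # -> 10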

Correlation between actual and estimated errors

We provide in this section an exact representation for the correlation coefficients $\rho(\varepsilon_n, \hat{\varepsilon}_n^{\,r})$ and $\rho(\varepsilon_n, \hat{\varepsilon}_n^{\,l})$ in terms of the random variables $U_i$ and $V_i$, for $i = 1, \ldots, b$.

It follows from (4), after some algebraic manipulation, that the variance of the actual error can be written as
$$\mathrm{Var}(\varepsilon_n) = \sum_{i=1}^{b} (c_1 q_i - c_0 p_i)^2\, \mathrm{Var}(I_{U_i < V_i}) + 2 \sum_{i < j} (c_1 q_i - c_0 p_i)(c_1 q_j - c_0 p_j)\, \mathrm{Cov}(I_{U_i < V_i}, I_{U_j < V_j}),$$
where
$$\mathrm{Var}(I_{U_i < V_i}) = P(U_i < V_i)\,[1 - P(U_i < V_i)],$$
$$\mathrm{Cov}(I_{U_i < V_i}, I_{U_j < V_j}) = P(U_i < V_i, U_j < V_j) - P(U_i < V_i)\,P(U_j < V_j),$$
in which
$$P(U_i < V_i) = \sum_{k < l} P(U_i = k, V_i = l), \qquad P(U_i < V_i, U_j < V_j) = \sum_{k < l} \ldots$$
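Assuming the probabilities $P(U_i < V_i)$ and the joint probabilities $P(U_i < V_i, U_j < V_j)$ have already been computed from the multinomial model (the letter derives exact expressions for them; the snippet above is cut off before the joint-probability expression), the variance formula can be evaluated directly, as in the minimal sketch below (hypothetical function and argument names).

    import numpy as np

    def var_actual_error(p, q, P_lt, P_lt_joint, c0=0.5, c1=0.5):
        # Evaluate Var(actual error) from the expression above, where
        #   P_lt[i]         = P(U_i < V_i)
        #   P_lt_joint[i,j] = P(U_i < V_i, U_j < V_j), used for i < j.
        p, q, P_lt = np.asarray(p), np.asarray(q), np.asarray(P_lt)
        a = c1 * q - c0 * p                             # coefficients (c1*q_i - c0*p_i)
        total = np.sum(a ** 2 * P_lt * (1.0 - P_lt))    # indicator variance terms
        b = len(p)
        for i in range(b):
            for j in range(i + 1, b):
                cov_ij = P_lt_joint[i, j] - P_lt[i] * P_lt[j]   # indicator covariance
                total += 2.0 * a[i] * a[j] * cov_ij
        return total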

Examples

Fig. 1 displays plots of the exact correlation between the actual error and the resubstitution and leave-one-out cross-validation error estimators, obtained with the previous expressions. Correlation is plotted versus sample size, for different bin sizes and for probability models of varying difficulty, as determined by the optimal (Bayes) classification error, from easy (Bayes error = 10%) to difficult (Bayes error = 40%). The bin sizes are selected to correspond to the cases of 2, 3, 4, and 5 binary …
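For comparison, a Monte Carlo approximation of these correlations can be obtained by repeatedly simulating data sets from a known discrete model, as sketched below; the model parameters are hypothetical toy values, not those of Fig. 1, and the histogram rule is implemented with ties assigned to class 0, consistent with the indicator $I_{U_i < V_i}$ used above. Running the sketch with different seeds illustrates how noisy such approximations can be, even with many simulated data sets.

    import numpy as np

    rng = np.random.default_rng(0)

    def mc_correlations(p, q, c0=0.5, c1=0.5, n=20, n_datasets=10_000):
        # Monte Carlo approximation of the correlation between the actual error and the
        # resubstitution / leave-one-out estimates for the discrete histogram rule.
        # p and q are the class-conditional bin probabilities for classes 0 and 1.
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        b = len(p)
        eps, eps_r, eps_l = [], [], []
        for _ in range(n_datasets):
            y = rng.random(n) < c1                               # True -> class 1
            x = np.where(y, rng.choice(b, size=n, p=q),
                            rng.choice(b, size=n, p=p))          # bin of each sample point
            U = np.bincount(x[~y], minlength=b)                  # class-0 counts per bin
            V = np.bincount(x[y], minlength=b)                   # class-1 counts per bin
            label1 = U < V                                       # bins assigned to class 1
            eps.append(np.sum(np.where(label1, c0 * p, c1 * q)))        # actual error
            eps_r.append((U[label1].sum() + V[~label1].sum()) / n)      # resubstitution
            errs = 0                                             # leave-one-out count
            for i in range(n):
                if y[i]:
                    errs += U[x[i]] >= V[x[i]] - 1   # deleted class-1 point misclassified?
                else:
                    errs += U[x[i]] - 1 < V[x[i]]    # deleted class-0 point misclassified?
            eps_l.append(errs / n)
        eps, eps_r, eps_l = map(np.asarray, (eps, eps_r, eps_l))
        return np.corrcoef(eps, eps_r)[0, 1], np.corrcoef(eps, eps_l)[0, 1]

    # Toy model with b = 4 bins (hypothetical parameters).
    p = np.array([0.4, 0.3, 0.2, 0.1])
    q = np.array([0.1, 0.2, 0.3, 0.4])
    print(mc_correlations(p, q))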

Acknowledgements

This work was supported by the National Science Foundation, through NSF awards CCF-0845407 (Braga-Neto) and CCF-0634794 (Dougherty).


Cited by (22)

  • Soil mapping, classification, and pedologic modeling: History and future directions

    2016, Geoderma
    Citation Excerpt:

    Some works have used the cross validation method (leave-one-out) to validate their models, observing that the prediction accuracies were poor (Mueller et al., 2004; Hengl et al., 2014). Braga-Neto and Dougherty (2010) observed that leave-one-out cross-validation can produce poor correlations between actual and estimated errors when the sample sizes are small, or when the classifier complexity is large. The authors observed that the correlation decreases as the classifier complexity increases.

  • Moments and root-mean-square error of the Bayesian MMSE estimator of classification error in the Gaussian model

    2014, Pattern Recognition
    Citation Excerpt:

    Recent work has aimed at characterizing joint behavior. For multinomial discrimination, exact representations of the second-order moments, both marginal and mixed, for the true error and the resubstitution and leave-one-out estimators have been obtained [13]. For LDA, the exact joint distributions for both resubstitution and leave-one-out have been found in the univariate Gaussian model and approximations have been found in the multivariate model with a common known covariance matrix [14,15].

  • The reliability of estimated confidence intervals for classification error rates when only a single sample is available

    2013, Pattern Recognition
    Citation Excerpt:

    When samples are small, training and testing must be done on the same set of data and two troublesome issues arise with respect to the goodness of error estimation: first, training-data-based error estimates tend to suffer from inaccuracy unless the sample is large, which is not the case; and, second, quantification of error estimation accuracy is much more difficult. Indeed, only relatively recently have joint distributions of the true and estimated errors and RMS expressions been derived for training-data-based error estimators, these being, to date, for multinomial discrimination [5,6], linear discriminant analysis in a known-covariance Gaussian model [27–30], and the sample-conditioned RMS for the optimal minimum-mean-square-error estimator and classical counting estimators when the feature-label distribution possesses a prior distribution in a Bayesian framework, for both discrete classification and linear classification in the Gaussian model [7]. While there remains a paucity of analytic results, there have been a number of simulation-based studies that show large inaccuracy, and a lack of correlation, when using training-data-based error estimators for many distributional models and classification rules when samples are small [4,13–15,22,17,26].

  • Optimal mean-square-error calibration of classifier error estimators under Bayesian models

    2012, Pattern Recognition
    Citation Excerpt:

    Indeed, here we even see slightly negative regression. This is not an aberration; it has been theoretically shown that negative correlation can occur for a standard model [20]. In Table 3, we observe results that are similar to the synthetic data results.

  • Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic Gaussian model

    2012, Pattern Recognition
    Citation Excerpt:

    More recently, error estimation in high-throughput biological classification, where sample sizes are typically quite small, has motivated the desire for distributional knowledge concerning error estimators, both their full joint distribution and their second-order moments (and therefore the RMS and correlation). For multinomial discrimination, exact representations of the second-order moments, both marginal and mixed, for the true error and the resubstitution and leave-one-out estimators have been found [12,13]. For LDA in the Gaussian model with common known covariance matrix, for both resubstitution and leave-one-out, we have found the marginal distributions for the error estimators [14] and obtained the joint distributions between the true error and both error estimators [15].
