Exact correlation between actual and estimated errors in discrete classification
Introduction
A full probabilistic understanding of the relationship between an error estimator and the actual error of a sample-designed classifier rests with the joint distribution of the actual and estimated errors relative to the sampling distribution for the underlying feature-label distribution. While knowing the distribution of the error estimator is very important, it alone does not give a complete description of the interaction between the error estimator and the actual error. Of particular importance is the correlation between the actual and estimated errors, which we will denote by ε_n and ε̂_n, respectively, where n is the size of the sample. Since ε̂_n is used in place of ε_n in classifier application, ideally we would like ε_n and ε̂_n to be perfectly correlated. In fact, as investigated via simulations in (Hanczar et al., 2007), they are often very poorly correlated. The effect of this lack of correlation can be seen by considering the variance of the deviation ε̂_n − ε_n, which is given by Var(ε̂_n − ε_n) = Var(ε_n) + Var(ε̂_n) − 2ρ σ(ε_n) σ(ε̂_n), where ρ is the correlation coefficient for ε_n and ε̂_n, and σ(·) denotes standard deviation. A smaller correlation between error estimator and actual error leads to a larger variance for the deviation, and vice versa. The larger the deviation variance, the larger the root-mean-square (RMS) error between ε̂_n and ε_n. If the sample is very large, then the variances of ε_n and ε̂_n tend to be small, so that the deviation variance is small; when the sample is small, however, these variances tend to be large, so that strong correlation is needed to offset them. Thus, the correlation between the actual and estimated errors plays a vital role in assessing the goodness of an error estimator.
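The deviation-variance identity above can be checked numerically. The sketch below uses synthetic Gaussian surrogates for the actual and estimated errors (the paper's quantities are sample-dependent classification errors; the means, variances, and target correlation here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic surrogates for (actual error, estimated error) with correlation ~0.3.
n_trials = 200_000
rho = 0.3
sd_eps, sd_hat = 0.2, 0.1
cov = np.array([[sd_eps**2, rho * sd_eps * sd_hat],
                [rho * sd_eps * sd_hat, sd_hat**2]])
eps, eps_hat = rng.multivariate_normal([0.2, 0.2], cov, size=n_trials).T

# Identity: Var(eps_hat - eps) = Var(eps) + Var(eps_hat) - 2*rho*sd(eps)*sd(eps_hat).
lhs = np.var(eps_hat - eps)
r = np.corrcoef(eps, eps_hat)[0, 1]
rhs = np.var(eps) + np.var(eps_hat) - 2 * r * np.std(eps) * np.std(eps_hat)
assert abs(lhs - rhs) < 1e-10  # holds exactly for sample moments, up to rounding
```

Note that the identity is exact for sample moments (not just in expectation), since the sample correlation is the sample covariance normalized by the sample standard deviations; lowering r while keeping the marginal variances fixed directly inflates the deviation variance.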
In this letter we provide an exact representation for the correlation coefficient between the actual error and the resubstitution and leave-one-out cross-validation error estimators for the discrete histogram rule, also called multinomial discrimination (Devroye et al., 1996), which is important in many practical applications, particularly in medicine, economics, psychology, and social science (Goldstein and Dillon, 1978). While other discrete classification rules of practical significance exist (see, e.g., Asparoukhov and Krzanowski, 2001; Celeux and Mkhadri, 1992), the discrete histogram rule is simple enough to allow exact analytical study of its properties, while at the same time illuminating issues related to classification in general. The classical references on classification error for the discrete histogram rule concern only the actual classification error, or the bias of the apparent, or resubstitution, error (Hills, 1966, Hills, 1967, Hughes, 1968, Hughes, 1969, Glick, 1973).
In (Braga-Neto and Dougherty, 2005), the authors found analytical expressions for exact calculation of the bias, variance, and RMS not only of resubstitution, but also of the leave-one-out cross-validation error estimator, for the discrete histogram rule. The authors also described in (Braga-Neto and Dougherty, 2005) a complete enumeration method to compute the marginal and joint sampling distributions of resubstitution and leave-one-out cross-validation with respect to the actual classification error. Complete enumeration methods, which have been used extensively for discrete data analysis in statistics (Agresti, 1992, Verbeek, 1985, Klotz, 1966, Hirji et al., 1987), rely on intensive computational power to list all possible configurations of data and their probabilities, and from this to derive exact statistical properties of the methods of interest. Efficient computer algorithms to implement the proposed complete enumeration methods were discussed in (Braga-Neto and Dougherty, 2005). In (Xu et al., 2006), these results were extended to the exact computation of confidence intervals and conditional bias.
In (Braga-Neto and Dougherty, 2005), we did not consider the problem of computing the correlation between the resubstitution or leave-one-out cross-validation errors and the actual classification error. We do so in the present letter, by providing exact expressions for the correlation coefficient, which are faster to compute than complete enumeration. Being exact, they also hold an advantage over Monte Carlo approximations, which are quite inaccurate for the computation of the correlation coefficient. Not only are the resubstitution and leave-one-out cross-validation error estimators generally poorly correlated with the actual error, but it is even possible for leave-one-out cross-validation to display negative correlation when sample sizes are small and classifier complexity is large, exactly the situation in which strong correlation is needed to obtain useful estimates. In general, we will see that the correlation decreases with increasing classifier complexity, and that increasing sample size does not produce a corresponding increase in correlation between the actual and estimated errors.
Section snippets
Discrete classification
Let X_1, …, X_d be a set of quantized predictor random variables, such that each X_i is quantized into a finite number of values, and let Y be a target random variable taking values in {0, 1} (for simplicity, we assume the two-class case). Since the predictors as a group take on values in a finite space of b possible states, and a bijection can be established between this finite state-space and the sequence of integers 1, 2, …, b, one can alternatively and equivalently assume, without loss of
Correlation between actual and estimated errors
We provide in this section an exact representation for the correlation coefficients between the actual error ε_n and the resubstitution and leave-one-out error estimators, in terms of the bin-count random variables introduced in the previous section.
It follows from (4), after some algebraic manipulation, that the variance of the actual error can be written in closed form. [The expressions and their intermediate quantities are not reproduced in this excerpt.]
Examples
Fig. 1 displays plots of the exact correlation between the actual error and the resubstitution and leave-one-out cross-validation error estimators, obtained with the previous expressions. Correlation is plotted versus sample size, for different bin sizes and probability models of distinct difficulty, as determined by the optimal (Bayes) classification error, from easy (Bayes error = 10%) to difficult (Bayes error = 40%). The bin sizes are selected to correspond to the cases of 2, 3, 4, and 5 binary
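For readers without access to the exact expressions, the qualitative behavior in such plots can be approximated by Monte Carlo simulation (which, as noted in the introduction, is less accurate than the exact formulas). The sketch below is ours, not the paper's: the class-conditional bin probabilities `p`, `q`, the prior `c0`, and the tie-breaking rule (ties assigned to class 0) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_correlation(n, b, n_mc=5000):
    """Monte Carlo approximation of the correlation between the actual error and
    the resubstitution error of the discrete histogram rule, for sample size n
    and b bins, under an illustrative fixed probability model."""
    p = np.linspace(1.0, 2.0, b); p /= p.sum()   # class-0 bin probabilities
    q = np.linspace(2.0, 1.0, b); q /= q.sum()   # class-1 bin probabilities
    c0 = 0.5                                     # prior probability of class 0
    eps, eps_hat = np.empty(n_mc), np.empty(n_mc)
    for t in range(n_mc):
        n0 = rng.binomial(n, c0)                 # random class-0 sample size
        u = rng.multinomial(n0, p)               # class-0 counts per bin
        v = rng.multinomial(n - n0, q)           # class-1 counts per bin
        decide0 = u >= v                         # histogram rule; ties -> class 0
        # actual error: mass of the opposite class in each decided bin
        eps[t] = np.sum(np.where(decide0, (1 - c0) * q, c0 * p))
        # resubstitution error: fraction of training points misclassified
        eps_hat[t] = (v[decide0].sum() + u[~decide0].sum()) / n
    return np.corrcoef(eps, eps_hat)[0, 1]

rho_small = simulate_correlation(n=20, b=8)  # typically weak in small samples
```

Sweeping `n` and `b` in such a simulation reproduces the qualitative trends reported in the text: correlation degrades as the number of bins (classifier complexity) grows relative to the sample size.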
Acknowledgements
This work was supported by the National Science Foundation, through NSF awards CCF-0845407 (Braga-Neto) and CCF-0634794 (Dougherty).
References (19)
- Asparoukhov, O., Krzanowski, W.J., 2001. A comparison of discriminant procedures for binary variables. Comput. Stat. Data Anal.
- Braga-Neto, U., Dougherty, E.R., 2005. Exact performance of error estimators for discrete classifiers. Pattern Recognition.
- Verbeek, A., 1985. A survey of algorithms for exact distributions of test statistics in r×c contingency tables with fixed margins. Comput. Stat. Data Anal.
- Agresti, A., 1992. A survey of exact inference for contingency tables. Stat. Sci.
- Braga-Neto, U., Dougherty, E.R., 2004. Is cross-validation valid for microarray classification? Bioinformatics.
- Celeux, G., Mkhadri, A., 1992. Discrete regularized discriminant analysis. Stat. Comput.
- Devroye, L., Györfi, L., Lugosi, G., 1996. A Probabilistic Theory of Pattern Recognition.
- Glick, N., 1973. Sample-based multinomial classification. Biometrics.
- Goldstein, M., Dillon, W.R., 1978. Discrete Discriminant Analysis.
Cited by (22)
Soil mapping, classification, and pedologic modeling: History and future directions
2016, Geoderma. Citation excerpt: Some works have used the cross-validation method (leave-one-out) to validate their models, observing that the prediction accuracies were poor (Mueller et al., 2004; Hengl et al., 2014). Braga-Neto and Dougherty (2010) observed that leave-one-out cross-validation can produce poor correlations between actual and estimated errors when the sample sizes are small, or when the classifier complexity is large. The authors observed that the correlation decreases as the classifier complexity increases.
Discrete optimal Bayesian classification with error-conditioned sequential sampling
2015, Pattern Recognition

Moments and root-mean-square error of the Bayesian MMSE estimator of classification error in the Gaussian model
2014, Pattern Recognition. Citation excerpt: Recent work has aimed at characterizing joint behavior. For multinomial discrimination, exact representations of the second-order moments, both marginal and mixed, for the true error and the resubstitution and leave-one-out estimators have been obtained [13]. For LDA, the exact joint distributions for both resubstitution and leave-one-out have been found in the univariate Gaussian model, and approximations have been found in the multivariate model with a common known covariance matrix [14,15].
The reliability of estimated confidence intervals for classification error rates when only a single sample is available
2013, Pattern Recognition. Citation excerpt: When samples are small, training and testing must be done on the same set of data, and two troublesome issues arise with respect to the goodness of error estimation: first, training-data-based error estimates tend to suffer from inaccuracy unless the sample is large, which is not the case; and, second, quantification of error-estimation accuracy is much more difficult. Indeed, only relatively recently have joint distributions of the true and estimated errors and RMS expressions been derived for training-data-based error estimators, these being, to date, for multinomial discrimination [5,6], linear discriminant analysis in a known-covariance Gaussian model [27–30], and the sample-conditioned RMS for the optimal minimum-mean-square-error estimator and classical counting estimators when the feature-label distribution possesses a prior distribution in a Bayesian framework, for both discrete classification and linear classification in the Gaussian model [7]. While there remains a paucity of analytic results, there have been a number of simulation-based studies that show large inaccuracy, and a lack of correlation, when using training-data-based error estimators for many distributional models and classification rules when samples are small [4,13–15,22,17,26].
Optimal mean-square-error calibration of classifier error estimators under Bayesian models
2012, Pattern Recognition. Citation excerpt: Indeed, here we even see slightly negative regression. This is not an aberration; it has been theoretically shown that negative correlation can occur for a standard model [20]. In Table 3, we observe results that are similar to the synthetic-data results.
Exact representation of the second-order moments for resubstitution and leave-one-out error estimation for linear discriminant analysis in the univariate heteroskedastic Gaussian model
2012, Pattern Recognition. Citation excerpt: More recently, error estimation in high-throughput biological classification, where sample sizes are typically quite small, has motivated the desire for distributional knowledge concerning error estimators, both their full joint distribution and their second-order moments (and therefore the RMS and correlation). For multinomial discrimination, exact representations of the second-order moments, both marginal and mixed, for the true error and the resubstitution and leave-one-out estimators have been found [12,13]. For LDA in the Gaussian model with common known covariance matrix, for both resubstitution and leave-one-out, we have found the marginal distributions for the error estimators [14] and obtained the joint distributions between the true error and both error estimators [15].