Optimal classifiers with minimum expected error within a Bayesian framework — Part II: Properties and performance analysis
Highlights
► Recent work uses a Bayesian modeling framework to optimize and analyze classifier error estimates.
► Here we use the same Bayesian framework to also optimize classifier design.
► This work thus completes a Bayesian theory of classification based on optimizing performance.
► Here, in Part II, we explore invariance to invertible maps, consistency and special cases.
► We also compare to Bayesian robust classifiers and test robustness to false modeling assumptions.
Introduction
In the first part of this two-part study [1], we defined an optimal Bayesian classifier to be a classifier that minimizes the probability of misclassifying a future point relative to the assumed model conditioned on the observed sample, or, equivalently, minimizes the Bayesian error estimate. The problem of optimal Bayesian classification over an uncertainty class of feature-label distributions arises naturally from two related sources: the need for accurate classification and the need for accurate error estimation. With small samples, the latter is possible only by applying prior knowledge in conjunction with the sample data. Given prior knowledge, it behooves us to find an optimal error estimator and classifier relative to that knowledge. Having found optimal Bayesian error estimators in [2], [3], derived analytic representations of the MSE of these error estimators in [4], [5], and obtained expressions for optimal Bayesian classifiers in terms of the effective class-conditional densities in [1], here, in Part II, we examine basic properties of optimal Bayesian classifiers.
We study invariance to invertible transformations in discrete and continuous models, convergence to the Bayes classifier, and a connection to robust classification. The latter is a classical filtering problem [6], [7], where in the context of classification one wishes to find an optimal classifier over a parameterized uncertainty class of feature-label distributions absent new data [8]. Heretofore, the robust classification problem had been solved only suboptimally; the optimal robust classifier now falls out of the theory of optimal Bayesian classification. We also explicitly derive optimal Bayesian classifiers using non-informative priors and, under Gaussian modeling assumptions, compare them to plug-in classification rules, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), which are optimal in fixed Gaussian models with a common covariance matrix and with different covariance matrices, respectively. Finally, we present several simulations addressing the robustness of optimal Bayesian classifiers to false modeling assumptions. Some robustness to incorrect modeling assumptions is always important in practice because statistical techniques for model checking, such as hypothesis tests, can at best fail to reject the assumed model; even very small p-values never confirm it.
For the sake of completeness, we begin by stating some key definitions and propositions from Part I [1]. An optimal Bayesian classifier is any classifier, $\psi_{\mathrm{OBC}}$, satisfying
$$E_{\theta}\left[\varepsilon(\psi_{\mathrm{OBC}},\theta)\mid S_n\right] \le E_{\theta}\left[\varepsilon(\psi,\theta)\mid S_n\right] \quad (1)$$
for all $\psi \in \mathcal{C}$, where $\varepsilon(\psi,\theta)$ is the true error of classifier $\psi$ under a feature-label distribution parameterized by $\theta$ and $\mathcal{C}$ is an arbitrary family of classifiers. In (1), the expectations are taken relative to a posterior distribution, $\pi^*(\theta)$, on the parameters that is updated from a prior, $\pi(\theta)$, after observing a sample, $S_n$, of size $n$. An optimal Bayesian classifier minimizes the Bayesian error estimate, $\hat{\varepsilon}(\psi, S_n) = E_{\theta}[\varepsilon(\psi,\theta)\mid S_n]$. For a binary classification problem, the Bayesian framework defines $\theta = (c, \theta_0, \theta_1)$, where $c$ is the a priori probability that a future point comes from class 0 and $\theta_0$ and $\theta_1$ parameterize the class-0 and class-1 conditional distributions, respectively. For a fixed class, $y \in \{0, 1\}$, we let $f_{\theta_y}(x\mid y)$ be the class-conditional density parameterized by $\theta_y$ and denote the marginal posterior of $\theta_y$ by $\pi^*(\theta_y)$. If $E_{\pi^*}[c] = 0$, then the optimal Bayesian classifier is a constant and always assigns class 1; if $E_{\pi^*}[c] = 1$, then it always assigns class 0. Hence, we typically assume that $0 < E_{\pi^*}[c] < 1$. Two important theorems from Part I follow.

Theorem 1 (Evaluating Bayesian error estimators). Let $\psi$ be a fixed classifier given by $\psi(x) = 0$ if $x \in R_0$ and $\psi(x) = 1$ if $x \in R_1$, where measurable sets $R_0$ and $R_1$ partition the sample space. Then
$$\hat{\varepsilon}(\psi, S_n) = \int \left( E_{\pi^*}[c]\, f_{\Theta}(x\mid 0)\, \mathbf{1}_{x \in R_1} + E_{\pi^*}[1-c]\, f_{\Theta}(x\mid 1)\, \mathbf{1}_{x \in R_0} \right) dx,$$
where $\mathbf{1}_E$ is an indicator function equal to one if $E$ is true and zero otherwise, and
$$f_{\Theta}(x\mid y) = \int f_{\theta_y}(x\mid y)\, \pi^*(\theta_y)\, d\theta_y$$
is known as the effective class-conditional density.

Theorem 2 (Optimal Bayesian classification). An optimal Bayesian classifier, $\psi_{\mathrm{OBC}}$, satisfying (1) for all $\psi \in \mathcal{C}$, the set of all classifiers with measurable decision regions, exists and is given pointwise by
$$\psi_{\mathrm{OBC}}(x) = \begin{cases} 0 & \text{if } E_{\pi^*}[c]\, f_{\Theta}(x\mid 0) \ge E_{\pi^*}[1-c]\, f_{\Theta}(x\mid 1), \\ 1 & \text{otherwise.} \end{cases}$$
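As a concrete illustration of Theorems 1 and 2, consider a discrete model in which each class-conditional distribution is a pmf over b bins with a Dirichlet prior; the effective class-conditional density is then the Dirichlet posterior-predictive pmf. The following sketch (the bin counts and hyperparameters are illustrative, and c is taken as known) computes the OBC decision bin by bin:

```python
import numpy as np

def effective_density(counts, alpha):
    """Posterior-predictive (effective) class-conditional pmf for a discrete
    model with a Dirichlet prior: (U_i + alpha_i) / (n + sum(alpha))."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

def obc_discrete(counts0, counts1, alpha0, alpha1, c):
    """OBC over the bins: label 0 where c*f(x|0) >= (1-c)*f(x|1), else 1."""
    f0 = effective_density(counts0, alpha0)
    f1 = effective_density(counts1, alpha1)
    return np.where(c * f0 >= (1 - c) * f1, 0, 1)

# Example: 4-bin feature, uniform Dirichlet(1,...,1) priors, c = 0.5
labels = obc_discrete([6, 2, 1, 1], [1, 1, 3, 5], [1] * 4, [1] * 4, 0.5)
print(labels)  # -> [0 0 1 1]
```

Note that no classifier family is imposed in advance: the OBC simply compares the two weighted effective densities pointwise, exactly as in Theorem 2.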
Transformations of the feature space
Consider an invertible transformation, $t$, mapping from some original feature space, $\mathcal{X}$, to a new space, $\mathcal{X}' = t(\mathcal{X})$ (in the continuous case we also assume that the inverse map $t^{-1}$ is continuously differentiable). The following theorem shows that the optimal Bayesian classifier in the transformed space can be found by transforming the optimal Bayesian classifier in the original feature space pointwise, and that both classifiers have the same expected true error.
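In a discrete model, the invertible maps of the feature space are the permutations of the bins, which makes the invariance property easy to check numerically. A minimal sketch (all counts and hyperparameters illustrative):

```python
import numpy as np

def obc(counts0, counts1, alpha, c=0.5):
    # OBC decisions per bin from Dirichlet posterior-predictive densities
    f0 = (counts0 + alpha) / (counts0.sum() + alpha.sum())
    f1 = (counts1 + alpha) / (counts1.sum() + alpha.sum())
    return np.where(c * f0 >= (1 - c) * f1, 0, 1)

rng = np.random.default_rng(0)
b = 8
c0 = rng.integers(0, 10, b).astype(float)   # illustrative bin counts, class 0
c1 = rng.integers(0, 10, b).astype(float)   # illustrative bin counts, class 1
alpha = np.ones(b)                          # uniform Dirichlet hyperparameters
perm = rng.permutation(b)                   # an invertible map t of the bins

orig = obc(c0, c1, alpha)                           # OBC in the original space
transformed = obc(c0[perm], c1[perm], alpha[perm])  # OBC in the transformed space

# The transformed-space OBC is the original OBC composed with t:
print(np.array_equal(transformed, orig[perm]))  # -> True
```

The check holds for every permutation, not just this one, because relabeling bins relabels the counts, the hyperparameters, and hence the effective densities in lockstep.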
The advantages of this fundamental property
Convergence to the Bayes classifier
A key property of a classification rule is consistency: does the classifier converge to a Bayes classifier as $n \to \infty$? In contrast to the Bayesian modeling framework, the analysis in this section uses frequentist asymptotics, which concern behavior with respect to a fixed parameter and its sampling distribution. In particular, the next theorem shows that consistency holds for optimal Bayesian classification under mild conditions, as long as the true distribution is contained in the parameterized family.
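The following small simulation (our own illustration, not from the paper) exhibits this behavior in a discrete model whose true pmfs lie inside the assumed Dirichlet family: as $n$ grows, the OBC decisions settle onto the Bayes classifier on every bin.

```python
import numpy as np

rng = np.random.default_rng(1)
p0 = np.array([0.4, 0.3, 0.2, 0.1])   # true class-conditional pmfs (fixed parameter)
p1 = np.array([0.1, 0.2, 0.3, 0.4])
c, b = 0.5, 4
bayes = np.where(c * p0 >= (1 - c) * p1, 0, 1)   # Bayes classifier: [0, 0, 1, 1]

for n in (10, 100, 10000):
    u0 = rng.multinomial(n, p0)       # class-conditional samples of size n
    u1 = rng.multinomial(n, p1)
    f0 = (u0 + 1) / (n + b)           # effective densities, Dirichlet(1,...,1) prior
    f1 = (u1 + 1) / (n + b)
    obc = np.where(c * f0 >= (1 - c) * f1, 0, 1)
    print(n, np.array_equal(obc, bayes))
```

Because the counts satisfy $u_i/n \to p_i$, the effective densities converge to the true pmfs and the OBC decisions to the Bayes decisions, in line with the theorem.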
Optimal Bayesian classifiers for Gaussian models with non-informative priors
In this section, we compare optimal Bayesian classifiers using non-informative priors with plug-in classifiers under Gaussian modeling assumptions, including quadratic discriminant analysis (QDA), linear discriminant analysis (LDA) and nearest mean classification (NMC). Our focus is on the close relationships between optimal Bayesian classifiers and their plug-in counterparts in terms of analytic formulation, approximation, and convergence as $n \to \infty$.
With mean , covariance , and c known (see
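To make the comparison concrete, consider one simple special case (an illustrative setup, not the paper's general derivation): a covariance $S$ known and common to both classes, with a flat non-informative prior on each mean. The effective class-conditional density for class $y$ is then the posterior predictive $N(\bar{x}_y, (1 + 1/n_y)S)$, a standard Gaussian-Bayes result, so the OBC is a plug-in rule with inflated covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n0, n1, c = 2, 10, 10, 0.5
S = np.eye(d)                                    # known common covariance
X0 = rng.multivariate_normal([0, 0], S, n0)      # illustrative training samples
X1 = rng.multivariate_normal([1, 1], S, n1)
xb0, xb1 = X0.mean(axis=0), X1.mean(axis=0)

def gauss_pdf(x, m, cov):
    diff = x - m
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** len(m) * np.linalg.det(cov))

def obc_label(x):
    # effective densities with the (1 + 1/n_y) covariance inflation
    g0 = c * gauss_pdf(x, xb0, (1 + 1 / n0) * S)
    g1 = (1 - c) * gauss_pdf(x, xb1, (1 + 1 / n1) * S)
    return 0 if g0 >= g1 else 1

def lda_label(x):
    # plug-in LDA with known covariance S = I reduces to nearest sample mean
    return 0 if np.sum((x - xb0) ** 2) <= np.sum((x - xb1) ** 2) else 1

# With n0 = n1 and c = 1/2 the inflation factors match, so OBC and LDA agree:
x = np.array([0.4, 0.6])
print(obc_label(x), lda_label(x))
```

When $n_0 \ne n_1$ the inflation factors differ between classes, so the OBC boundary acquires a mild QDA-like quadratic correction; it vanishes as both sample sizes grow, consistent with the convergence to plug-in rules discussed above.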
Relationship to optimal Bayesian robust classifiers
The optimal Bayesian classifier has robust modeling assumptions in the sense that it is not optimal for a specific assumed feature-label distribution, in which case no data is required and the optimal classifier is the Bayes classifier for the given feature-label distribution; rather, for optimal Bayesian classification the actual feature-label distribution is assumed to belong to an uncertainty class governed by a prior distribution and the optimal Bayesian classifier minimizes the expected
Robustness of optimal Bayesian classifiers to false modeling assumptions
Optimal Bayesian classification is equivalent to Bayesian robust classification carried out relative to the posterior rather than the prior. In this sense, optimal Bayesian classification is “robust” when operating within the assumed model. We next consider the important issue of robustness to false modeling assumptions, with emphasis on incorrect priors with varying degrees of information.
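A small simulation in the spirit of the paper's experiments (the specific pmfs and the deliberately miscentered Dirichlet prior are our own illustration) shows both the cost of a bad prior at small $n$ and its vanishing influence as $n$ grows. True errors are computed exactly from the known pmfs:

```python
import numpy as np

rng = np.random.default_rng(3)
p0 = np.array([0.4, 0.3, 0.2, 0.1])   # true class-conditional pmfs
p1 = p0[::-1]
c = 0.5
bad_alpha0 = 50 * p1                  # informative prior shaped like the WRONG class
bad_alpha1 = 50 * p0

def true_error(labels):
    # exact error of a per-bin labeling under the known distributions
    return c * p0[labels == 1].sum() + (1 - c) * p1[labels == 0].sum()

for n in (5, 50, 5000):
    u0, u1 = rng.multinomial(n, p0), rng.multinomial(n, p1)
    f0 = (u0 + bad_alpha0) / (n + 50)           # effective densities, bad prior
    f1 = (u1 + bad_alpha1) / (n + 50)
    obc = np.where(c * f0 >= (1 - c) * f1, 0, 1)
    hist = np.where(u0 >= u1, 0, 1)             # plug-in histogram rule
    print(n, round(true_error(obc), 3), round(true_error(hist), 3))
```

At $n = 5$ the miscentered prior dominates the counts and the OBC error is high; by $n = 5000$ the data have washed the prior out and both rules recover the Bayes decisions [0, 0, 1, 1].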
Conclusion
This work ties Bayesian classifier design and Bayesian error estimation together with the old problem of optimal robust filtering. As with Wiener filtering, we first find representations for some error measure (e.g., expected error or MSE) and then find optimizing parameters. Optimal Bayesian classification has a connection with Bayesian robust classification, with the distinction that it permits optimization over an arbitrary space of classifiers and utilizes a posterior distribution of the
References (18)
- et al., "Robust Wiener filters," Journal of the Franklin Institute, 1977.
- et al., "Optimal robust classifiers," Pattern Recognition, 2005.
- "Unbounded mappings and weak convergence of measures," Statistics & Probability Letters, 2008.
- et al., "Robust optimal granulometric bandpass filters," Signal Processing, 2001.
- et al., "Bayesian robust optimal linear filters," Signal Processing, 2001.
- L.A. Dalton, E.R. Dougherty, "Optimal classifiers with minimum expected error within a Bayesian framework—Part I: ..."
- L.A. Dalton, E.R. Dougherty, "Bayesian minimum mean-square error estimation for classification error—Part I: Definition and the Bayesian MMSE error estimator for discrete classification," IEEE Transactions on Signal Processing, 2011.
- L.A. Dalton, E.R. Dougherty, "Bayesian minimum mean-square error estimation for classification error—Part II: The Bayesian MMSE error estimator for linear classification of Gaussian distributions," IEEE Transactions on Signal Processing, 2011.
- L.A. Dalton, E.R. Dougherty, "Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error—Part I: Representation," IEEE Transactions on Signal Processing, 2012.
Lori A. Dalton received the B.Sc., M.Sc. and Ph.D. degrees in electrical engineering at Texas A&M University, College Station, in 2001, 2002, and 2012, respectively. She is currently an Assistant Professor of Electrical and Computer Engineering and an Assistant Professor of Biomedical Informatics at The Ohio State University in Columbus, OH. Dr. Dalton was awarded an NSF Graduate Research Fellowship in 2001, and she was awarded the Association of Former Students Distinguished Graduate Student Masters Research Award in 2003. Her current research interests include genomic signal processing, pattern recognition, estimation, optimization, robust filtering, information theory and systems biology.
Edward R. Dougherty received the Ph.D. degree in mathematics from Rutgers University, New Brunswick, NJ, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology, Finland. He is a Professor in the Department of Electrical and Computer Engineering, Texas A&M University, College Station, where he holds the Robert M. Kennedy 26 Chair in Electrical Engineering and is Director of the Genomic Signal Processing Laboratory. He is also co-Director of the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ. Dr. Dougherty is a Fellow of SPIE and has received the SPIE Presidents Award.