
Pattern Recognition

Volume 46, Issue 5, May 2013, Pages 1288-1300

Optimal classifiers with minimum expected error within a Bayesian framework — Part II: Properties and performance analysis

https://doi.org/10.1016/j.patcog.2012.10.019

Abstract

In part I of this two-part study, we introduced a new optimal Bayesian classification methodology that utilizes the same modeling framework proposed in Bayesian minimum-mean-square error (MMSE) error estimation. Optimal Bayesian classification thus completes a Bayesian theory of classification, where both the classifier error and our estimate of the error may be simultaneously optimized and studied probabilistically within the assumed model. Having developed optimal Bayesian classifiers in discrete and Gaussian models in part I, here we explore properties of optimal Bayesian classifiers, in particular, invariance to invertible transformations, convergence to the Bayes classifier, and a connection to Bayesian robust classifiers. We also explicitly derive optimal Bayesian classifiers with non-informative priors, and explore relationships to linear and quadratic discriminant analysis (LDA and QDA), which may be viewed as plug-in rules under Gaussian modeling assumptions. Finally, we present several simulations addressing the robustness of optimal Bayesian classifiers to false modeling assumptions. Companion website: http://gsp.tamu.edu/Publications/supplementary/dalton12a.

Highlights

  • Recent work uses a Bayesian modeling framework to optimize and analyze classifier error estimates.
  • Here we use the same Bayesian framework to also optimize classifier design.
  • This work thus completes a Bayesian theory of classification based on optimizing performance.
  • Here, in Part II, we explore invariance to invertible maps, consistency and special cases.
  • We also compare to Bayesian robust classifiers and test robustness to false modeling assumptions.

Introduction

In the first part of this two-part study [1], we defined an optimal Bayesian classifier to be a classifier that minimizes the probability of misclassifying a future point relative to the assumed model conditioned on the observed sample, or equivalently minimizes the Bayesian error estimate. The problem of optimal Bayesian classification over an uncertainty class of feature-label distributions arises naturally from two related sources: the need for accurate classification and the need for accurate error estimation. With small samples, the latter is only possible with application of prior knowledge in conjunction with the sample data. Given prior knowledge, it behooves us to find an optimal error estimator and classifier relative to the prior knowledge. Having found optimal Bayesian error estimators in [2], [3], found analytic representation of the MSE of these error estimates in [4], [5], and found expressions for optimal Bayesian classifiers in terms of the effective class-conditional densities in [1], here, in part II we examine basic properties of optimal Bayesian classifiers.

We study invariance to invertible transformations in discrete and continuous models, convergence to the Bayes classifier, and a connection to robust classification. The latter is a classical filtering problem [6], [7], where in the context of classification one wishes to find an optimal classifier over a parameterized uncertainty class of feature-label distributions absent new data [8]. Heretofore, the robust classification problem had only been solved in a suboptimal manner; the optimal robust classifier now falls out of the theory of optimal Bayesian classification. We also explicitly derive optimal Bayesian classifiers using non-informative priors and, under Gaussian modeling assumptions, compare these to plug-in classification rules, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), which are optimal in fixed Gaussian models with a common covariance matrix and with different covariance matrices, respectively. Finally, we present several simulations addressing the robustness of optimal Bayesian classifiers to false modeling assumptions. Some robustness to incorrect modeling assumptions is always important in practice because, even if one utilizes statistical techniques, such as hypothesis tests, for model checking, these can at best, even for very large p-values, lead to not rejecting the assumed model; they cannot validate it.

For the sake of completeness, we begin by stating some key definitions and propositions from Part I [1]. An optimal Bayesian classifier is any classifier, $\psi_{\mathrm{OBC}}$, satisfying
$$E_{\pi^*}[\varepsilon(\theta,\psi_{\mathrm{OBC}})] \le E_{\pi^*}[\varepsilon(\theta,\psi)] \quad (1)$$
for all $\psi \in \mathcal{C}$, where $\varepsilon(\theta,\psi)$ is the true error of classifier $\psi$ under a feature-label distribution parameterized by $\theta \in \Theta$ and $\mathcal{C}$ is an arbitrary family of classifiers. In (1), the expectations are taken relative to a posterior distribution, $\pi^*(\theta)$, on the parameters, which is updated from a prior, $\pi(\theta)$, after observing a sample, $S_n$, of size $n$. An optimal Bayesian classifier minimizes the Bayesian error estimate, $\hat{\varepsilon}(\psi,S_n) = E_{\pi^*}[\varepsilon(\theta,\psi)]$. For a binary classification problem, the Bayesian framework defines $\theta = [c, \theta_0, \theta_1]$, where $c$ is the a priori probability that a future point comes from class 0 and $\theta_0$ and $\theta_1$ parameterize the class-0 and class-1 conditional distributions, respectively. For a fixed class, $y \in \{0,1\}$, we let $f_{\theta_y}(x|y)$ be the class-conditional density parameterized by $\theta_y$ and denote the marginal posterior of $\theta_y$ by $\pi^*(\theta_y)$. If $E_{\pi^*}[c] = 0$, then the optimal Bayesian classifier is a constant and always assigns class 1; if $E_{\pi^*}[c] = 1$, then it always assigns class 0. Hence, we typically assume that $0 < E_{\pi^*}[c] < 1$. Two important theorems from Part I follow.
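Before stating them, note that the Bayesian error estimate $\hat{\varepsilon}(\psi,S_n) = E_{\pi^*}[\varepsilon(\theta,\psi)]$ can always be approximated by sampling parameters from the posterior. The following minimal Python sketch assumes hypothetical user-supplied hooks for posterior sampling and for the true error under a drawn parameter; it is an illustration, not code from the paper.

```python
import numpy as np

def bayesian_error_estimate_mc(classifier, sample_posterior, true_error,
                               n_draws=5000, seed=0):
    """Monte Carlo approximation of eps_hat(psi, S_n) = E_pi*[eps(theta, psi)]:
    draw theta from the posterior pi* and average the true error of the fixed
    classifier under each drawn feature-label distribution. The callables
    sample_posterior(rng) and true_error(theta, classifier) are hypothetical
    user-supplied hooks, not functions defined in the paper."""
    rng = np.random.default_rng(seed)
    draws = [true_error(sample_posterior(rng), classifier) for _ in range(n_draws)]
    return float(np.mean(draws))
```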

Theorem 1 Evaluating Bayesian error estimators

Let $\psi$ be a fixed classifier given by $\psi(x) = 0$ if $x \in R_0$ and $\psi(x) = 1$ if $x \in R_1$, where measurable sets $R_0$ and $R_1$ partition the sample space. Then
$$\hat{\varepsilon}(\psi, S_n) = E_{\pi^*}[c] \int I_{x \in R_1}\, f(x|0)\,dx + (1 - E_{\pi^*}[c]) \int I_{x \in R_0}\, f(x|1)\,dx,$$
where $I_E$ is an indicator function equal to one if $E$ is true and zero otherwise, and
$$f(x|y) = \int_{\Theta_y} f_{\theta_y}(x|y)\, \pi^*(\theta_y)\, d\theta_y$$
is known as the effective class-conditional density.
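As an illustration of Theorem 1, the sketch below evaluates the effective class-conditional density and the Bayesian error estimate in a hypothetical one-dimensional Gaussian setting with known unit variance and a Gaussian posterior on each unknown mean; the parameter values are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Hypothetical 1-D setup: class-conditional density N(theta_y, 1) with known
# variance, and a Gaussian posterior N(m_y, s2_y) on the unknown mean theta_y.
post = {0: (-1.0, 0.2), 1: (1.0, 0.3)}   # (m_y, s2_y), illustrative values
Ec = 0.5                                  # E_pi*[c]

def effective_density(x, y):
    """f(x|y) = integral of f_{theta_y}(x|y) * pi*(theta_y) d theta_y (Theorem 1).
    In this conjugate case the integral also has the closed form
    norm.pdf(x, loc=m_y, scale=sqrt(1 + s2_y)); here we integrate numerically."""
    m, s2 = post[y]
    integrand = lambda t: norm.pdf(x, loc=t) * norm.pdf(t, loc=m, scale=np.sqrt(s2))
    return quad(integrand, m - 10.0, m + 10.0)[0]

def bayesian_error_estimate(threshold):
    """Theorem 1 for the threshold classifier psi(x) = 0 iff x <= threshold:
    eps_hat = E[c] * (mass of f(.|0) on R1) + (1 - E[c]) * (mass of f(.|1) on R0)."""
    (m0, s20), (m1, s21) = post[0], post[1]
    mass0_in_R1 = 1.0 - norm.cdf(threshold, loc=m0, scale=np.sqrt(1.0 + s20))
    mass1_in_R0 = norm.cdf(threshold, loc=m1, scale=np.sqrt(1.0 + s21))
    return Ec * mass0_in_R1 + (1.0 - Ec) * mass1_in_R0

print(effective_density(0.0, 0), norm.pdf(0.0, loc=-1.0, scale=np.sqrt(1.2)))  # should agree
print(bayesian_error_estimate(0.0))
```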

Theorem 2 Optimal Bayesian classification

An optimal Bayesian classifier, $\psi_{\mathrm{OBC}}$, satisfying (1) for all $\psi \in \mathcal{C}$, the set of all classifiers with measurable decision regions, exists and is given pointwise by
$$\psi_{\mathrm{OBC}}(x) = \begin{cases} 0 & \text{if } E_{\pi^*}[c]\, f(x|0) \ge (1 - E_{\pi^*}[c])\, f(x|1), \\ 1 & \text{otherwise.} \end{cases}$$
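In code, Theorem 2 is a pointwise comparison of the two weighted effective densities; a minimal sketch, with illustrative parameter names:

```python
def obc(x, Ec, f0, f1):
    """Pointwise OBC rule from Theorem 2: assign class 0 when
    E_pi*[c] * f(x|0) >= (1 - E_pi*[c]) * f(x|1), otherwise class 1.
    f0 and f1 are the effective class-conditional densities f(.|0) and f(.|1)."""
    return 0 if Ec * f0(x) >= (1.0 - Ec) * f1(x) else 1

# Usage with the effective densities sketched above (illustrative):
# label = obc(0.2, Ec, lambda x: effective_density(x, 0), lambda x: effective_density(x, 1))
```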

Section snippets

Transformations of the feature space

Consider an invertible transformation, $t: X \to \bar{X}$, mapping from some original feature space, $X$, to a new space, $\bar{X}$ (in the continuous case we also assume that the inverse map is continuously differentiable). The following theorem shows that the optimal Bayesian classifier in the transformed space can be found by transforming the optimal Bayesian classifier in the original feature space pointwise, and that both classifiers have the same expected true error.
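In code, this invariance amounts to composing the original-space optimal Bayesian classifier with the inverse transform; a minimal sketch, with illustrative names:

```python
import numpy as np

def transformed_obc(obc_original, t_inverse):
    """Given the OBC psi_OBC on the original space X and an invertible map
    t: X -> X_bar, the OBC on the transformed space is obtained pointwise as
    psi_bar_OBC(x_bar) = psi_OBC(t_inverse(x_bar)), and it has the same
    expected true error as the original-space OBC."""
    return lambda x_bar: obc_original(t_inverse(x_bar))

# Example: if the features were transformed by t(x) = log(x) (positive data),
# then t_inverse = np.exp and psi_bar = transformed_obc(psi_obc, np.exp).
```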

The advantages of this fundamental property …

Convergence to the Bayes classifier

A key property of a classification rule is consistency: does the classifier converge to a Bayes classifier as $n \to \infty$? In contrast to the Bayesian modeling framework, analysis in this section uses frequentist asymptotics, which concern behavior with respect to a fixed parameter and its sampling distribution. In particular, the next theorem shows that consistency holds for optimal Bayesian classification, as long as the true distribution is contained in the parameterized family, under mild conditions …
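This convergence can be observed empirically even in a very simple model. The sketch below uses a hypothetical single-binary-feature setup with Beta(1,1) priors and known $c$ (values illustrative, not from the paper) and shows the OBC decisions matching the Bayes classifier's decisions as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
c, p = 0.5, {0: 0.3, 1: 0.8}   # known c and true (unknown to the OBC) P(X=1|y)

def obc_discrete(sample_x, sample_y):
    """OBC for one binary feature with Beta(1,1) priors on P(X=1|y): label x as 0
    iff c*f(x|0) >= (1-c)*f(x|1), where f(.|y) is the posterior-predictive
    (effective) class-conditional pmf."""
    f = {}
    for y in (0, 1):
        xs = sample_x[sample_y == y]
        n_y, k_y = len(xs), int(xs.sum())
        f[y] = {1: (1 + k_y) / (2 + n_y), 0: (1 + n_y - k_y) / (2 + n_y)}
    return lambda x: 0 if c * f[0][x] >= (1 - c) * f[1][x] else 1

bayes = lambda x: 0 if c * (p[0] if x else 1 - p[0]) >= (1 - c) * (p[1] if x else 1 - p[1]) else 1

for n in (10, 100, 1000):
    ys = rng.integers(0, 2, size=n)                                   # labels, c = 0.5
    xs = (rng.random(n) < np.where(ys == 1, p[1], p[0])).astype(int)  # features
    psi = obc_discrete(xs, ys)
    print(n, [psi(x) for x in (0, 1)], "Bayes:", [bayes(x) for x in (0, 1)])
```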

Optimal Bayesian classifiers for Gaussian models with non-informative priors

In this section, we compare optimal Bayesian classifiers using non-informative priors with plug-in classifiers under Gaussian modeling assumptions, including quadratic discriminant analysis (QDA), linear discriminant analysis (LDA) and nearest mean classification (NMC). Our focus is on the close relationships between optimal Bayesian classifiers and their plug-in counterparts in terms of analytic formulation, approximation, and convergence as $n \to \infty$.
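For reference, the plug-in rules above estimate the unknown Gaussian parameters from the sample and substitute them into the fixed-distribution optimal discriminants; a minimal sketch (no regularization, multivariate features, illustrative API):

```python
import numpy as np

def plug_in_classifiers(X0, X1, c_hat=0.5):
    """Plug-in QDA, LDA and NMC built from sample means and covariances.
    X0 and X1 are (n_y x d) arrays of class-0 and class-1 training points."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0, S1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)
    Sp = ((len(X0) - 1) * S0 + (len(X1) - 1) * S1) / (len(X0) + len(X1) - 2)  # pooled

    def qda(x):  # different covariances: quadratic discriminant
        def g(m, S, prior):
            d = x - m
            return -0.5 * np.log(np.linalg.det(S)) - 0.5 * d @ np.linalg.solve(S, d) + np.log(prior)
        return 0 if g(m0, S0, c_hat) >= g(m1, S1, 1.0 - c_hat) else 1

    def lda(x):  # common (pooled) covariance: linear discriminant
        w = np.linalg.solve(Sp, m1 - m0)
        b = -0.5 * (m0 + m1) @ w + np.log((1.0 - c_hat) / c_hat)
        return 1 if x @ w + b > 0 else 0

    def nmc(x):  # nearest mean classification
        return 0 if np.linalg.norm(x - m0) <= np.linalg.norm(x - m1) else 1

    return qda, lda, nmc
```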

With mean $\mu_y$, covariance $\Sigma_y$, and $c$ known (see …

Relationship to optimal Bayesian robust classifiers

The optimal Bayesian classifier has robust modeling assumptions in the sense that it is not optimal for a single assumed feature-label distribution, in which case no data would be required and the optimal classifier would be the Bayes classifier for that distribution; rather, for optimal Bayesian classification the actual feature-label distribution is assumed to belong to an uncertainty class governed by a prior distribution, and the optimal Bayesian classifier minimizes the expected …

Robustness of optimal Bayesian classifiers to false modeling assumptions

Optimal Bayesian classification is equivalent to Bayesian robust classification with the posterior in place of the prior. In this sense, optimal Bayesian classification is “robust” when operating within the assumed model. We next consider the important issue of robustness to false modeling assumptions, with emphasis on incorrect priors with varying degrees of information.
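A simple way to probe such robustness numerically is to design a classifier under the assumed model but draw both training and test data from a possibly mis-specified true distribution; a minimal harness sketch, with hypothetical user-supplied callables:

```python
import numpy as np

def robustness_check(design_rule, sample_from_truth, n_train=50, n_test=100_000, seed=0):
    """Train a classifier (e.g., an OBC designed under an assumed prior/model) on
    data from a possibly mis-specified true distribution, then estimate its true
    error on a large test sample from the same truth. design_rule(X, y) must
    return a callable classifier; sample_from_truth(n, rng) returns (X, y)."""
    rng = np.random.default_rng(seed)
    Xtr, ytr = sample_from_truth(n_train, rng)
    Xte, yte = sample_from_truth(n_test, rng)
    psi = design_rule(Xtr, ytr)
    preds = np.array([psi(x) for x in Xte])
    return float(np.mean(preds != yte))
```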

Conclusion

This work ties Bayesian classifier design and Bayesian error estimation together with the old problem of optimal robust filtering. As with Wiener filtering, we first find representations for some error measure (e.g., expected error or MSE) and then find optimizing parameters. Optimal Bayesian classification has a connection with Bayesian robust classification, with the distinction that it permits optimization over an arbitrary space of classifiers and utilizes a posterior distribution of the …


References (18)

  • S.A. Kassam et al.

    Robust Wiener filters

    Journal of the Franklin Institute

    (1977)
  • E.R. Dougherty et al.

    Optimal robust classifiers

    Pattern Recognition

    (2005)
  • A.M. Zapała

    Unbounded mappings and weak convergence of measures

    Statistics & Probability Letters

    (2008)
  • E.R. Dougherty et al.

    Robust optimal granulometric bandpass filters

    Signal Processing

    (2001)
  • A.M. Grigoryan et al.

    Bayesian robust optimal linear filters

    Signal Processing

    (2001)
  • L.A. Dalton, E.R. Dougherty, Optimal classifiers with minimum expected error within a Bayesian framework—part I:...
  • L.A. Dalton et al.

    Bayesian minimum mean-square error estimation for classification error—Part I: definition and the Bayesian MMSE error estimator for discrete classification

    IEEE Transactions on Signal Processing

    (2011)
  • L.A. Dalton et al.

    Bayesian minimum mean-square error estimation for classification error—Part II: the Bayesian MMSE error estimator for linear classification of Gaussian distributions

    IEEE Transactions on Signal Processing

    (2011)
  • L.A. Dalton et al.

    Exact sample conditioned MSE performance of the Bayesian MMSE estimator for classification error—Part I: representation

    IEEE Transactions on Signal Processing

    (2012)


Lori A. Dalton received the B.Sc., M.Sc. and Ph.D. degrees in electrical engineering at Texas A&M University, College Station, in 2001, 2002, and 2012, respectively. She is currently an Assistant Professor of Electrical and Computer Engineering and an Assistant Professor of Biomedical Informatics at The Ohio State University in Columbus, OH. Dr. Dalton was awarded an NSF Graduate Research Fellowship in 2001, and she was awarded the Association of Former Students Distinguished Graduate Student Masters Research Award in 2003. Her current research interests include genomic signal processing, pattern recognition, estimation, optimization, robust filtering, information theory and systems biology.

Edward R. Dougherty received the Ph.D. degree in mathematics from Rutgers University, New Brunswick, NJ, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology, Finland. He is a Professor in the Department of Electrical and Computer Engineering, Texas A&M University, College Station, where he holds the Robert M. Kennedy '26 Chair in Electrical Engineering and is Director of the Genomic Signal Processing Laboratory. He is also co-Director of the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ. Dr. Dougherty is a Fellow of SPIE and has received the SPIE President's Award.
