Elsevier

Pattern Recognition

Volume 48, Issue 11, November 2015, Pages 3766-3782

Discrete optimal Bayesian classification with error-conditioned sequential sampling

https://doi.org/10.1016/j.patcog.2015.03.023

Highlights

  • A sampling algorithm for training the optimal Bayesian classifier is introduced.

  • The algorithm minimizes the expected classification error over the uncertainty class defined by the prior knowledge.

  • Using a Zipf model, we show that our sampling algorithm yields a lower true error on average than random sampling.

  • Our algorithm remains robust even when the prior knowledge drifts away from the true distributions.

  • An example using data from the p53 network shows that our method also performs well on real pathway data.

Abstract

When one possesses prior knowledge concerning the feature-label distribution, in particular, when the feature-label distribution is known to belong to an uncertainty class of distributions governed by a prior distribution, this knowledge can be used in conjunction with the training data to construct the optimal Bayesian classifier (OBC), whose performance is, on average, optimal among all classifiers relative to the posterior distribution derived from the prior distribution and the data. Classification theory typically assumes that sampling is performed randomly in accordance with the prior probabilities on the classes, and this has heretofore been true for the OBC as well. In the present paper we propose to forgo random sampling and instead utilize the prior knowledge and previously collected data to determine which class to sample from at each step of the sampling. Specifically, we sample from the class that yields the smallest expected classification error once the new sample point is added. We demonstrate the superiority of the resulting nonrandom sampling procedure over random sampling on both synthetic data and data generated from known biological pathways.

Introduction

In many classification applications one is limited to small samples. For instance, in medicine, where classification may involve diagnosis, prognosis, or treatment option, data can be limited due to specimen availability, cost, or the time necessary to obtain and process specimens (which is related to cost). In classification theory it is generally assumed that sampling is random, meaning that the training data are independent and identically distributed (i.i.d.); indeed, the assumption of random sampling is typically made throughout a text on classification. For instance, Devroye et al. declare on page 2 of their text that all sampling is random [1]. The assumption is so pervasive that it may be applied without being mentioned. Duda et al. state: “In typical supervised pattern classification problems, the estimation of the prior probabilities presents no serious difficulties.” [2]. Implicit in this statement is that the ratio of the number of data points in a class with respect to the total sample size converges to the class probability, as it does in the case of random sampling according to Bernoulli's law of large numbers. No doubt, random sampling has advantages, but is it most efficient in classifier design, especially when one is constrained to small samples?

The effects of nonrandom sampling owing to correlation in the training data have been examined as far back as the early 1970s using numerical examples [3] and the issue subsequently has been examined by studying the effects on asymptotic error rates in the context of linear discriminant analysis (LDA) [4], [5], [6]. With small samples, asymptotic results are not really relevant. More recently, nonrandom sampling has been addressed for finite samples by providing representation of the first- and second-order moments for expected errors arising from nonrandom sampling, again in the framework of LDA [7]. In particular, these results demonstrate that nonrandom sampling can be advantageous depending on the correlation structure within the data.

Here we consider a specific scenario for nonrandom sampling. Given a sample, Sn, consisting of n data points, if another data point is to be selected and a classifier designed from the larger sample, Sn+1, would it be better to select the new point in an i.i.d. fashion, which means it could come from either class-conditional distribution, or to predetermine the class from which it is to be chosen based on some class-selection criterion, in which case Sn+1 would not be a random sample, even if Sn were a random sample? The answer depends on having a suitable criterion whose application leads to making a beneficial choice as to whether or not to select an i.i.d. data point. By working within the framework of optimal Bayesian classification, we can establish such a criterion and obtain an advantageous nonrandom sampling procedure. In this framework, one has an uncertainty class of possible feature-label distributions and a prior distribution governing the uncertainty class. This allows one to determine the minimum mean-square-error (MMSE) estimate of the error based on the prior distribution and the data [8], [9]. An optimal Bayesian classifier (OBC) possesses minimum expected error across the uncertainty class [10], [11]. Relative to the sampling procedure, the aim is to select the next data point in such a way as to minimize the expected error of the optimal Bayesian classifier, the critical point being that the Bayesian framework facilitates determination of the expected error, which is impossible in the ordinary purely data-driven setting.
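For the discrete model considered below, the OBC takes a simple form under conjugate Dirichlet priors: each bin is labeled by the class with the larger posterior-expected mass. A minimal sketch, assuming independent Dirichlet priors on the two class-conditional distributions and a known class prior c (function and variable names are ours, not the paper's notation):

```python
# Sketch of the discrete OBC under independent Dirichlet priors on the two
# class-conditional pmfs, with c = P(y = 0) assumed known. The hyperparameter
# and count names (alpha0, alpha1, counts0, counts1) are illustrative.

def obc_discrete(alpha0, alpha1, counts0, counts1, c):
    """Return the OBC labeling of the b bins.

    alpha0, alpha1   : Dirichlet hyperparameters for classes 0 and 1
    counts0, counts1 : observed bin counts from each class
    c                : known prior probability of class 0
    """
    n0 = sum(alpha0) + sum(counts0)
    n1 = sum(alpha1) + sum(counts1)
    labels = []
    for a0, u0, a1, u1 in zip(alpha0, counts0, alpha1, counts1):
        # Posterior means of the bin probabilities under the conjugate update.
        ep = (a0 + u0) / n0
        eq = (a1 + u1) / n1
        # The OBC labels the bin by the larger posterior-expected class mass.
        labels.append(0 if c * ep >= (1.0 - c) * eq else 1)
    return labels
```

With uniform priors and counts of (8, 2) for class 0 versus (1, 9) for class 1, the sketch labels the first bin as class 0 and the second as class 1, matching intuition.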

This work focuses on discrete classification. Using Monte Carlo simulations with both synthetic data and data simulated from real biological pathways, we demonstrate the effectiveness of the proposed nonrandom sampling paradigm relative to random sampling and examine some of its properties.

Other methods for nonrandom sampling have been proposed that possess conceptual similarities as well as vital differences with the approach proposed herein. These include online learning and active sampling (learning).

In online learning, sequential measurements are made, one at a time, to improve an uncertain model. In particular, the knowledge gradient (KG) algorithm assumes that one of M alternatives can be measured at each time step, each yielding a random reward with an unknown mean and known variance (corresponding to measurement error) [12]. The aim is to make sequential measurements that will maximize the expected total reward to be collected over a time period, thereby treating the problem as a multi-armed bandit process [13]. To achieve this goal, at every time step one tries to identify the optimal KG policy that allows one to choose a measurement (among the M available alternatives) that is expected to bring the largest improvement. The alternative measurements (or rewards) are typically assumed to be independent Gaussian random variables and prior knowledge concerning the measurements and their correlations can be incorporated into the problem via their joint distribution. Our proposed Bayesian framework for nonrandom sampling utilizes a substantially different approach, in that it puts a prior distribution on an uncertainty class of feature-label distributions. Among the key differences resulting from this Bayesian framework is that the distribution of the reward (cost) is not directly modeled; instead, we estimate the expected cost, which is classification error. Moreover, we do not impose restrictions on the variance of our cost/reward in the case of pursuing each policy.
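To make the contrast concrete, the KG policy for independent normal beliefs with known measurement variance can be sketched as follows. This is a standard form of the KG value (names and parameterization are ours, not those of [12]); note that it scores alternatives by expected improvement in reward, not by classification error as our criterion does:

```python
import math

def kg_values(mu, sigma2, noise2):
    """One-step knowledge-gradient value of measuring each of M alternatives,
    given independent normal beliefs N(mu[x], sigma2[x]) and known
    measurement variance noise2."""
    def f(z):
        # f(z) = z * Phi(z) + phi(z), the standard-normal "expected
        # improvement" function used by the KG formula.
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return z * Phi + phi

    vals = []
    for x in range(len(mu)):
        # Predictive reduction in posterior variance from one noisy
        # measurement of alternative x (conjugate normal update).
        s2_new = 1.0 / (1.0 / sigma2[x] + 1.0 / noise2)
        sig_tilde = math.sqrt(max(sigma2[x] - s2_new, 0.0))
        if sig_tilde == 0.0:
            vals.append(0.0)
            continue
        best_other = max(m for j, m in enumerate(mu) if j != x)
        zeta = -abs(mu[x] - best_other) / sig_tilde
        vals.append(sig_tilde * f(zeta))
    return vals
```

The KG policy then measures the alternative with the largest value; in our setting, by contrast, the "measurement" is a labeled sample point and the quantity driven down is the expected OBC error.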

Active sampling has a long history in machine learning, going back to [14], [15]. As discussed in [16], the essence of active sampling algorithms is to control the selection of potential unlabeled training points in the sample space to be labeled and used for further training. A generic active sampling algorithm is described in [17]. While there are conceptual similarities with our work, there are fundamental differences. Our goal is not to search among unlabeled sample points for those for which we wish to generate labels; rather, we generate new sample points from a chosen known label. Moreover, we directly target reduction of classification error. Reducing uncertainty in our class probability distributions is a side effect, not the direct goal. Considering active learning under a Bayesian framework as in [18] does not eliminate the difference because the underlying strategy is to choose sample points to label.

The rest of the paper is organized as follows. In Section 2 the general framework of the discrete classification problem and the optimal Bayesian classifier is introduced. In Section 3 the proposed sampling algorithm is described. Section 4 presents results of applying the proposed sampling method to classification with synthetic data from a Zipf model. In Section 5 the effect of the proposed method is studied on data generated from pathways. Section 6 concludes the paper.

Throughout this paper, we use bold letters to denote vectors, e.g. p or U. Capital letters are used for random variables; when in bold they denote random vectors. The notation E_{π(θ)}[·] denotes expectation with respect to the parameter θ distributed according to π(θ).

Section snippets

The discrete model and optimal Bayesian classifier

The discrete model consists of b bins and two classes, y ∈ {0, 1}, with {p_i}_{i=1}^b and {q_i}_{i=1}^b being the class-conditional probabilities for i ∈ X = {1, …, b}, and c being the prior probability of class 0; that is, P(X = i | y = 0) = p_i and P(X = i | y = 1) = q_i for i = 1, …, b, and c = P(y = 0).

A classifier is a function ψ that maps sample points to a class, ψ : {1, …, b} → {0, 1}. The true classification error ε is the probability that a sample point from class y is classified by ψ as belonging to a different class; ε = P(ψ(X) ≠ y). The error can be
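In this discrete model the true error decomposes over bins, and the Bayes classifier labels each bin by the larger of c·p_i and (1 − c)·q_i. A minimal sketch (helper names are illustrative):

```python
# Sketch of the discrete model's true classification error and Bayes
# classifier. b bins, class-conditional pmfs p and q, prior c = P(y = 0).

def true_error(psi, p, q, c):
    """True error of classifier psi on the discrete model.

    psi  : list of 0/1 labels; psi[i] is the class assigned to bin i
    p, q : class-conditional bin probabilities for classes 0 and 1
    c    : prior probability of class 0
    """
    # A class-0 point (bin mass p[i]) is misclassified when psi[i] == 1,
    # and a class-1 point (bin mass q[i]) when psi[i] == 0.
    err0 = sum(p[i] for i in range(len(psi)) if psi[i] == 1)
    err1 = sum(q[i] for i in range(len(psi)) if psi[i] == 0)
    return c * err0 + (1.0 - c) * err1

def bayes_classifier(p, q, c):
    """The Bayes classifier labels bin i as class 0 iff c*p[i] >= (1-c)*q[i]."""
    return [0 if c * p[i] >= (1.0 - c) * q[i] else 1 for i in range(len(p))]
```

For example, with p = [0.7, 0.3], q = [0.2, 0.8], and c = 0.5, the Bayes classifier is [0, 1] and its true error is 0.25.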

Error-conditioned sequential sampling algorithm

The aim of this paper, elaborated in this section, is to improve the performance of the OBC by controlling the sampling procedure, the heuristic being that it is better to iterate the updating of the posterior distribution by selecting from the class for which the selected point would most improve the performance of the OBC. As is often the case with an improvement in performance, greater knowledge must be assumed at the outset. In this case, since sampling would no
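Our reading of this error-conditioned criterion for the discrete Dirichlet model can be sketched as follows: for each candidate class, average the posterior OBC error estimate over the predictive distribution of the bin in which the new point would fall, and sample from the class with the smaller value. This is a sketch under our stated assumptions, not the paper's exact algorithm; a0 and a1 hold the posterior pseudo-counts (hyperparameters plus observed counts) per bin:

```python
# Hypothetical sketch of one error-conditioned sampling step for the
# discrete model with Dirichlet-multinomial posteriors summarized by
# per-bin pseudo-counts a0 and a1; c = P(y = 0) is assumed known.

def expected_obc_error(a0, a1, c):
    """Bayesian estimate of the OBC error: for each bin, the smaller of
    the two posterior-expected class masses."""
    n0, n1 = sum(a0), sum(a1)
    return sum(min(c * x / n0, (1 - c) * y / n1) for x, y in zip(a0, a1))

def choose_class(a0, a1, c):
    """Pick the class whose next sample point minimizes the expected
    posterior OBC error, averaged over the predictive bin distribution."""
    scores = []
    for y, (a, other) in enumerate([(a0, a1), (a1, a0)]):
        n = sum(a)
        exp_err = 0.0
        for i in range(len(a)):
            # Posterior predictive probability a[i]/n that the new class-y
            # point falls in bin i, times the error after that update.
            a_new = a[:i] + [a[i] + 1] + a[i + 1:]
            post = (expected_obc_error(a_new, other, c) if y == 0
                    else expected_obc_error(other, a_new, c))
            exp_err += (a[i] / n) * post
        scores.append(exp_err)
    return 0 if scores[0] <= scores[1] else 1
```

In a symmetric situation the two scores coincide and the tie goes to class 0; asymmetric uncertainty between the classes is what drives the choice in practice.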

Simulations with synthetic data

This section utilizes a set of experiments to examine the effect of the proposed sampling procedure on the performance of optimal Bayesian classifiers via synthetic Monte Carlo simulations. We consider a discrete model with two classes and both 16 and 32 bins. Different values of the class prior probability c are considered. Furthermore, we assume that there is a true class-conditional probability vector for each class, namely, vectors p_true and q_true, from which sample points are drawn. As
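The Zipf model used for the synthetic experiments assigns bin probabilities that decay as a power of rank. A minimal generator sketch (the paper's exact parameterization may differ):

```python
def zipf_pmf(b, alpha=1.0):
    """Zipf-style bin probabilities p_i proportional to 1/i**alpha over
    b bins, normalized to sum to one (illustrative parameterization)."""
    w = [1.0 / (i ** alpha) for i in range(1, b + 1)]
    s = sum(w)
    return [x / s for x in w]
```

For instance, zipf_pmf(16) yields a strictly decreasing probability vector over 16 bins, concentrating mass in the low-index bins; larger alpha makes the decay steeper.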

Numerical experiments on real pathways

A major area of research in translational genomics involves classification of cell condition based on genetic activity, which in medicine corresponds to diagnosing the presence or type of disease. This requires designing expression-based classifiers based on genes whose product abundances indicate critical differences in cell state. For cancer diagnosis, classification can be between different kinds of cancer, different stages of tumor development, different prognoses, or other such

Conclusion

This study has shown that prior knowledge concerning the classes can be used to select training points for classifier design in a more efficient fashion than random sampling. The method has been described mathematically and its performance studied via Monte Carlo simulations on both synthetic and real-pathway generated data. We have observed that the proposed method shows more improvement as the difference in the amounts of uncertainty regarding the two classes increases and that performance

Conflict of interest

None declared.


References (36)

  • L. Devroye et al., A Probabilistic Theory of Pattern Recognition (1996)
  • R.O. Duda et al., Pattern Classification (2000)
  • J. Basu et al., The effects of intraclass correlation on certain significance tests when sampling from multivariate normal population, Commun. Stat.-Theory Methods (1974)
  • L.A. Dalton et al., Bayesian minimum mean-square error estimation for classification error. Part I: Definition and the Bayesian MMSE error estimator for discrete classification, IEEE Trans. Signal Process. (2011)
  • L.A. Dalton et al., Bayesian minimum mean-square error estimation for classification error. Part II: Linear classification of Gaussian models, IEEE Trans. Signal Process. (2011)
  • I.O. Ryzhov et al., The knowledge gradient algorithm for a general class of online learning problems, Oper. Res. (2012)
  • J.C. Gittins, Bandit processes and dynamic allocation indices, J. R. Stat. Soc. Ser. B (Methodological) (1979)
  • H.A. Simon, G. Lea, Problem solving and rule induction: a unified view, ...

    Ariana Broumand received the B.Sc. and M.Sc. degrees from the University of Tehran in 2009 and 2012, in electrical and biomedical (bioelectrical) engineering respectively. During his M.S. course he spent 8 months as a visiting researcher at University of Rostock, Germany. He is currently a Ph.D. student at Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX. His current research interests include genomic signal processing, and Bayesian statistics and bioinformatics.

    Mohammad Shahrokh Esfahani received the Ph.D. degree in electrical engineering from Texas A&M University, in 2014. He received the B.Sc. and M.Sc. degrees from the University of Tehran and Sharif University of Technology, respectively in 2007 and 2009, all in Electrical Engineering. He is currently a Postdoctoral Research Associate in the Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX. His current research interests include genomic signal processing, uncertainty quantification, and Bayesian statistics.

    Byung-Jun Yoon received the B.S.E. (summa cum laude) degree from the Seoul National University, Seoul, Korea, in 1998, and the M.S. and Ph.D. degrees from the California Institute of Technology, Pasadena, in 2002 and 2007, respectively, all in Electrical Engineering. In 2008, he joined the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, where he was an Assistant Professor during 2008–2014 and has been an Associate Professor since 2014. Recently, Dr. Yoon joined Hamad bin Khalifa University (HBKU), College of Science and Engineering (CSE), Doha, Qatar, as a founding faculty member, where he is currently an Associate Professor. His recent honors include the NSF CAREER Award and the Best Paper Award at the 9th Asia Pacific Bioinformatics Conference (APBC). His main research interests include genomic signal processing (GSP), bioinformatics, and computational network biology.

    Edward R. Dougherty is a Distinguished Professor in the Department of Electrical and Computer Engineering at Texas A&M University in College Station, TX, where he holds the Robert M. Kennedy '26 Chair in Electrical Engineering and is Scientific Director of the Center for Bioinformatics and Genomic Systems Engineering. He holds a Ph.D. in mathematics from Rutgers University and an M.S. in Computer Science from Stevens Institute of Technology, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology. He is a fellow of both IEEE and SPIE, has received the SPIE President's Award, and served as the editor of the SPIE/IS&T Journal of Electronic Imaging. At Texas A&M University he has received the Association of Former Students Distinguished Achievement Award in Research, and been named Fellow of the Texas Engineering Experiment Station and Halliburton Professor of the Dwight Look College of Engineering. Prof. Dougherty is the author of 16 books and more than 300 journal papers.
