Elsevier

Pattern Recognition

Volume 48, Issue 11, November 2015, Pages 3766-3782

Discrete optimal Bayesian classification with error-conditioned sequential sampling

https://doi.org/10.1016/j.patcog.2015.03.023

Highlights

  • A sampling algorithm for training the optimal Bayesian classifier is introduced.

  • The algorithm minimizes the expected classification error over the uncertainty class defined by the prior knowledge.

  • Using a Zipf model, we show that our sampling algorithm yields a lower true error on average than random sampling.

  • Our algorithm remains robust even when the prior knowledge drifts away from the true distributions.

  • An example using data from the p53 network shows that our method also performs well on real pathway data.

Abstract

When one possesses prior knowledge concerning the feature-label distribution, in particular, when the feature-label distribution is known to belong to an uncertainty class of distributions governed by a prior distribution, this knowledge can be used in conjunction with the training data to construct the optimal Bayesian classifier (OBC), whose performance is, on average, optimal among all classifiers relative to the posterior distribution derived from the prior distribution and the data. Classification theory typically assumes that sampling is performed randomly in accordance with the prior probabilities on the classes, and this has heretofore been true for the OBC as well. In the present paper we propose to forgo random sampling and instead utilize the prior knowledge and previously collected data to determine which class to sample from at each step of the sampling. Specifically, we sample from the class that yields the smallest expected classification error once the new sample point is added. We demonstrate the superiority of the resulting nonrandom sampling procedure over random sampling on both synthetic data and data generated from known biological pathways.

Introduction

In many classification applications one is limited to small samples. For instance, in medicine, where classification may involve diagnosis, prognosis, or treatment option, data can be limited due to specimen availability, cost, or the time necessary to obtain and process specimens (which is related to cost). In classification theory it is generally assumed that sampling is random, meaning that the training data are independent and identically distributed (i.i.d.); indeed, the assumption of random sampling is typically made throughout a text on classification. For instance, Devroye et al. declare on page 2 of their text that all sampling is random [1]. The assumption is so pervasive that it may be applied without being mentioned. Duda et al. state: “In typical supervised pattern classification problems, the estimation of the prior probabilities presents no serious difficulties.” [2]. Implicit in this statement is that the ratio of the number of data points in a class with respect to the total sample size converges to the class probability, as it does in the case of random sampling according to Bernoulli's law of large numbers. No doubt, random sampling has advantages, but is it most efficient in classifier design, especially when one is constrained to small samples?

The effects of nonrandom sampling owing to correlation in the training data have been examined as far back as the early 1970s using numerical examples [3] and the issue subsequently has been examined by studying the effects on asymptotic error rates in the context of linear discriminant analysis (LDA) [4], [5], [6]. With small samples, asymptotic results are not really relevant. More recently, nonrandom sampling has been addressed for finite samples by providing representation of the first- and second-order moments for expected errors arising from nonrandom sampling, again in the framework of LDA [7]. In particular, these results demonstrate that nonrandom sampling can be advantageous depending on the correlation structure within the data.

Here we consider a specific scenario for nonrandom sampling. Given a sample, Sn, consisting of n data points, if another data point is to be selected and a classifier designed from the larger sample, Sn+1, would it be better to select the new point in an i.i.d. fashion, which means it could come from either class-conditional distribution, or to predetermine the class from which it is to be chosen based on some class-selection criterion, in which case Sn+1 would not be a random sample, even if Sn were a random sample? The answer depends on having a suitable criterion whose application leads to making a beneficial choice as to whether or not to select an i.i.d. data point. By working within the framework of optimal Bayesian classification, we can establish such a criterion and obtain an advantageous nonrandom sampling procedure. In this framework, one has an uncertainty class of possible feature-label distributions and a prior distribution governing the uncertainty class. This allows one to determine the minimum mean-square-error (MMSE) estimate of the error based on the prior distribution and the data [8], [9]. An optimal Bayesian classifier (OBC) possesses minimum expected error across the uncertainty class [10], [11]. Relative to the sampling procedure, the aim is to select the next data point in such a way as to minimize the expected error of the optimal Bayesian classifier, the critical point being that the Bayesian framework facilitates determination of the expected error, which is impossible in the ordinary purely data-driven setting.
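For the discrete model considered below, the OBC takes a simple form under conjugate Dirichlet priors: each bin is labeled by the class with the larger posterior-expected mass. A minimal sketch, assuming independent Dirichlet priors on the two class-conditional distributions and a known class prior c (function and variable names are ours, not the paper's notation):

```python
# Sketch of the discrete OBC under independent Dirichlet priors on the two
# class-conditional pmfs, with c = P(y = 0) assumed known. The hyperparameter
# and count names (alpha0, alpha1, counts0, counts1) are illustrative.

def obc_discrete(alpha0, alpha1, counts0, counts1, c):
    """Return the OBC labeling of the b bins.

    alpha0, alpha1   : Dirichlet hyperparameters for classes 0 and 1
    counts0, counts1 : observed bin counts from each class
    c                : known prior probability of class 0
    """
    n0 = sum(alpha0) + sum(counts0)
    n1 = sum(alpha1) + sum(counts1)
    labels = []
    for a0, u0, a1, u1 in zip(alpha0, counts0, alpha1, counts1):
        # Posterior means of the bin probabilities under the conjugate update.
        ep = (a0 + u0) / n0
        eq = (a1 + u1) / n1
        # The OBC labels the bin by the larger posterior-expected class mass.
        labels.append(0 if c * ep >= (1.0 - c) * eq else 1)
    return labels
```

With uniform priors and counts of (8, 2) for class 0 versus (1, 9) for class 1, the sketch labels the first bin as class 0 and the second as class 1, matching intuition.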

This work focuses on discrete classification. Using Monte Carlo simulations with both synthetic data and data simulated from real biological pathways, we demonstrate the effectiveness of the proposed nonrandom sampling paradigm relative to random sampling and examine some of its properties.

Other methods for nonrandom sampling have been proposed that possess conceptual similarities as well as vital differences with the approach proposed herein. These include online learning and active sampling (learning).

In online learning, sequential measurements are made, one at a time, to improve an uncertain model. In particular, the knowledge gradient (KG) algorithm assumes that one of M alternatives can be measured at each time step, each yielding a random reward with an unknown mean and known variance (corresponding to measurement error) [12]. The aim is to make sequential measurements that will maximize the expected total reward to be collected over a time period, thereby treating the problem as a multi-armed bandit process [13]. To achieve this goal, at every time step one tries to identify the optimal KG policy that allows one to choose a measurement (among the M available alternatives) that is expected to bring the largest improvement. The alternative measurements (or rewards) are typically assumed to be independent Gaussian random variables and prior knowledge concerning the measurements and their correlations can be incorporated into the problem via their joint distribution. Our proposed Bayesian framework for nonrandom sampling utilizes a substantially different approach, in that it puts a prior distribution on an uncertainty class of feature-label distributions. Among the key differences resulting from this Bayesian framework is that the distribution of the reward (cost) is not directly modeled; instead, we estimate the expected cost, which is classification error. Moreover, we do not impose restrictions on the variance of our cost/reward in the case of pursuing each policy.
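To make the contrast concrete, the KG policy for independent normal beliefs with known measurement variance can be sketched as follows. This is a standard form of the KG value (names and parameterization are ours, not those of [12]); note that it scores alternatives by expected improvement in reward, not by classification error as our criterion does:

```python
import math

def kg_values(mu, sigma2, noise2):
    """One-step knowledge-gradient value of measuring each of M alternatives,
    given independent normal beliefs N(mu[x], sigma2[x]) and known
    measurement variance noise2."""
    def f(z):
        # f(z) = z * Phi(z) + phi(z), the standard-normal "expected
        # improvement" function used by the KG formula.
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return z * Phi + phi

    vals = []
    for x in range(len(mu)):
        # Predictive reduction in posterior variance from one noisy
        # measurement of alternative x (conjugate normal update).
        s2_new = 1.0 / (1.0 / sigma2[x] + 1.0 / noise2)
        sig_tilde = math.sqrt(max(sigma2[x] - s2_new, 0.0))
        if sig_tilde == 0.0:
            vals.append(0.0)
            continue
        best_other = max(m for j, m in enumerate(mu) if j != x)
        zeta = -abs(mu[x] - best_other) / sig_tilde
        vals.append(sig_tilde * f(zeta))
    return vals
```

The KG policy then measures the alternative with the largest value; in our setting, by contrast, the "measurement" is a labeled sample point and the quantity driven down is the expected OBC error.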

Active sampling has a long history in machine learning, going back to [14], [15]. As discussed in [16], the essence of active sampling algorithms is to control the selection of potential unlabeled training points in the sample space to be labeled and used for further training. A generic active sampling algorithm is described in [17]. While there are conceptual similarities with our work, there are fundamental differences. Our goal is not to search among unlabeled sample points for those for which we wish to generate labels; rather, we generate new sample points from a chosen known label. Moreover, we directly target reduction of classification error. Reducing uncertainty in our class probability distributions is a side effect, not the direct goal. Considering active learning under a Bayesian framework as in [18] does not eliminate the difference because the underlying strategy is to choose sample points to label.

The rest of the paper is organized as follows. In Section 2 the general framework of the discrete classification problem and the optimal Bayesian classifier is introduced. In Section 3 the proposed sampling algorithm is described. Section 4 presents results of applying the proposed sampling method to classification with synthetic data from a Zipf model. In Section 5 the effect of the proposed method is studied on data generated from pathways. Section 6 concludes the paper.

Throughout this paper, we use bold letters to denote vectors, e.g. p or U. Capital letters are used for random variables; when in bold they denote random vectors. The notation E_{π(θ)}[·] denotes expectation with respect to the parameter θ distributed according to π(θ).

Section snippets

The discrete model and optimal Bayesian classifier

The discrete model consists of b bins and two classes, y ∈ {0, 1}, with {p_i}_{i=1}^b and {q_i}_{i=1}^b being the class-conditional probabilities for i ∈ X = {1, …, b}, and c being the prior probability of class 0; that is, P(X = i | y = 0) = p_i and P(X = i | y = 1) = q_i for i = 1, …, b, and c = P(y = 0).

A classifier is a function ψ that maps sample points to a class, ψ : {1, …, b} → {0, 1}. The true classification error ε is the probability that a sample point from class y is classified by ψ as belonging to a different class; ε = P(ψ(X) ≠ y). The error can be
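In this discrete model the true error decomposes over bins, and the Bayes classifier labels each bin by the larger of c·p_i and (1 − c)·q_i. A minimal sketch (helper names are illustrative):

```python
# Sketch of the discrete model's true classification error and Bayes
# classifier. b bins, class-conditional pmfs p and q, prior c = P(y = 0).

def true_error(psi, p, q, c):
    """True error of classifier psi on the discrete model.

    psi  : list of 0/1 labels; psi[i] is the class assigned to bin i
    p, q : class-conditional bin probabilities for classes 0 and 1
    c    : prior probability of class 0
    """
    # A class-0 point (bin mass p[i]) is misclassified when psi[i] == 1,
    # and a class-1 point (bin mass q[i]) when psi[i] == 0.
    err0 = sum(p[i] for i in range(len(psi)) if psi[i] == 1)
    err1 = sum(q[i] for i in range(len(psi)) if psi[i] == 0)
    return c * err0 + (1.0 - c) * err1

def bayes_classifier(p, q, c):
    """The Bayes classifier labels bin i as class 0 iff c*p[i] >= (1-c)*q[i]."""
    return [0 if c * p[i] >= (1.0 - c) * q[i] else 1 for i in range(len(p))]
```

For example, with p = [0.7, 0.3], q = [0.2, 0.8], and c = 0.5, the Bayes classifier is [0, 1] and its true error is 0.25.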

Error-conditioned sequential sampling algorithm

The aim of this paper, elaborated in this section, is to improve the performance of the OBC by controlling the sampling procedure, the heuristic being that it is better to iterate the updating of the posterior distribution by selecting from the class for which the selected point would most improve the performance of the OBC. As is often the case with an improvement in performance, greater knowledge must be assumed at the outset. In this case, since sampling would no
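Our reading of this error-conditioned criterion for the discrete Dirichlet model can be sketched as follows: for each candidate class, average the posterior OBC error estimate over the predictive distribution of the bin in which the new point would fall, and sample from the class with the smaller value. This is a sketch under our stated assumptions, not the paper's exact algorithm; a0 and a1 hold the posterior pseudo-counts (hyperparameters plus observed counts) per bin:

```python
# Hypothetical sketch of one error-conditioned sampling step for the
# discrete model with Dirichlet-multinomial posteriors summarized by
# per-bin pseudo-counts a0 and a1; c = P(y = 0) is assumed known.

def expected_obc_error(a0, a1, c):
    """Bayesian estimate of the OBC error: for each bin, the smaller of
    the two posterior-expected class masses."""
    n0, n1 = sum(a0), sum(a1)
    return sum(min(c * x / n0, (1 - c) * y / n1) for x, y in zip(a0, a1))

def choose_class(a0, a1, c):
    """Pick the class whose next sample point minimizes the expected
    posterior OBC error, averaged over the predictive bin distribution."""
    scores = []
    for y, (a, other) in enumerate([(a0, a1), (a1, a0)]):
        n = sum(a)
        exp_err = 0.0
        for i in range(len(a)):
            # Posterior predictive probability a[i]/n that the new class-y
            # point falls in bin i, times the error after that update.
            a_new = a[:i] + [a[i] + 1] + a[i + 1:]
            post = (expected_obc_error(a_new, other, c) if y == 0
                    else expected_obc_error(other, a_new, c))
            exp_err += (a[i] / n) * post
        scores.append(exp_err)
    return 0 if scores[0] <= scores[1] else 1
```

In a symmetric situation the two scores coincide and the tie goes to class 0; asymmetric uncertainty between the classes is what drives the choice in practice.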

Simulations with synthetic data

This section utilizes a set of experiments to examine the effect of the proposed sampling procedure on the performance of optimal Bayesian classifiers via synthetic Monte Carlo simulations. We consider a discrete model with two classes and both 16 and 32 bins. Different values of the class prior probability c are considered. Furthermore, we assume that there is a true class-conditional probability vector for each class, namely, vectors p_true and q_true, from which sample points are drawn. As
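The Zipf model used for the synthetic experiments assigns bin probabilities that decay as a power of rank. A minimal generator sketch (the paper's exact parameterization may differ):

```python
def zipf_pmf(b, alpha=1.0):
    """Zipf-style bin probabilities p_i proportional to 1/i**alpha over
    b bins, normalized to sum to one (illustrative parameterization)."""
    w = [1.0 / (i ** alpha) for i in range(1, b + 1)]
    s = sum(w)
    return [x / s for x in w]
```

For instance, zipf_pmf(16) yields a strictly decreasing probability vector over 16 bins, concentrating mass in the low-index bins; larger alpha makes the decay steeper.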

Numerical experiments on real pathways

A major area of research in translational genomics involves classification of cell condition based on genetic activity, which in medicine corresponds to diagnosing the presence or type of disease. This requires designing expression-based classifiers based on genes whose product abundances indicate critical differences in cell state. For cancer diagnosis, classification can be between different kinds of cancer, different stages of tumor development, different prognoses, or other such

Conclusion

This study has shown that prior knowledge concerning the classes can be used to select training points for classifier design in a more efficient fashion than random sampling. The method has been described mathematically and its performance studied via Monte Carlo simulations on both synthetic and real-pathway generated data. We have observed that the proposed method shows more improvement as the difference in the amounts of uncertainty regarding the two classes increases and that performance

Conflict of interest

None declared.


References (36)

  • L. Devroye et al., A Probabilistic Theory of Pattern Recognition (1996)
  • R.O. Duda et al., Pattern Classification (2000)
  • J. Basu et al., The effects of intraclass correlation on certain significance tests when sampling from multivariate normal population, Commun. Stat.-Theory Methods (1974)
  • L.A. Dalton et al., Bayesian minimum mean-square error estimation for classification error. Part I: Definition and the Bayesian MMSE error estimator for discrete classification, IEEE Trans. Signal Process. (2011)
  • L.A. Dalton et al., Bayesian minimum mean-square error estimation for classification error. Part II: Linear classification of Gaussian models, IEEE Trans. Signal Process. (2011)
  • I.O. Ryzhov et al., The knowledge gradient algorithm for a general class of online learning problems, Oper. Res. (2012)
  • J.C. Gittins, Bandit processes and dynamic allocation indices, J. R. Stat. Soc. Ser. B (Methodological) (1979)
  • H.A. Simon, G. Lea, Problem solving and rule induction: a unified view, ...

    Ariana Broumand received the B.Sc. and M.Sc. degrees from the University of Tehran in 2009 and 2012, in electrical and biomedical (bioelectrical) engineering respectively. During his M.S. course he spent 8 months as a visiting researcher at University of Rostock, Germany. He is currently a Ph.D. student at Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX. His current research interests include genomic signal processing, and Bayesian statistics and bioinformatics.

    Mohammad Shahrokh Esfahani received the Ph.D. degree in electrical engineering from Texas A&M University, in 2014. He received the B.Sc. and M.Sc. degrees from the University of Tehran and Sharif University of Technology, respectively in 2007 and 2009, all in Electrical Engineering. He is currently a Postdoctoral Research Associate in the Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, TX. His current research interests include genomic signal processing, uncertainty quantification, and Bayesian statistics.

    Byung-Jun Yoon received the B.S.E. (summa cum laude) degree from the Seoul National University, Seoul, Korea, in 1998, and the M.S. and Ph.D. degrees from the California Institute of Technology, Pasadena, in 2002 and 2007, respectively, all in Electrical Engineering. In 2008, he joined the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, where he was an Assistant Professor during 2008–2014 and has been an Associate Professor since 2014. Recently, Dr. Yoon joined Hamad bin Khalifa University (HBKU), College of Science and Engineering (CSE), Doha, Qatar, as a founding faculty member, where he is currently an Associate Professor. His recent honors include the NSF CAREER Award and the Best Paper Award at the 9th Asia Pacific Bioinformatics Conference (APBC). His main research interests include genomic signal processing (GSP), bioinformatics, and computational network biology.

    Edward R. Dougherty is a Distinguished Professor in the Department of Electrical and Computer Engineering at Texas A&M University in College Station, TX, where he holds the Robert M. Kennedy '26 Chair in Electrical Engineering and is Scientific Director of the Center for Bioinformatics and Genomic Systems Engineering. He holds a Ph.D. in mathematics from Rutgers University and an M.S. in Computer Science from Stevens Institute of Technology, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology. He is a fellow of both IEEE and SPIE, has received the SPIE President's Award, and served as the editor of the SPIE/IS&T Journal of Electronic Imaging. At Texas A&M University he has received the Association of Former Students Distinguished Achievement Award in Research, and been named Fellow of the Texas Engineering Experiment Station and Halliburton Professor of the Dwight Look College of Engineering. Prof. Dougherty is the author of 16 books and more than 300 journal papers.
