Joint discriminative–generative modelling based on statistical tests for classification
Introduction
The objective of statistical pattern classification is to assign a new individual to one of several pre-specified classes (Venables and Ripley, 2002). Let $X = (X_1, \ldots, X_q)^T$ represent an individual with $q$ features and let $y$, a categorical variable, label the class of $X$. As the value of $y$ is unknown for a new individual $X$, we often assign $X$ to class $\hat{y}$ using the maximum a posteriori criterion: $\hat{y} = \arg\max_{y} p(y \mid X)$.
The parameter vector $\theta$ in the conditional distribution $p(y \mid X; \theta)$ is in general estimated from a training set of $n$ labelled, independent individuals $X_1, \ldots, X_n$ together with their labels $y_1, \ldots, y_n$. The estimation can be done by a generative or a discriminative approach.
Discriminative approaches estimate $\theta$ by maximisation of the conditional log-likelihood $\sum_{i=1}^{n} \log p(y_i \mid X_i; \theta)$. A typical discriminative approach is linear logistic regression (LLR) for two-class discrimination, in which the two classes have labels $y = 1$ and $y = 0$, say, and where $p(y = 1 \mid X; \theta) = 1/\{1 + \exp(-\beta_0 - \beta^T X)\}$. The decision boundary is given by the hyperplane $\beta_0 + \beta^T X = 0$. In this case, $\theta$ comprises the coefficients $\beta_0$ and $\beta$. In short, discriminative approaches aim to estimate the decision boundary directly. However, since no data-generating process (DGP) is assumed, they do not fully exploit the information provided by the joint distribution $p(X, y)$.
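As a concrete illustration of the discriminative route, the following sketch (hypothetical simulated data and parameter names, not the paper's own code) fits LLR by plain gradient ascent on the conditional log-likelihood:

```python
import numpy as np

# Hypothetical two-class training set: n individuals with q = 2 features,
# class 1 shifted by 1.5 in each coordinate.
rng = np.random.default_rng(0)
n, q = 200, 2
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, q)) + 1.5 * y[:, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta = (beta0, beta): gradient ascent on the conditional log-likelihood,
# whose gradient involves the residuals y_i - p(y_i = 1 | X_i; theta).
beta0, beta = 0.0, np.zeros(q)
lr = 0.1
for _ in range(2000):
    p = sigmoid(beta0 + X @ beta)      # p(y = 1 | X; theta)
    resid = y - p
    beta0 += lr * resid.mean()
    beta += lr * (X.T @ resid) / n

# Classify by thresholding p(y = 1 | X) at 1/2, i.e. by the sign of
# the linear score beta0 + beta^T x (the decision hyperplane).
yhat = (sigmoid(beta0 + X @ beta) >= 0.5).astype(int)
train_acc = (yhat == y).mean()
```

Note that nothing here assumes a DGP for $X$; only the form of $p(y \mid X; \theta)$ is modelled.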
Generative approaches estimate $\theta$ in a two-stage scheme. First, a DGP $p(X \mid y; \theta_y)$ and class prior probabilities $\pi_y$ are assumed, and the parameter vectors $\theta_y$ and $\pi_y$ are estimated by maximisation of the joint log-likelihood, $\sum_{i=1}^{n} \log p(X_i, y_i)$. Secondly, $p(y \mid X)$ and $\hat{y}$ can be obtained through Bayes' theorem: $p(y \mid X) \propto \pi_y\, p(X \mid y; \theta_y)$. A typical generative approach is normal-based linear or quadratic discriminant analysis (LDA or QDA), in which the DGP is assumed to correspond to a multivariate normal distribution for each class $y$. In this case, $\theta$ comprises the class means $\mu_y$ and covariance matrices $\Sigma_y$. In short, generative approaches aim to exploit fully the information contained in $p(X, y)$. However, their performance may be degraded if the DGP is wrongly specified.
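The two-stage generative scheme can be sketched as follows for the normal-based linear case (a minimal illustration on hypothetical simulated data; pooled-covariance LDA, with constants that cancel in the posterior dropped):

```python
import numpy as np

# Hypothetical data: two normal classes with a common covariance.
rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, 0.0)

# Stage 1: maximum-likelihood estimates of pi_y, mu_y and a pooled Sigma.
pis = np.array([(y == k).mean() for k in (0, 1)])
mus = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
Sigma = sum(np.cov(X[y == k].T, bias=True) * (y == k).sum()
            for k in (0, 1)) / n
Sinv = np.linalg.inv(Sigma)

def log_gauss(x, mu):
    # Log-density up to additive constants shared by both classes.
    d = x - mu
    return -0.5 * d @ Sinv @ d

# Stage 2: Bayes' theorem, p(y | x) proportional to pi_y * p(x | y).
def posterior(x):
    logp = np.array([np.log(pis[k]) + log_gauss(x, mus[k]) for k in (0, 1)])
    p = np.exp(logp - logp.max())      # stabilised exponentiation
    return p / p.sum()

yhat = np.array([posterior(x).argmax() for x in X])
train_acc = (yhat == y).mean()
```

With a common $\Sigma$ the resulting decision boundary is linear in $x$, matching LDA; allowing class-specific covariances would give the quadratic (QDA) boundary.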
In fact, both discriminative and generative approaches can be derived from factorisations of the joint distribution $p(X, y)$: the factorisation $p(X, y) = p(y \mid X)\, p(X)$ leads to discriminative approaches, which assume the form of the posterior probabilities $p(y \mid X)$ for classification; the other factorisation, $p(X, y) = p(X \mid y)\, p(y)$, leads to generative approaches, which assume the DGP $p(X \mid y)$.
The terms ‘generative (or informative) approach’ and ‘discriminative approach’ are the new terminology for what Dawid (1976) called respectively the ‘sampling paradigm’ and the ‘diagnostic paradigm’. A considerable number of comparisons of these two paradigms have been reported, including, among others, Efron (1975), O’Neill (1980), Titterington et al. (1981), Rubinstein and Hastie (1997), Ng and Jordan (2001) and Xue and Titterington (2008).
Given the advantages and disadvantages of using either a generative or a discriminative approach, in recent years many hybrid methods have been proposed in order to exploit the best of both worlds (Rubinstein, 1998; Raina et al., 2003; Bouchard and Triggs, 2004; McCallum et al., 2006; Bishop and Lasserre, 2007). Some comments about these methods can be found in Xue and Titterington (2009, 2010).
Since a generative classifier assumes a DGP whereas its discriminative counterpart does not, a general observation is as follows: a generative classifier performs better than its discriminative counterpart if the DGP is well-specified (but not necessarily perfectly-specified, given the bias-variance trade-off); the former performs worse than the latter if the DGP is clearly mis-specified.
Ng and Jordan (2001) presented some theoretical and empirical comparisons of LLR and the normal-based naïve Bayes classifier, which is a generative approach equivalent to LDA or QDA for conditionally independent feature variables. Their results suggested that, for the two approaches, there were two distinct regimes of relative classification performance with respect to the training-set size: the discriminative classifier performs better with larger training-sets whereas the generative classifier does better with smaller training-sets. We conjecture that such an important pattern may be explained to some extent by the following: for empirical data, any mis-specification of the DGP becomes more apparent as the amount of training data increases, and thus the discriminative classifier is favoured for larger training sets.
In this context, we present a new hybrid method, which we call a joint discriminative–generative modelling (JoDiG) approach to classification. The basic ideas of the approach are as follows. First, it uses a discriminative approach for a sub-vector $X_D$ of $X$, where $X_D$ contains the variables that clearly violate the assumptions underlying the proposed DGP; secondly, it uses a generative approach for the remaining variables $X_G$ of $X$; and, thirdly, these two approaches are combined in a probabilistic way, by factorisation of the joint distribution $p(X, y)$, rather than in an ad hoc way.
The assumptions underlying the JoDiG approach are twofold. First, it must be possible to test the DGP, partially if not fully. Secondly, $X_D$ and $X_G$ are assumed (block-wise) conditionally independent given the class $y$, such that $p(X \mid y) = p(X_D \mid y)\, p(X_G \mid y)$. Within $X_D$ or $X_G$, the individual variables are not necessarily assumed conditionally independent.
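Writing $X_D$ for the discriminatively-modelled block and $X_G$ for the rest, one natural reading of the probabilistic combination (a sketch under the block-wise conditional-independence assumption; the paper's exact estimators may differ in detail) is:

```latex
p(y \mid X) \;\propto\; \pi_y \, p(X_D, X_G \mid y)
            \;=\; \pi_y \, p(X_D \mid y)\, p(X_G \mid y)
            \;\propto\; p(y \mid X_D)\, p(X_G \mid y),
```

since $\pi_y\, p(X_D \mid y) = p(y \mid X_D)\, p(X_D)$ and $p(X_D)$ does not depend on $y$. The first factor can therefore be estimated discriminatively (e.g. by LLR on $X_D$) and the second generatively under the assumed DGP.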
In Section 2, we describe the motivation, algorithmic details, interpretation, computational complexity and extensions of the JoDiG approach. Then, in Section 3, we illustrate the JoDiG approach in a widely-used scenario and apply it to real and simulated data. Finally, in Section 4, we discuss closely-related work and other models that can also be derived by factorisation of $p(X, y)$.
Motivation
The JoDiG approach is motivated by the general observation that a generative classifier performs better than its discriminative counterpart when the DGP is well-specified and worse than the latter when the DGP is clearly mis-specified. It also pursues exactly a suggestion made but not developed in Rubinstein and Hastie (1997): “it is best to use an informative (generative) approach if confidence in the model correctness is high. This suggests a promising way of combining the two approaches:
Methodology
The principle of the JoDiG approach is quite generic, but for illustration the numerical studies in this paper focus on a widely-used case: the DGP assumes a multivariate normal distribution for each class, i.e., $X \mid y \sim N_q(\mu_y, \Sigma_y)$. Suppose there are only two classes, $y = 1$ (with prior $\pi_1$) and $y = 0$ (with prior $\pi_0$); for multi-class extensions, see Section 2.4. For linear classifiers, it is assumed that $\Sigma_1 = \Sigma_0 = \Sigma$.
The (multivariate) normal distribution is the most widely-assumed DGP for statistical classification, as the
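To make the methodology concrete, here is a minimal end-to-end sketch in the normal-DGP case (hypothetical data and helper names; the paper's actual tests and estimators may differ). Per-class Shapiro–Wilk tests partition the features into $X_D$ (normality clearly rejected) and $X_G$; $X_D$ gets a discriminative LLR fit, $X_G$ a class-wise normal fit, and the parts are combined as $p(y \mid X) \propto p(y \mid X_D)\, p(X_G \mid y)$:

```python
import numpy as np
from scipy import stats

# Hypothetical data: two normal features (well-specified DGP) plus one
# log-normal feature that violates the normality assumption.
rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, size=n)
x_norm = rng.normal(size=(n, 2)) + 1.5 * y[:, None]
x_bad = np.exp(rng.normal(size=(n, 1))) + y[:, None]
X = np.hstack([x_norm, x_bad])

# Partition by statistical tests: a feature joins X_D if normality is
# clearly rejected within either class.
alpha = 1e-3
D_idx = [j for j in range(X.shape[1])
         if min(stats.shapiro(X[y == k, j])[1] for k in (0, 1)) < alpha]
G_idx = [j for j in range(X.shape[1]) if j not in D_idx]
XD, XG = X[:, D_idx], X[:, G_idx]

# Discriminative part: LLR on X_D by gradient ascent.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))
b0, b = 0.0, np.zeros(XD.shape[1])
for _ in range(3000):
    r = y - sigmoid(b0 + XD @ b)
    b0 += 0.1 * r.mean()
    b += 0.1 * (XD.T @ r) / n

# Generative part: class-wise normal fit on X_G.
mus = [XG[y == k].mean(axis=0) for k in (0, 1)]
covs = [np.atleast_2d(np.cov(XG[y == k].T, bias=True)) for k in (0, 1)]

def log_gauss(x, mu, cov):
    d = np.atleast_1d(x - mu)
    return -0.5 * (np.log(np.linalg.det(cov)) + d @ np.linalg.solve(cov, d))

# Combination: p(y | X) proportional to p(y | X_D) * p(X_G | y).
def predict(xd, xg):
    p1 = sigmoid(b0 + xd @ b)
    logp = (np.log([1.0 - p1, p1])
            + np.array([log_gauss(xg, mus[k], covs[k]) for k in (0, 1)]))
    return int(np.argmax(logp))

acc = np.mean([predict(XD[i], XG[i]) == y[i] for i in range(n)])
```

In this toy run the log-normal feature is routed to the discriminative block, so the normal DGP is only ever imposed on features it plausibly fits.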
Closely-related work
Kang and Tian (2006) proposed a hybrid generative/discriminative Bayesian (HBayes) classifier, which is closely related to our JoDiG approach.
The HBayes method constructs an iterative, heuristic partition of $X$. It starts with an empty $X_D$ (i.e., $X_G = X$). Then in the $t$-th iteration it moves a single variable from $X_G$ into $X_D$ such that moving this variable produces the greatest improvement. The procedure is continued until no such variable can be found. Therefore, in order to select
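The greedy loop just described can be paraphrased schematically as follows (a sketch with hypothetical helper names; `score` stands in for whatever fit-and-evaluate routine the hybrid classifier uses, and is not part of HBayes itself):

```python
def hbayes_partition(features, score):
    """Greedily move one variable per iteration from X_G into X_D.

    `score(D)` evaluates the hybrid classifier when the variables in the
    set D are modelled discriminatively; higher is better.  Starts from
    an empty X_D (i.e. X_G = X) and stops when no single move improves
    the score.
    """
    D, G = set(), set(features)
    best = score(D)
    while G:
        # Try each remaining variable; keep only the single best move.
        cand = max(G, key=lambda v: score(D | {v}))
        if score(D | {cand}) <= best:
            break                 # no variable yields an improvement
        D.add(cand)
        G.remove(cand)
        best = score(D)
    return D, G
```

Each iteration therefore costs one model evaluation per remaining variable, which is what makes the procedure heuristic rather than an exhaustive search over the $2^q$ possible partitions.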
Conclusions
Based on a general observation that a generative classifier performs better than its discriminative counterpart if the DGP is well-specified and worse than the latter if the DGP is clearly mis-specified, this paper has presented a JoDiG approach. The approach partitions variables into two sub-vectors based on statistical tests of the assumed DGP, then uses a discriminative approach for the variables which clearly failed the tests and a generative approach for the other variables, and finally
Acknowledgments
The authors thank the reviewers, the associate editor and Professor David J. Hand for their suggestions that have enhanced the comprehensiveness, structure and composition of the manuscript. This work was partly supported by the award of a Hutchison Whampoa-EPSRC Dorothy Hodgkin Postgraduate Award to J.H.X. It also benefited from our participation in the Research Programme on ‘Statistical Theory and Methods for Complex, High-Dimensional Data’ at the Isaac Newton Institute for Mathematical Sciences.
References
- Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. University of California, School of Information and...
- Bishop, C.M., Lasserre, J., 2007. Generative or discriminative? Getting the best of both worlds (with discussion).
- Bouchard, G., Triggs, B., 2004. The tradeoff between generative and discriminative classifiers.
- Dawid, A.P., 1976. Properties of diagnostic data distributions. Biometrics.
- Efron, B., 1975. The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc.
- Friedman, J.H., 1996. Another Approach to Polychotomous Classification. Tech. Rep., Stanford...
- Hand, D.J., Yu, K., 2001. Idiot’s Bayes – not so stupid after all? Internat. Statist. Rev.
- Hastie, T., Tibshirani, R., 1996. Discriminant analysis by Gaussian mixtures. J. Roy. Statist. Soc. Ser. B.
- Hastie, T., Tibshirani, R., 1998. Classification by pairwise coupling. Ann. Statist.
- Kang, C., Tian, J., 2006. A hybrid generative/discriminative Bayesian classifier.
- Xue, J.-H., Titterington, D.M., 2009. Interpretation of hybrid generative/discriminative algorithms. Neurocomputing.
- Xue, J.-H., Titterington, D.M., 2010. On the generative–discriminative tradeoff approach: interpretation, asymptotic efficiency and classification performance. Comput. Statist. Data Anal.