Joint discriminative–generative modelling based on statistical tests for classification
Introduction
The objective of statistical pattern classification is to assign a new individual to one of several pre-specified classes (Venables and Ripley, 2002). Let $X = (X_1, \ldots, X_q)^T$ represent an individual with $q$ features and let $y$, a categorical variable, label the class of $X$. As the value of $y$ is unknown for a new individual $X$, we often assign $X$ to class $\hat{y}$ using the maximum a posteriori criterion: $\hat{y} = \arg\max_{y} p(y \mid X)$.
The parameter vector $\theta$ in the conditional distribution $p(y \mid X; \theta)$ is in general estimated from a training set of $n$ labelled, independent individuals $X_1, \ldots, X_n$ together with their labels $y_1, \ldots, y_n$. The estimation can be done by a generative or a discriminative approach.
Discriminative approaches estimate $\theta$ by maximisation of the conditional log-likelihood $\sum_{i=1}^{n} \log p(y_i \mid X_i; \theta)$. A typical discriminative approach is linear logistic regression (LLR) for two-class discrimination, in which the two classes have labels $y = 1$ and $y = 0$, say, and where $p(y = 1 \mid X; \theta) = 1/\{1 + \exp(-\beta_0 - \beta^T X)\}$. The decision boundary is given by the hyperplane $\beta_0 + \beta^T X = 0$. In this case, $\theta$ comprises the coefficients $\beta_0$ and $\beta$. In short, discriminative approaches aim to estimate the decision boundary directly. However, since no data-generating process (DGP) is assumed, they do not fully exploit the information provided by the joint distribution $p(X, y)$.
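As a concrete illustration of the discriminative route, the following sketch (hypothetical simulated data and parameter names, not the paper's own code) fits LLR by plain gradient ascent on the conditional log-likelihood:

```python
import numpy as np

# Hypothetical two-class training set: n individuals with q = 2 features,
# class 1 shifted by 1.5 in each coordinate.
rng = np.random.default_rng(0)
n, q = 200, 2
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, q)) + 1.5 * y[:, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# theta = (beta0, beta): gradient ascent on the conditional log-likelihood,
# whose gradient involves the residuals y_i - p(y_i = 1 | X_i; theta).
beta0, beta = 0.0, np.zeros(q)
lr = 0.1
for _ in range(2000):
    p = sigmoid(beta0 + X @ beta)      # p(y = 1 | X; theta)
    resid = y - p
    beta0 += lr * resid.mean()
    beta += lr * (X.T @ resid) / n

# Classify by thresholding p(y = 1 | X) at 1/2, i.e. by the sign of
# the linear score beta0 + beta^T x (the decision hyperplane).
yhat = (sigmoid(beta0 + X @ beta) >= 0.5).astype(int)
train_acc = (yhat == y).mean()
```

Note that nothing here assumes a DGP for $X$; only the form of $p(y \mid X; \theta)$ is modelled.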
Generative approaches estimate $\theta$ in a two-stage scheme. First, a DGP $p(X \mid y; \theta_y)$ and class prior probabilities $\pi_y$ are assumed, and the parameter vectors $\theta_y$ and $\pi_y$ are estimated by maximisation of the joint log-likelihood, $\sum_{i=1}^{n} \log p(X_i, y_i)$. Secondly, $p(y \mid X)$ and $\hat{y}$ can be obtained through Bayes' theorem: $p(y \mid X) \propto \pi_y\, p(X \mid y; \theta_y)$. A typical generative approach is normal-based linear or quadratic discriminant analysis (LDA or QDA), in which the DGP is assumed to correspond to a multivariate normal distribution for each class $y$. In this case, $\theta$ comprises the class means $\mu_y$ and covariance matrices $\Sigma_y$. In short, generative approaches aim to exploit fully the information contained in $p(X, y)$. However, their performance may be degraded if the DGP is wrongly specified.
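The two-stage generative scheme can be sketched as follows for the normal-based linear case (a minimal illustration on hypothetical simulated data; pooled-covariance LDA, with constants that cancel in the posterior dropped):

```python
import numpy as np

# Hypothetical data: two normal classes with a common covariance.
rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, 0.0)

# Stage 1: maximum-likelihood estimates of pi_y, mu_y and a pooled Sigma.
pis = np.array([(y == k).mean() for k in (0, 1)])
mus = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
Sigma = sum(np.cov(X[y == k].T, bias=True) * (y == k).sum()
            for k in (0, 1)) / n
Sinv = np.linalg.inv(Sigma)

def log_gauss(x, mu):
    # Log-density up to additive constants shared by both classes.
    d = x - mu
    return -0.5 * d @ Sinv @ d

# Stage 2: Bayes' theorem, p(y | x) proportional to pi_y * p(x | y).
def posterior(x):
    logp = np.array([np.log(pis[k]) + log_gauss(x, mus[k]) for k in (0, 1)])
    p = np.exp(logp - logp.max())      # stabilised exponentiation
    return p / p.sum()

yhat = np.array([posterior(x).argmax() for x in X])
train_acc = (yhat == y).mean()
```

With a common $\Sigma$ the resulting decision boundary is linear in $x$, matching LDA; allowing class-specific covariances would give the quadratic (QDA) boundary.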
In fact, both discriminative and generative approaches can be derived from factorisations of the joint distribution $p(X, y)$: the factorisation $p(X, y) = p(y \mid X)\, p(X)$ leads to discriminative approaches, which assume the form of the posterior probabilities $p(y \mid X)$ for classification; the other factorisation, $p(X, y) = p(X \mid y)\, p(y)$, leads to generative approaches, which assume the DGP $p(X \mid y)$.
The terms ‘generative (or informative) approach’ and ‘discriminative approach’ are the new terminology for what Dawid (1976) called respectively the ‘sampling paradigm’ and the ‘diagnostic paradigm’. A considerable number of comparisons of these two paradigms have been reported, including, among others, Efron (1975), O’Neill (1980), Titterington et al. (1981), Rubinstein and Hastie (1997), Ng and Jordan (2001) and Xue and Titterington (2008).
Given the advantages and disadvantages of using either a generative or a discriminative approach, in recent years many hybrid methods have been proposed in order to exploit the best of both worlds (Rubinstein, 1998; Raina et al., 2003; Bouchard and Triggs, 2004; McCallum et al., 2006; Bishop and Lasserre, 2007). Some comments about these methods can be found in Xue and Titterington (2009, 2010).
Since a generative classifier assumes a DGP whereas its discriminative counterpart does not, a general observation is as follows: a generative classifier performs better than its discriminative counterpart if the DGP is well-specified (but not necessarily perfectly-specified, given the bias-variance trade-off); the former performs worse than the latter if the DGP is clearly mis-specified.
Ng and Jordan (2001) presented some theoretical and empirical comparisons of LLR and the normal-based naïve Bayes classifier, which is a generative approach equivalent to LDA or QDA for conditionally independent feature variables. Their results suggested that, for the two approaches, there were two distinct regimes of relative classification performance with respect to the training-set size: the discriminative classifier performs better with larger training-sets whereas the generative classifier does better with smaller training-sets. We conjecture that such an important pattern may be explained to some extent by the following: for empirical data, any mis-specification of the DGP becomes more apparent as the amount of training data increases, and thus the discriminative classifier is favoured for larger training sets.
In this context, we present a new hybrid method, which we call a joint discriminative–generative modelling (JoDiG) approach to classification. The basic ideas of the approach are as follows. First, it uses a discriminative approach for a sub-vector $X_D$ of $X$, where $X_D$ contains the variables that clearly violate the assumptions underlying the proposed DGP; secondly, it uses a generative approach for the remaining variables $X_G$ of $X$; and, thirdly, these two approaches are combined in a probabilistic way, by factorisation of the joint distribution $p(X, y)$, rather than in an ad hoc way.
The assumptions underlying the JoDiG approach are twofold. First, it must be possible to test the DGP, partially if not fully. Secondly, $X_D$ and $X_G$ are assumed (block-wise) conditionally independent given the class $y$, such that $p(X \mid y) = p(X_D \mid y)\, p(X_G \mid y)$. Within $X_D$ or $X_G$, the individual variables are not necessarily assumed conditionally independent.
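Writing $X_D$ for the discriminatively-modelled block and $X_G$ for the rest, one natural reading of the probabilistic combination (a sketch under the block-wise conditional-independence assumption; the paper's exact estimators may differ in detail) is:

```latex
p(y \mid X) \;\propto\; \pi_y \, p(X_D, X_G \mid y)
            \;=\; \pi_y \, p(X_D \mid y)\, p(X_G \mid y)
            \;\propto\; p(y \mid X_D)\, p(X_G \mid y),
```

since $\pi_y\, p(X_D \mid y) = p(y \mid X_D)\, p(X_D)$ and $p(X_D)$ does not depend on $y$. The first factor can therefore be estimated discriminatively (e.g. by LLR on $X_D$) and the second generatively under the assumed DGP.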
In Section 2, we describe the motivation, algorithmic details, interpretation, computational complexity and extensions of the JoDiG approach. Then, in Section 3, we illustrate the JoDiG approach in a widely-used scenario and apply it to real and simulated data. Finally, in Section 4, we discuss closely-related work and other models that can also be derived by factorisation of $p(X, y)$.
Motivation
The JoDiG approach is motivated by the general observation that a generative classifier performs better than its discriminative counterpart when the DGP is well-specified and worse than the latter when the DGP is clearly mis-specified. It also pursues exactly a suggestion made but not developed in Rubinstein and Hastie (1997): “it is best to use an informative (generative) approach if confidence in the model correctness is high. This suggests a promising way of combining the two approaches:
Methodology
The principle of the JoDiG approach is quite generic, but for illustration the numerical studies in this paper focus on a widely-used case: the DGP assumes a multivariate normal distribution for each class, i.e., $X \mid y \sim N_q(\mu_y, \Sigma_y)$. Suppose there are only two classes, $y = 1$ (with prior $\pi_1$) and $y = 0$ (with prior $\pi_0$); for multi-class extensions, see Section 2.4. For linear classifiers, it is assumed that $\Sigma_1 = \Sigma_0 = \Sigma$.
The (multivariate) normal distribution is the most widely-assumed DGP for statistical classification, as the
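To make the methodology concrete, here is a minimal end-to-end sketch in the normal-DGP case (hypothetical data and helper names; the paper's actual tests and estimators may differ). Per-class Shapiro–Wilk tests partition the features into $X_D$ (normality clearly rejected) and $X_G$; $X_D$ gets a discriminative LLR fit, $X_G$ a class-wise normal fit, and the parts are combined as $p(y \mid X) \propto p(y \mid X_D)\, p(X_G \mid y)$:

```python
import numpy as np
from scipy import stats

# Hypothetical data: two normal features (well-specified DGP) plus one
# log-normal feature that violates the normality assumption.
rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, size=n)
x_norm = rng.normal(size=(n, 2)) + 1.5 * y[:, None]
x_bad = np.exp(rng.normal(size=(n, 1))) + y[:, None]
X = np.hstack([x_norm, x_bad])

# Partition by statistical tests: a feature joins X_D if normality is
# clearly rejected within either class.
alpha = 1e-3
D_idx = [j for j in range(X.shape[1])
         if min(stats.shapiro(X[y == k, j])[1] for k in (0, 1)) < alpha]
G_idx = [j for j in range(X.shape[1]) if j not in D_idx]
XD, XG = X[:, D_idx], X[:, G_idx]

# Discriminative part: LLR on X_D by gradient ascent.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))
b0, b = 0.0, np.zeros(XD.shape[1])
for _ in range(3000):
    r = y - sigmoid(b0 + XD @ b)
    b0 += 0.1 * r.mean()
    b += 0.1 * (XD.T @ r) / n

# Generative part: class-wise normal fit on X_G.
mus = [XG[y == k].mean(axis=0) for k in (0, 1)]
covs = [np.atleast_2d(np.cov(XG[y == k].T, bias=True)) for k in (0, 1)]

def log_gauss(x, mu, cov):
    d = np.atleast_1d(x - mu)
    return -0.5 * (np.log(np.linalg.det(cov)) + d @ np.linalg.solve(cov, d))

# Combination: p(y | X) proportional to p(y | X_D) * p(X_G | y).
def predict(xd, xg):
    p1 = sigmoid(b0 + xd @ b)
    logp = (np.log([1.0 - p1, p1])
            + np.array([log_gauss(xg, mus[k], covs[k]) for k in (0, 1)]))
    return int(np.argmax(logp))

acc = np.mean([predict(XD[i], XG[i]) == y[i] for i in range(n)])
```

In this toy run the log-normal feature is routed to the discriminative block, so the normal DGP is only ever imposed on features it plausibly fits.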
Closely-related work
Kang and Tian (2006) proposed a hybrid generative/discriminative Bayesian (HBayes) classifier, which is closely related to our JoDiG approach.
The HBayes method constructs an iterative, heuristic partition of $X$. It starts with an empty $X_D$ (i.e., $X_G = X$). Then in the $t$-th iteration it moves a single variable from $X_G$ into $X_D$ such that moving this variable produces the greatest improvement. The procedure is continued until no such variable can be found. Therefore, in order to select
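The greedy loop just described can be paraphrased schematically as follows (a sketch with hypothetical helper names; `score` stands in for whatever fit-and-evaluate routine the hybrid classifier uses, and is not part of HBayes itself):

```python
def hbayes_partition(features, score):
    """Greedily move one variable per iteration from X_G into X_D.

    `score(D)` evaluates the hybrid classifier when the variables in the
    set D are modelled discriminatively; higher is better.  Starts from
    an empty X_D (i.e. X_G = X) and stops when no single move improves
    the score.
    """
    D, G = set(), set(features)
    best = score(D)
    while G:
        # Try each remaining variable; keep only the single best move.
        cand = max(G, key=lambda v: score(D | {v}))
        if score(D | {cand}) <= best:
            break                 # no variable yields an improvement
        D.add(cand)
        G.remove(cand)
        best = score(D)
    return D, G
```

Each iteration therefore costs one model evaluation per remaining variable, which is what makes the procedure heuristic rather than an exhaustive search over the $2^q$ possible partitions.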
Conclusions
Based on a general observation that a generative classifier performs better than its discriminative counterpart if the DGP is well-specified and worse than the latter if the DGP is clearly mis-specified, this paper has presented a JoDiG approach. The approach partitions variables into two sub-vectors based on statistical tests of the assumed DGP, then uses a discriminative approach for the variables which clearly failed the tests and a generative approach for the other variables, and finally
Acknowledgments
The authors thank the reviewers, the associate editor and Professor David J. Hand for their suggestions that have enhanced the comprehensiveness, structure and composition of the manuscript. This work was partly supported by the award of a Hutchison Whampoa-EPSRC Dorothy Hodgkin Postgraduate Award to J.H.X. It also benefited from our participation in the Research Programme on ‘Statistical Theory and Methods for Complex, High-Dimensional Data’ at the Isaac Newton Institute for Mathematical Sciences.
References
- Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. University of California, School of Information and...
- Bishop, C.M., Lasserre, J., 2007. Generative or discriminative? Getting the best of both worlds (with discussion).
- Bouchard, G., Triggs, B., 2004. The tradeoff between generative and discriminative classifiers.
- Dawid, A.P., 1976. Properties of diagnostic data distributions. Biometrics.
- Efron, B., 1975. The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc.
- Friedman, J.H., 1996. Another Approach to Polychotomous Classification. Tech. Rep., Stanford...
- Hand, D.J., Yu, K., 2001. Idiot’s Bayes – not so stupid after all? Internat. Statist. Rev.
- Hastie, T., Tibshirani, R., 1996. Discriminant analysis by Gaussian mixtures. J. Roy. Statist. Soc. Ser. B.
- Hastie, T., Tibshirani, R., 1998. Classification by pairwise coupling. Ann. Statist.
- Kang, C., Tian, J., 2006. A hybrid generative/discriminative Bayesian classifier.
- Xue, J.-H., Titterington, D.M., 2009. Interpretation of hybrid generative/discriminative algorithms. Neurocomputing.
- Xue, J.-H., Titterington, D.M., 2010. On the generative–discriminative tradeoff approach: interpretation, asymptotic efficiency and classification performance. Comput. Statist. Data Anal.