On the generative–discriminative tradeoff approach: Interpretation, asymptotic efficiency and classification performance

https://doi.org/10.1016/j.csda.2009.09.011

Abstract

The interpretation of generative, discriminative and hybrid approaches to classification is discussed, in particular for the generative–discriminative tradeoff (GDT), a hybrid approach. The asymptotic efficiency of the GDT, relative to that of its generative or discriminative counterpart, is presented theoretically and, by using linear normal discrimination as an example, numerically. On real and simulated datasets, the classification performance of the GDT is compared with those of normal-based linear discriminant analysis (LDA) and linear logistic regression (LLR). Four arguments are made as follows. First, the GDT is a generative model integrating both discriminative and generative learning. It is therefore subject to model misspecification of the data-generating process and hindered by complex optimisation. Secondly, among the three approaches being compared, the asymptotic efficiency of the GDT is higher than that of the discriminative approach but lower than that of the generative approach, when no model misspecification occurs. Thirdly, without model misspecification, LDA performs the best; with model misspecification, LLR or the GDT with an optimal, large weight on its discriminative component may perform the best. Finally, LLR is affected by the imbalance between groups of data.

Introduction

In discriminant analysis, individuals with features x are classified into groups labelled by a categorical variable y. The most commonly adopted discriminant rule is the maximum a posteriori criterion: for a given individual x, the allocated group is $\hat{y}=\arg\max_{y} p(y|x,\alpha)$, where x is in general a p-variate random vector and α denotes a column vector of the parameters of the conditional distribution p(y|x). In practice, α is unknown but can be estimated from a training set of n labelled individuals $(x_{1:n},y_{1:n})=\{(x_i,y_i)\}_{i=1}^{n}$.

Dawid (1976) divided the statistical modelling and learning (or parameter estimation) approaches to discrimination into two paradigms, namely, the sampling paradigm and the diagnostic paradigm. In recent years, these have re-emerged in the machine learning community under the new terminology of generative (or informative) and discriminative approaches, respectively (Rubinstein and Hastie, 1997, Ng and Jordan, 2001, Raina et al., 2003, Bouchard and Triggs, 2004, McCallum et al., 2006, Bishop and Lasserre, 2007, Bouchard, 2007).

The discriminative approaches (or the approaches corresponding to the diagnostic paradigm) model $p(y_{1:n}|x_{1:n},\alpha)$, without modelling the so-called data-generating process $p(x|y,\theta_g)$, where $\theta_g$ is the parameter vector of p(x|y). Then α is estimated through maximisation of the conditional likelihood, i.e., $\hat{\alpha}=\arg\max_{\alpha} p(y_{1:n}|x_{1:n},\alpha)$, which is in practice further simplified by assuming a conditional independence structure such that $p(y_{1:n}|x_{1:n},\alpha)=\prod_{i=1}^{n}p(y_i|x_i,\alpha)$. Thus only p(y|x,α) needs to be modelled. Hereafter, we refer to such a model and learning procedure as a discriminative model and discriminative learning, respectively. A typical discriminative classifier is logistic regression.
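As a concrete illustration (not the paper's code), here is a minimal Python sketch of discriminative learning for a linear logistic model, maximising the conditional log-likelihood by gradient ascent; the function names, step size and iteration count are hypothetical choices.

```python
import numpy as np

def fit_llr(X, y, n_iter=5000, lr=0.1):
    """Discriminative learning: maximise sum_i log p(y_i | x_i, alpha)
    for the linear logistic model p(y=1 | x) = sigmoid(alpha_0 + alpha^T x)."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])          # prepend an intercept column
    alpha = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Xb @ alpha))  # p(y=1 | x, alpha)
        grad = Xb.T @ (y - prob) / n              # gradient of the conditional log-likelihood
        alpha += lr * grad
    return alpha

def predict_llr(X, alpha):
    """Maximum a posteriori rule: allocate to C1 when p(C1 | x) > 1/2."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ alpha > 0).astype(int)
```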

The generative approaches (or the approaches corresponding to the sampling paradigm) model $p(y_{1:n}|\pi)$ and $p(x_{1:n}|y_{1:n},\theta_g)$, where π is the parameter vector of p(y). Then, in general, $\theta=(\pi^{T},\theta_g^{T})^{T}$ is estimated through maximum likelihood, i.e., $\hat{\theta}=\arg\max_{\theta} p(x_{1:n},y_{1:n}|\theta)$, which is in practice further simplified by assuming that $p(x_{1:n},y_{1:n}|\theta)=\prod_{i=1}^{n}p(x_i,y_i|\theta)$. Thus only $p(y|\pi)$ and $p(x|y,\theta_g)$ need to be modelled. Hereafter, we refer to such a model and learning procedure as a generative model and generative learning, respectively. Typical generative classifiers include normal-based discriminant analysis and the naïve Bayes classifier.
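For comparison, a similarly minimal sketch of generative learning for normal-based LDA with a common covariance matrix: the maximum-likelihood estimates of the proportions, group means and pooled covariance are closed-form, and Bayes' theorem then yields the linear discriminant function used for allocation. The names are again illustrative only.

```python
import numpy as np

def fit_lda(X, y):
    """Generative learning: estimate theta = (pi_1, mu_0, mu_1, Sigma) by maximum
    likelihood under x | y=k ~ N(mu_k, Sigma) with a common covariance Sigma."""
    X0, X1 = X[y == 0], X[y == 1]
    pi1 = X1.shape[0] / X.shape[0]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled within-group covariance: the MLE under the common-covariance model
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / X.shape[0]
    return pi1, mu0, mu1, S

def predict_lda(X, pi1, mu0, mu1, S):
    """Allocate by the sign of g(x, alpha(theta)) = log{p(C1|x)/p(C0|x)},
    which is linear in x under the common-covariance normal model."""
    Sinv = np.linalg.inv(S)
    w = Sinv @ (mu1 - mu0)
    b = np.log(pi1 / (1 - pi1)) - 0.5 * (mu1 + mu0) @ Sinv @ (mu1 - mu0)
    return (X @ w + b > 0).astype(int)
```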

As concisely characterised by Rubinstein and Hastie (1997), the generative classifiers learn the group densities, while the discriminative classifiers learn the group boundaries (i.e., p(y|x,α) in our setting) without regard to the underlying group densities.

From Bayes’ Theorem, which substitutes $p(y|\pi)p(x|y,\theta_g)/\{\sum_{y}p(y|\pi)p(x|y,\theta_g)\}$ for p(y|x,α), two observations can be made. First, there is a mapping α(θ) from θ to α such that the generative approaches can lead to $\hat{\alpha}$, and thereby provide working classifiers for discrimination. Secondly, the generative model is more informative than the corresponding discriminative model, and thus discriminative learning techniques can be used with a generative model. The first observation is a basic characteristic of classical generative classifiers, and the second has led to increasing research interest recently (Rubinstein, 1998, Raina et al., 2003, Bouchard and Triggs, 2004, McCallum et al., 2006).
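A concrete instance of the mapping α(θ): assuming two normal groups with means $\mu_0,\mu_1$ and a common covariance matrix Σ (the LDA setting discussed later), the posterior log-ratio is linear in x, so $\theta=(\pi_1,\mu_0,\mu_1,\Sigma)$ determines the logistic parameters directly:

```latex
g(x,\alpha) = \log\frac{p(C_1\mid x)}{p(C_0\mid x)}
            = \underbrace{\log\frac{\pi_1}{\pi_0}
              - \tfrac{1}{2}(\mu_1+\mu_0)^{T}\Sigma^{-1}(\mu_1-\mu_0)}_{\alpha_0(\theta)}
            \;+\; x^{T}\underbrace{\Sigma^{-1}(\mu_1-\mu_0)}_{\alpha_1(\theta)}.
```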

For the generative classifiers, although maximum likelihood based on $p(x,y|\theta)$ will lead to an asymptotically unbiased and efficient estimator $\hat{\theta}$ and consequently $\hat{\alpha}$, it can only be justified if p(x,y) is correctly specified. Similarly, for the discriminative classifiers, although maximum likelihood based on p(y|x,α) will lead to an asymptotically unbiased and efficient estimator $\hat{\alpha}$, it can only be justified if p(y|x) or, for example in the case of two groups $C_1$ and $C_0$, the corresponding discriminant function, $g(x,\alpha)=\log\{p(C_1|x)/p(C_0|x)\}$, is correctly specified. Different $p(x,y|\theta)$'s may lead to the same discriminant function g(x,α), which indicates that the discriminative classifiers may be less sensitive than the generative classifiers to misspecification of $p(x,y|\theta)$.
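A short illustration of why different joint models can share one discriminant function: if both group densities belong to the same exponential family with natural parameters $\eta_0,\eta_1$ and sufficient statistic t(x) (an assumption introduced here purely for illustration, echoing the exponential family remark below), then

```latex
p(x\mid C_k) = h(x)\exp\{\eta_k^{T}t(x) - A(\eta_k)\},\quad k=0,1,
\qquad\Longrightarrow\qquad
g(x,\alpha) = \log\frac{\pi_1}{\pi_0} + A(\eta_0) - A(\eta_1) + (\eta_1-\eta_0)^{T}t(x),
```

which is linear in t(x) whatever the particular family, so the same logistic form of g(x,α) is compatible with many distinct choices of $p(x,y|\theta)$.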

In practice, commonly used discriminative and generative classifiers are linear logistic regression (LLR) and normal-based linear discriminant analysis (LDA), respectively. Numerous theoretical, simulation-based and empirical comparisons between these two approaches have been reported; see Efron (1975), Titterington et al. (1981) and Ng and Jordan (2001) for example. In general, the performance of such approaches depends on the correctness of the modelling, the bias, efficiency and consistency of the learning, and the reliability of the training data. For instance, when the modelling of $p(y|\pi)$ and $p(x|y,\theta_g)$ is correct, LDA can be more efficient than LLR (Efron, 1975). However, LLR can perform better than LDA when x|y is not normally distributed, because LLR does not require $p(x|y,\theta_g)$ to be Gaussian; for instance, the modelling of LLR remains valid under general exponential family assumptions on $p(x|y,\theta_g)$ (Efron, 1975).

In order to exploit the best of both worlds, many interesting proposals have emerged for combining the generative and discriminative approaches, such as the mixed discriminants (Rubinstein, 1998), the hybrid generative–discriminative models (Raina et al., 2003, Fujino et al., 2007), the mixed log-likelihood (or the generative–discriminative tradeoff) (Rubinstein, 1998, Bouchard and Triggs, 2004), multi-conditional learning (McCallum et al., 2006) and a Bayesian blending (Bishop and Lasserre, 2007). Since the generative approaches can model unlabelled individuals whereas the discriminative approaches cannot, some of the above generative–discriminative combinations have been applied to semi-supervised learning scenarios (Suzuki et al., 2007, Druck et al., 2007, Bishop and Lasserre, 2007, Bouchard, 2007).

In the remaining sections of this paper, we first briefly discuss the above approaches and then focus on the generative–discriminative tradeoff approach (GDT). Regarding the GDT, we first present its interpretation, and then investigate its asymptotic efficiency relative to that of its generative or discriminative counterpart theoretically and, by using linear normal discrimination as an example, numerically, when there is no model misspecification. Finally we compare the classification performance of LDA, LLR and the GDT, using real and simulated datasets.

Section snippets

Methodologies

This paper will focus on two-group discriminant analysis, where y is a binary variable. Suppose that a population C contains two groups $C_1$ (with y=1) and $C_0$ (with y=0), with respective proportions $\pi_1$ and $\pi_0=1-\pi_1$; the existence of these two groups requires $\pi_1\in(0,1)$, an open interval. In addition, the training set $\{(x_i,y_i)\}_{i=1}^{n}$ contains n labelled individuals, collected randomly and independently from C.

In the sense of minimum classification error rate, an optimal discriminant function

Generative–discriminative tradeoff (GDT)

The intuition behind the GDT is to construct a new log-likelihood as a weighted average of the log-likelihoods $\ell_g(\theta)$ for generative learning and $\ell_d(\alpha)$ for discriminative learning. In order to couple the two separate estimations of θ and α, $\ell_d(\alpha)$ is represented by $\ell_{y|x}(\theta)$ through the use of the mapping α(θ). The reasons for using $\ell_d(\alpha(\theta))$, rather than $\ell_g(\theta(\alpha))$, include the following: p(y|x) can be derived from p(x,y) but not vice versa; the dimension of θ is larger than that of α, as with LDA.
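A minimal sketch, assuming the two-group normal model with a common diagonal covariance used later in the experiments, of the blended GDT log-likelihood $\ell_\lambda(\theta)=\lambda\ell_g(\theta)+(1-\lambda)\ell_{y|x}(\theta)$ and its numerical maximisation; the parameterisation and the use of a generic BFGS optimiser are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def unpack(theta, p):
    """theta packs (logit of pi_1, mu_0, mu_1, log of the diagonal variances)."""
    pi1 = 1.0 / (1.0 + np.exp(-theta[0]))
    mu0, mu1 = theta[1:1 + p], theta[1 + p:1 + 2 * p]
    var = np.exp(theta[1 + 2 * p:])
    return pi1, mu0, mu1, var

def neg_gdt_loglik(theta, X, y, lam):
    """Negative blended log-likelihood -(lam * l_g(theta) + (1 - lam) * l_{y|x}(theta))."""
    p = X.shape[1]
    pi1, mu0, mu1, var = unpack(theta, p)
    logf0 = norm.logpdf(X, mu0, np.sqrt(var)).sum(axis=1)      # log p(x | C0, theta)
    logf1 = norm.logpdf(X, mu1, np.sqrt(var)).sum(axis=1)      # log p(x | C1, theta)
    log_joint0 = np.log(1.0 - pi1) + logf0                     # log p(x, C0 | theta)
    log_joint1 = np.log(pi1) + logf1                           # log p(x, C1 | theta)
    l_g = np.where(y == 1, log_joint1, log_joint0).sum()       # generative log-likelihood
    log_norm = np.logaddexp(log_joint0, log_joint1)
    l_d = np.where(y == 1, log_joint1 - log_norm,              # discriminative (conditional)
                   log_joint0 - log_norm).sum()                # log-likelihood via alpha(theta)
    return -(lam * l_g + (1.0 - lam) * l_d)

def fit_gdt(X, y, lam):
    """Maximise the GDT objective for a given log-likelihood weight lam in [0, 1]."""
    p = X.shape[1]
    theta0 = np.concatenate([[0.0], X[y == 0].mean(axis=0), X[y == 1].mean(axis=0),
                             np.log(X.var(axis=0) + 1e-6)])
    res = minimize(neg_gdt_loglik, theta0, args=(X, y, lam), method="BFGS")
    return res.x
```

With lam=1 the objective reduces to the generative (joint) log-likelihood and with lam=0 to the conditional log-likelihood, matching the equivalences with LDA and LLR noted in the experiments section.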

$\Sigma_g(\theta)$ and $\Sigma_\lambda(\theta)$

Let $I_g(\theta)$ and $I_{y|x}(\theta)$ denote the following matrices, respectively: $$I_g(\theta)=\mathrm{E}_{(x,y)}\left\{\frac{\partial\ell_g(\theta)}{\partial\theta}\,\frac{\partial\ell_g(\theta)}{\partial\theta^{T}}\right\}\quad\text{and}\quad I_{y|x}(\theta)=\mathrm{E}_{(x,y)}\left\{\frac{\partial\ell_{y|x}(\theta)}{\partial\theta}\,\frac{\partial\ell_{y|x}(\theta)}{\partial\theta^{T}}\right\},$$ where $\mathrm{E}_{(x,y)}\{\cdot\}$ represents the expectation over p(x,y).

After some algebra, it follows that $$\frac{1}{n}I_{y|x}(\theta)=\int_{x} p(C_1|x)\,p(C_0|x)\,\frac{\partial\log r(\theta,\pi;x)}{\partial\theta}\,\frac{\partial\log r(\theta,\pi;x)}{\partial\theta^{T}}\,p(x)\,dx,$$ where $\log r(\theta,\pi;x)=\log[\{\pi_1 p(x|\theta_1)\}/\{\pi_0 p(x|\theta_0)\}]=g(x,\alpha)$ and $p(x)=\pi_1 p(x|\theta_1)+\pi_0 p(x|\theta_0)$.

In addition, considering $\ell_\lambda(\theta)=\lambda\,\ell_g(\theta)+(1-\lambda)\,\ell_{y|x}(\theta)$, we obtain $$U_\lambda(\theta)=\mathrm{E}_{(x,y)}\left\{-\frac{\partial^{2}\ell_\lambda(\theta)}{\partial\theta\,\partial\theta^{T}}\right\}=\lambda I_g(\theta)+(1-\lambda)I_{y|x}(\theta)$$ and, since E(
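For reference, and assuming no model misspecification together with the usual regularity conditions, standard M-estimation theory (sketched here in general terms, not the article's own derivation) gives the asymptotic covariance of the GDT estimator in sandwich form,

```latex
\Sigma_\lambda(\theta) \;\propto\; U_\lambda(\theta)^{-1}\, V_\lambda(\theta)\, U_\lambda(\theta)^{-1},
\qquad
V_\lambda(\theta) = \mathrm{E}_{(x,y)}\!\left\{
  \frac{\partial \ell_\lambda(\theta)}{\partial \theta}\,
  \frac{\partial \ell_\lambda(\theta)}{\partial \theta^{T}}\right\},
```

with the generative case $\Sigma_g(\theta)$ recovered at λ=1 and the discriminative case at λ=0.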

ARE for linear normal discrimination

The theoretical arguments about the ARE presented earlier are generic, in that they do not depend on the particular probabilistic model assumed for the data. In order to illustrate the relative performance of the discriminative, generative and GDT approaches, however, we need to follow the practice of the generative and GDT approaches, that is, to specify probability distributions for the two groups.

In the classification literature and practice, the most widely adopted generative model assumes two normally distributed

Experiments on classification performance of GDT

In order to implement a GDT, the data-generating process p(x|y) has to be specified. As was done in the simulation study by Bouchard and Triggs (2004), we assume that x|y follows multivariate normal distributions $N(\mu_k,\Lambda)$, $k=0,1$, with a common diagonal covariance matrix Λ across the groups. With this assumption, LDA-Λ and LLR are equivalent to the GDT at λ=1 and λ=0, respectively. Meanwhile, in order to investigate how the classification performance depends on the ‘log-likelihood weight’ λ, λ is
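The following sketch shows the kind of λ sweep this describes, reusing the hypothetical fit_gdt and unpack helpers from the earlier GDT sketch; the grid of weights, the train/test split and the error estimate are illustrative rather than the study's exact protocol.

```python
import numpy as np
from scipy.stats import norm
# fit_gdt and unpack are the hypothetical helpers from the earlier GDT sketch

def gdt_error_curve(X_train, y_train, X_test, y_test, lambdas=np.linspace(0.0, 1.0, 11)):
    """Fit the GDT for each log-likelihood weight lambda and report the test error;
    lambda = 1 corresponds to LDA-Lambda and lambda = 0 to LLR."""
    p = X_train.shape[1]
    errors = []
    for lam in lambdas:
        theta_hat = fit_gdt(X_train, y_train, lam)
        pi1, mu0, mu1, var = unpack(theta_hat, p)
        # discriminant function g(x, alpha(theta_hat)); allocate to C1 when positive
        g = (np.log(pi1 / (1.0 - pi1))
             + norm.logpdf(X_test, mu1, np.sqrt(var)).sum(axis=1)
             - norm.logpdf(X_test, mu0, np.sqrt(var)).sum(axis=1))
        y_hat = (g > 0).astype(int)
        errors.append(np.mean(y_hat != y_test))
    return lambdas, np.array(errors)
```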

Discussion

To our knowledge, there is as yet no theoretical analysis of the GDT from the perspective of asymptotic relative efficiency. Our asymptotic analysis, as in Efron (1975) and O’Neill (1980), assumed no model misspecification, partly because of technical difficulties that often occur in modelling diverse misspecifications in a generic way, and partly because in practice some diagnoses of the model and transformations of the data can be carried out to make the model better specified. Bearing this

Conclusions

The conclusions from our study are fourfold.

First, the GDT is a generative model integrating both discriminative and generative learning, so it is also subject to misspecification of the data-generating process p(x|y,θg), or otherwise of the joint distribution p(x,y|θ), and is hindered by complex optimisation.

Secondly, amongst the three approaches that we compare, the asymptotic efficiency of the GDT is higher than that of the discriminative approach but lower than that of the generative

Acknowledgments

The authors thank Guillaume Bouchard, the reviewer and the associate editor for their extensive comments and insightful suggestions that have reshaped and reinforced the manuscript. The work partly benefited from our participation in the Research Programme on ‘Statistical Theory and Methods for Complex, High-Dimensional Data’ at the Isaac Newton Institute for Mathematical Sciences in Cambridge. J.H.X. is also grateful for a Hutchison Whampoa-EPSRC Dorothy Hodgkin Postgraduate Award.
