Exact and approximate algorithms for variable selection in linear discriminant analysis

https://doi.org/10.1016/j.csda.2010.05.027

Abstract

Variable selection is a venerable problem in multivariate statistics. In the context of discriminant analysis, the goal is to select a subset of variables that accomplishes one of two objectives: (1) the provision of a parsimonious, yet descriptive, representation of group structure, or (2) the ability to correctly allocate new cases to groups. We present an exact (branch-and-bound) algorithm for variable selection in linear discriminant analysis that identifies subsets of variables that minimize Wilks’ Λ. An important feature of this algorithm is a variable reordering scheme that greatly reduces computation time. We also present an approximate procedure based on tabu search, which can be implemented for a variety of objective criteria designed for either the descriptive or allocation goals associated with discriminant analysis. The tabu search heuristic is especially useful for maximizing the hit ratio (i.e., the percentage of correctly classified cases). Computational results for the proposed methods are provided for two data sets from the literature.

Introduction

The problem of selecting a subset of variables from a larger candidate pool spans several areas of multivariate statistics (Beale et al., 1967, Duarte Silva, 2001, Fueda et al., 2009). Most notably, variable selection research abounds in areas such as multiple linear regression (Furnival and Wilson, 1974, Miller, 2002, Gatu and Kontoghiorghes, 2006), logistic regression (Pacheco et al., 2009), polynomial regression (Peixoto, 1987, Brusco and Steinley, in press, Brusco et al., 2009), principal component analysis (Jolliffe, 2002, Ch. 6; Cadima et al., 2004; Mori et al., 2007; Brusco et al., 2009), factor analysis (Kano and Harada, 2000, Hogarty et al., 2004), cluster analysis (Brusco and Cradit, 2001, Steinley and Brusco, 2008, Krzanowski and Hand, 2009), and discriminant analysis (McCabe, 1975, McKay and Campbell, 1982a, McKay and Campbell, 1982b, Pacheco et al., 2006).

In this paper, we focus on variable selection within the context of linear discriminant analysis (Fisher, 1936). Linear discriminant analysis software is commonly available in commercial statistical software packages, and remains widely used in business and scientific applications. Our intention is not to imply that variable selection is only important for linear discriminant analysis and not for alternative approaches such as quadratic discriminant analysis (Smith, 1947) or mathematical programming methods (Stam, 1997). Instead, we focus on linear discriminant analysis in light of the well-established variable selection history for this topic.

Although it is perhaps fair to say that the subset selection literature in discriminant analysis is not as extensive as it is in regression, during the past 50 years, there have been many attempts to improve variable selection in discriminant analysis (see, for example, Roy, 1958; Weiner and Dunn, 1966; Horton et al., 1968; Urbakh, 1971; McCabe, 1975; McLachlan, 1976; Habbema and Hermans, 1977; Murray, 1977; McHenry, 1978; Ganeshanandam and Krzanowski, 1989; Snappin and Knoke, 1989; Seaman and Young, 1990; Duarte Silva, 2001; Wood et al., 2005; Huberty and Olejnik, 2006, Ch. 6; Trendafilov and Jolliffe, 2007). One of the most important distinctions that has emerged from this research is that the goal of variable selection in discriminant analysis depends on whether the analyst’s objective is description or allocation (McKay and Campbell, 1982a). In the case of description, the goal is to obtain a parsimonious subset of Q variables from a candidate pool of P variables that separates the groups almost as well as when all of the P variables are used. When allocation is the objective, the goal is to obtain a subset of Q variables that maximizes the predictive power of the model for classifying new cases. Throughout the remainder of this paper, we adopt the nomenclature of Huberty (1984), who uses the terms ‘descriptive discriminant analysis’ (DDA) and ‘predictive discriminant analysis’ (PDA) to refer to the description and allocation objectives, respectively.

Regardless of whether the context of an application is DDA or PDA, there are several important issues pertaining to the variable selection problem for discriminant analysis. First, it is necessary to establish an objective criterion, such as minimization of Wilks’ Λ (McCabe, 1975) or maximization of the rate of correct classification (Habbema and Hermans, 1977), so as to facilitate the comparison of different subsets. Second, the development of computationally efficient methods for finding the optimal (or, at least, near-optimal) subset of Q variables is required. Third, there must be some mechanism for selecting the best value of Q on the range 1 ≤ Q ≤ P. This last issue is especially important when the objective criterion is a monotonically non-increasing (or non-decreasing) function of Q (e.g., Wilks’ Λ cannot increase when a variable is added).
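The monotonicity just noted is easy to check numerically. The following NumPy sketch (our own illustration, not code from the paper) computes Wilks’ Λ as det(W)/det(T), where W is the pooled within-group sums-of-squares-and-cross-products (SSCP) matrix and T is the total SSCP matrix:

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' Lambda = det(W) / det(T), where W is the pooled
    within-group SSCP matrix and T is the total SSCP matrix.
    Values near 0 indicate well-separated groups."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    D = X - X.mean(axis=0)
    T = D.T @ D                                  # total SSCP
    W = np.zeros_like(T)
    for g in np.unique(y):
        Dg = X[y == g] - X[y == g].mean(axis=0)
        W += Dg.T @ Dg                           # pooled within-group SSCP
    return np.linalg.det(W) / np.linalg.det(T)
```

Because adding a column to X can never increase this ratio, Λ alone cannot choose Q; some external rule for selecting Q is needed, as the text observes.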

Stepwise selection of variables for a discriminant analysis dates back more than 50 years (Kendall, 1957, Roy, 1958). In its most fundamental form, the stepwise approach successively adds variables, one at a time, on the basis of changes in measures of the separation of the groups. These changes are typically measured by F-statistics or analogous measures such as Wilks’ Λ. Stepwise discriminant analysis has been roundly criticized in the literature for its failure to obtain the best subset for either DDA or PDA (see, for example, McKay and Campbell, 1982a, Huberty, 1984). A seminal contribution to the variable selection problem in discriminant analysis was provided by McCabe (1975), who adapted Furnival’s (1971) all-possible-subsets regression approach for discriminant analysis. More specifically, McCabe demonstrated that fast updates of models were possible for computing Wilks’ Λ, enabling the evaluation of all 2^P − 1 models for P ≤ 20. Both McKay and Campbell (1982a) and Huberty (1984) advocated the use of McCabe’s all-possible-subsets approach of minimizing Wilks’ Λ in the context of DDA when computationally feasible. More recently, Duarte Silva (2001) demonstrated that Furnival and Wilson’s (1974) algorithm can be applied to obtain subsets that minimize Wilks’ Λ, extending feasibility to P > 20.

Although a variable subset obtained based on minimizing Wilks’ Λ might also perform satisfactorily for both DDA and PDA applications, Habbema and Hermans (1977) recommended maximization of the ‘hit ratio’ (i.e., the percentage of correctly classified cases) as a preferable objective criterion for the PDA context. An important advantage of this criterion is that it is not a monotonically non-decreasing function of Q and, therefore, the value of Q that maximizes the hit ratio can be selected. More recently, Pacheco et al. (2006) evaluated a variety of heuristic procedures for variable selection using maximization of the hit ratio as the objective function. Pacheco et al. reported that maximizing the hit ratio led to better predictive performance in comparison to surrogate criteria, such as Wilks’ Λ. They also observed that metaheuristics such as memetic algorithms (Moscato and Cotta, 2003), variable neighborhood search (Hansen and Mladenović, 2003), and tabu search (Glover and Laguna, 1993) led to better classification performance than traditional stepwise methods.
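A hit ratio is typically estimated by leave-one-out cross-validation. The sketch below is our own simplified illustration: it allocates each held-out case to the nearest group centroid, a stand-in for the full linear discriminant rule (which would use the pooled within-group covariance matrix):

```python
import numpy as np

def hit_ratio(X, y):
    """Leave-one-out hit ratio: each case is held out, the remaining
    cases define group centroids, and the held-out case is allocated
    to the nearest centroid (a simplified allocation rule)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i            # hold out case i
        Xtr, ytr = X[mask], y[mask]
        groups = np.unique(ytr)
        cents = np.array([Xtr[ytr == g].mean(axis=0) for g in groups])
        pred = groups[np.argmin(((cents - X[i]) ** 2).sum(axis=1))]
        hits += int(pred == y[i])
    return hits / len(y)
```

Unlike Wilks’ Λ, this criterion can fall when an uninformative variable is added, which is what makes it usable for choosing Q directly.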

In this paper, we develop variable selection methods that can be used in both DDA and PDA. First, we present an exact branch-and-bound algorithm for minimizing Wilks’ Λ that incorporates an effective reordering of the variable list prior to initiation of the search process. Placing the most important variables for minimizing Wilks’ Λ earlier in the list allows faster pruning of partial solutions. The proposed algorithm can be applied gainfully for DDA applications where the number of variables is 50 or fewer, and possibly for larger pools of candidate variables in some instances. For PDA applications, the solution obtained by the branch-and-bound algorithm can be used directly, or it can serve as a starting point for a tabu search algorithm designed to maximize the hit ratio. The tabu search heuristic allows the user to input a range (i.e., a minimum and maximum) of possible values for Q.
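The bounding idea behind such a branch-and-bound search can be sketched compactly. Because deleting a variable can only increase Wilks’ Λ, a search that starts from the full set and deletes variables can prune any partial set whose Λ already matches or exceeds the incumbent. The code below is a minimal sketch of that bound only; the paper’s algorithm additionally reorders the variable list, which is not reproduced here:

```python
import numpy as np

def wilks_lambda(X, y, cols):
    """Wilks' Lambda det(W)/det(T) restricted to the variables in cols."""
    Xs = np.asarray(X, dtype=float)[:, list(cols)]
    y = np.asarray(y)
    D = Xs - Xs.mean(axis=0)
    T = D.T @ D
    W = np.zeros_like(T)
    for g in np.unique(y):
        Dg = Xs[y == g] - Xs[y == g].mean(axis=0)
        W += Dg.T @ Dg
    return np.linalg.det(W) / np.linalg.det(T)

def bb_min_lambda(X, y, Q):
    """Branch-and-bound over Q-variable subsets: start from the full
    set and delete variables one at a time.  Deleting a variable can
    only increase Lambda, so any partial set whose Lambda already
    matches or exceeds the incumbent is pruned."""
    P = np.asarray(X).shape[1]
    best = [np.inf, None]

    def recurse(cols, start):
        lam = wilks_lambda(X, y, cols)
        if lam >= best[0]:
            return                               # bound: cannot improve
        if len(cols) == Q:
            best[0], best[1] = lam, cols
            return
        for k in range(start, len(cols)):        # delete in fixed order so
            recurse(cols[:k] + cols[k + 1:], k)  # each subset is visited once

    recurse(tuple(range(P)), 0)
    return best[1], best[0]
```

Because the bound is valid, the procedure is exact: it returns the same minimum as full enumeration, only faster when pruning fires.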

In Section 2, we provide a formal presentation of the linear discriminant function, allocation rules, Wilks’ Λ, and hit ratios. Section 3 contains a description of the branch-and-bound procedure for variable selection in DDA. The tabu search heuristic for PDA is presented in Section 4. Computational results for empirical data sets are reported in Section 5. The paper concludes with a brief summary in Section 6.

Section snippets

Variable selection in linear discriminant analysis

Fisher (1936, 1938) proposed a linear discriminant function approach for separating populations to the greatest extent possible. To describe Fisher’s approach, we use notation that is concordant with (but not identical to) notation used in textbook presentations of discriminant analysis (Hand, 1981, Johnson and Wichern, 2007, Ch. 11). We denote O = {o1, o2, …, oN} as a training set of N objects (e.g., observations, cases, subjects, etc.) with corresponding index set C = {1, 2, …, N}. There are G
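Although the snippet above is truncated, the allocation rule it introduces is standard, and a textbook version (equal priors assumed; our own sketch, not the paper’s code) can be written as follows. A new case is assigned to the group with the largest linear discriminant score computed from the group means and the pooled within-group covariance matrix:

```python
import numpy as np

def lda_allocate(X_train, y_train, x_new):
    """Allocate x_new to the group g maximizing the linear score
    d_g(x) = mu_g' S^{-1} x - 0.5 * mu_g' S^{-1} mu_g  (equal priors),
    where S is the pooled within-group covariance matrix."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train)
    groups = np.unique(y)
    n, p = X.shape
    S = np.zeros((p, p))
    means = {}
    for g in groups:
        Xg = X[y == g]
        means[g] = Xg.mean(axis=0)
        S += (Xg - means[g]).T @ (Xg - means[g])
    S /= (n - len(groups))                       # pooled covariance estimate
    Sinv = np.linalg.inv(S)

    def score(g):
        m = means[g]
        return m @ Sinv @ x_new - 0.5 * m @ Sinv @ m

    return max(groups, key=score)
```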

Discrete optimization problem

We denote R_Q ⊆ R as a subset consisting of 1 ≤ Q ≤ P variables for establishing discriminant functions. Additionally, we denote Λ(R_Q) as the value of Wilks’ Λ corresponding to subset R_Q. For any given value of Q, the goal is to select, from among all π_Q = P!/(Q!(P − Q)!) subsets, the subset such that Λ(R_Q) is minimized. More formally, the relevant discrete optimization problem for DDA can be posed as follows: Minimize Λ(R_Q), subject to R_Q ⊆ R and |R_Q| = Q. The subset that minimizes (5) is denoted as R_Q*.
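For small P, this optimization problem can be solved by direct enumeration of the π_Q subsets, as in the following sketch (our illustration of the problem statement, not the paper’s algorithm):

```python
import numpy as np
from itertools import combinations

def wilks_lambda(X, y, cols):
    """det(W)/det(T) restricted to the variables indexed by cols."""
    Xs = np.asarray(X, dtype=float)[:, list(cols)]
    y = np.asarray(y)
    D = Xs - Xs.mean(axis=0)
    T = D.T @ D
    W = np.zeros_like(T)
    for g in np.unique(y):
        Dg = Xs[y == g] - Xs[y == g].mean(axis=0)
        W += Dg.T @ Dg
    return np.linalg.det(W) / np.linalg.det(T)

def best_subset(X, y, Q):
    """Enumerate all P!/(Q!(P-Q)!) subsets of size Q and return the
    minimizer of Wilks' Lambda together with its criterion value."""
    P = np.asarray(X).shape[1]
    best = min(combinations(range(P), Q),
               key=lambda c: wilks_lambda(X, y, c))
    return best, wilks_lambda(X, y, best)
```

Since π_Q grows combinatorially in P, this brute-force approach is exactly what the branch-and-bound algorithm of Section 3 is designed to avoid.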

Discrete optimization problem

Although the implementation of the solution obtained by the branch-and-bound algorithm for DDA might also perform exceptionally well for PDA, we wanted to develop a program that would allow for further refinement of the subset to improve classification performance. We will assume that R_Q* is obtained as input from the branch-and-bound algorithm, and select parameters Qmin and Qmax such that Qmin ≤ Q ≤ Qmax. The goal is to identify a subset that consists of somewhere between Qmin and Qmax variables
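A compact tabu search of this kind can be sketched as below. All design details here (an add/drop neighborhood, a fixed tabu tenure, a simple aspiration rule, and a nearest-centroid stand-in for the discriminant allocation rule) are our simplifying assumptions for illustration, not the authors’ implementation:

```python
import numpy as np
from collections import deque

def hit_ratio(X, y, cols):
    """Leave-one-out hit ratio on the variables in cols, using a
    nearest-group-centroid allocation rule (a simplified stand-in
    for the full linear discriminant rule)."""
    Xs, y = np.asarray(X, dtype=float)[:, sorted(cols)], np.asarray(y)
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = Xs[mask], y[mask]
        groups = np.unique(ytr)
        cents = np.array([Xtr[ytr == g].mean(axis=0) for g in groups])
        pred = groups[np.argmin(((cents - Xs[i]) ** 2).sum(axis=1))]
        hits += int(pred == y[i])
    return hits / len(y)

def tabu_search(X, y, q_min, q_max, start, n_iter=30, tenure=3):
    """Tabu search over subsets with q_min <= |R| <= q_max.  Each move
    adds or drops one variable; a just-moved variable is tabu for
    `tenure` iterations unless the move improves on the incumbent
    (aspiration criterion)."""
    P = np.asarray(X).shape[1]
    current = frozenset(start)
    best, best_val = current, hit_ratio(X, y, current)
    tabu = deque(maxlen=tenure)
    for _ in range(n_iter):
        moves = []
        for j in range(P):
            cand = current - {j} if j in current else current | {j}
            if not q_min <= len(cand) <= q_max:
                continue
            val = hit_ratio(X, y, cand)
            if j in tabu and val <= best_val:
                continue                         # tabu move, no aspiration
            moves.append((val, j, cand))
        if not moves:
            break
        val, j, current = max(moves, key=lambda m: (m[0], -m[1]))
        tabu.append(j)
        if val > best_val:
            best, best_val = current, val
    return sorted(best), best_val
```

The tabu list lets the search accept non-improving moves without immediately cycling back, while the incumbent (best, best_val) guarantees the returned subset is never worse than the starting one.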

Data set #1: the “soil” data set

The first data set that we selected for analysis is the “soil” data set, originally reported by Horton et al. (1968) and subsequently analyzed by McCabe (1975) and Habbema and Hermans (1977). The data consist of N = 48 soil samples measured on each of P = 9 variables, most of which pertain to various mineral contents. The samples are subdivided into G = 12 groups, each of size 4, based on three levels of topographic position and four depths.

The results of McCabe (1975) clearly revealed that stepwise

Summary and extensions

We have presented variable selection methods for DDA and PDA. The method for DDA is an exact branch-and-bound algorithm that finds subsets that minimize Wilks’ Λ. The algorithm was extremely efficient in producing optimal subsets for an empirical data set with P=40 candidate predictors and is scalable for up to 50 predictors. In light of the fact that all-possible-subsets algorithms for minimizing Wilks’ Λ are typically restricted to candidate pools of roughly 20 to 30 variables (Huberty and

References (58)

  • M.J. Brusco et al.

    Branch-and-Bound Applications in Combinatorial Data Analysis

    (2005)
  • M.J. Brusco et al.

    Neighborhood search heuristics for selecting hierarchically well-formulated subsets in polynomial regression

    Naval Research Logistics

    (2010)
  • M.J. Brusco et al.

    Variable neighborhood search heuristics for selecting a subset of variables in principal component analysis

    Psychometrika

    (2009)
  • M.J. Brusco et al.

    An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression

    Technometrics

    (2009)
  • Z. Drezner et al.

    Tabu search model selection in multiple regression analysis

    Communications in Statistics

    (1999)
  • R.A. Fisher

    The use of multiple measurements in taxonomic problems

    Annals of Eugenics

    (1936)
  • R.A. Fisher

    The statistical utilization of multiple measurements

    Annals of Eugenics

    (1938)
  • K. Fueda et al.

    Variable selection in multivariate methods using global score estimation

    Computational Statistics

    (2009)
  • Y. Fujikoshi

    A criterion for variable selection in multiple discriminant analysis

    Hiroshima Mathematical Journal

    (1983)
  • G.M. Furnival

    All possible regressions with less computation

    Technometrics

    (1971)
  • G.M. Furnival et al.

    Regression by leaps and bounds

    Technometrics

    (1974)
  • S. Ganeshanandam et al.

    On selecting variables and assessing their performance in linear discriminant analysis

    Australian Journal of Statistics

    (1989)
  • C. Gatu et al.

    Branch-and-bound algorithms for computing the best-subset regression models

    Journal of Computational and Graphical Statistics

    (2006)
  • F. Glover et al.

    Tabu search

  • J.D.F. Habbema et al.

Selection of variables in discriminant analysis by F-statistic and error rate

    Technometrics

    (1977)
  • D.J. Hand

    Discrimination and Classification

    (1981)
  • P. Hansen et al.

    Variable neighborhood search

  • K.Y. Hogarty et al.

    Selection of variables in exploratory factor analysis: an empirical comparison of a stepwise and traditional approach

    Psychometrika

    (2004)
  • I.F. Horton et al.

    Multivariate-covariance and canonical analysis: a method for selecting the most effective discriminators in a multivariate situation

    Biometrics

    (1968)