Exact and approximate algorithms for variable selection in linear discriminant analysis
Introduction
The problem of selecting a subset of variables from a larger candidate pool spans several areas of multivariate statistics (Beale et al., 1967; Duarte Silva, 2001; Fueda et al., 2009). Most notably, variable selection research abounds in areas such as multiple linear regression (Furnival and Wilson, 1974; Miller, 2002; Gatu and Kontoghiorghes, 2006), logistic regression (Pacheco et al., 2009), polynomial regression (Peixoto, 1987; Brusco and Steinley, in press; Brusco et al., 2009), principal component analysis (Jolliffe, 2002, Ch. 6; Cadima et al., 2004; Mori et al., 2007; Brusco et al., 2009), factor analysis (Kano and Harada, 2000; Hogarty et al., 2004), cluster analysis (Brusco and Cradit, 2001; Steinley and Brusco, 2008; Krzanowski and Hand, 2009), and discriminant analysis (McCabe, 1975; McKay and Campbell, 1982a, 1982b; Pacheco et al., 2006).
In this paper, we focus on variable selection within the context of linear discriminant analysis (Fisher, 1936). Linear discriminant analysis software is commonly available in commercial statistical software packages, and remains widely used in business and scientific applications. Our intention is not to imply that variable selection is only important for linear discriminant analysis and not for alternative approaches such as quadratic discriminant analysis (Smith, 1947) or mathematical programming methods (Stam, 1997). Instead, we focus on linear discriminant analysis in light of the well-established variable selection history for this topic.
Although it is perhaps fair to say that the subset selection literature in discriminant analysis is not as extensive as it is in regression, during the past 50 years, there have been many attempts to improve variable selection in discriminant analysis (see, for example, Roy, 1958; Weiner and Dunn, 1966; Horton et al., 1968; Urbakh, 1971; McCabe, 1975; McLachlan, 1976; Habbema and Hermans, 1977; Murray, 1977; McHenry, 1978; Ganeshanandam and Krzanowski, 1989; Snapinn and Knoke, 1989; Seaman and Young, 1990; Duarte Silva, 2001; Wood et al., 2005; Huberty and Olejnik, 2006, Ch. 6; Trendafilov and Jolliffe, 2007). One of the most important distinctions that has emerged from this research is that the goal of variable selection in discriminant analysis depends on whether the analyst’s objective is description or allocation (McKay and Campbell, 1982a). In the case of description, the goal is to obtain a parsimonious subset of variables from a candidate pool that separates the groups almost as well as when all of the variables are used. When allocation is the objective, the goal is to obtain a subset of variables that maximizes the predictive power of the model for classifying new cases. Throughout the remainder of this paper, we adopt the nomenclature of Huberty (1984), who uses the terms ‘descriptive discriminant analysis’ (DDA) and ‘predictive discriminant analysis’ (PDA) to refer to the description and allocation objectives, respectively.
Regardless of whether the context of an application is DDA or PDA, several important issues pertain to the variable selection problem for discriminant analysis. First, it is necessary to establish an objective criterion, such as minimization of Wilks’ Λ (McCabe, 1975) or maximization of the rate of correct classification (Habbema and Hermans, 1977), so as to facilitate the comparison of different subsets. Second, computationally efficient methods must be developed for finding the optimal (or, at least, near-optimal) subset of variables. Third, there must be some mechanism for selecting the best value of the subset size, p. This last issue is especially important when the objective criterion is a monotonically non-increasing (or non-decreasing) function of p (e.g., Wilks’ Λ cannot increase when a variable is added).
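To make the first criterion concrete, Wilks’ Λ for a variable subset is the ratio of determinants of the within-groups and total sums-of-squares-and-cross-products (SSCP) matrices restricted to that subset. The sketch below is our own minimal numpy illustration, not code from the paper; all function and variable names are ours.

```python
import numpy as np

def sscp_matrices(X, y):
    """Within-groups (W) and total (T) SSCP matrices for data X and group labels y."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                       # total SSCP about the grand mean
    W = np.zeros_like(T)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        W += D.T @ D                    # pooled within-groups SSCP
    return W, T

def wilks_lambda(W, T, subset):
    """Wilks' Lambda = det(W_S) / det(T_S) for the variables indexed by `subset`."""
    idx = np.ix_(subset, subset)
    return np.linalg.det(W[idx]) / np.linalg.det(T[idx])
```

Because T = W + B with B positive semi-definite, Λ always lies in (0, 1], and adding a variable to a subset can never increase Λ.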
Stepwise selection of variables for a discriminant analysis dates back more than 50 years (Kendall, 1957; Roy, 1958). In its most fundamental form, the stepwise approach successively adds variables, one at a time, on the basis of changes in measures of the separation of the groups. These changes are typically measured by F-statistics or analogous measures such as Wilks’ Λ. Stepwise discriminant analysis has been roundly criticized in the literature for its failure to obtain the best subset for either DDA or PDA (see, for example, McKay and Campbell, 1982a; Huberty, 1984). A seminal contribution to the variable selection problem in discriminant analysis was provided by McCabe (1975), who adapted Furnival’s (1971) all-possible-subsets regression approach for discriminant analysis. More specifically, McCabe demonstrated that fast model updates were possible when computing Wilks’ Λ, enabling the evaluation of all possible subsets for modest numbers of candidate variables. Both McKay and Campbell (1982a) and Huberty (1984) advocated the use of McCabe’s all-possible-subsets approach of minimizing Wilks’ Λ in the context of DDA when computationally feasible. More recently, Duarte Silva (2001) demonstrated that Furnival and Wilson’s (1974) algorithm can be applied to obtain subsets that minimize Wilks’ Λ, extending feasibility to larger candidate pools.
Although a variable subset obtained by minimizing Wilks’ Λ might also perform satisfactorily for both DDA and PDA applications, Habbema and Hermans (1977) recommended maximization of the ‘hit ratio’ (i.e., the percentage of correctly classified cases) as a preferable objective criterion for the PDA context. An important advantage of this criterion is that it is not a monotonically non-decreasing function of p and, therefore, the value of p that maximizes the hit ratio can be selected. More recently, Pacheco et al. (2006) evaluated a variety of heuristic procedures for variable selection using maximization of the hit ratio as the objective function. Pacheco et al. reported that maximizing the hit ratio led to better predictive performance in comparison to surrogate criteria such as Wilks’ Λ. They also observed that metaheuristics such as memetic algorithms (Moscato and Cotta, 2003), variable neighborhood search (Hansen and Mladenović, 2003), and tabu search (Glover and Laguna, 1993) led to better classification performance than traditional stepwise methods.
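The hit ratio can be computed by applying the classical linear allocation rule, which assigns each case to the group with the largest linear classification score under a pooled within-groups covariance matrix. The sketch below is our own simplified illustration assuming equal priors and resubstitution (training-sample) accuracy; particular studies may instead use unequal priors or cross-validated estimates.

```python
import numpy as np

def hit_ratio(X, y, subset):
    """Resubstitution hit ratio of the linear allocation rule on a variable subset,
    using the pooled within-groups covariance matrix and equal priors."""
    Z = X[:, list(subset)]
    groups = np.unique(y)
    n, G = len(y), len(groups)
    means = np.array([Z[y == g].mean(axis=0) for g in groups])
    W = sum((Z[y == g] - m).T @ (Z[y == g] - m) for g, m in zip(groups, means))
    Sinv = np.linalg.inv(W / (n - G))   # inverse pooled covariance estimate
    # linear score for group g: x' Sinv mu_g - 0.5 * mu_g' Sinv mu_g
    scores = Z @ Sinv @ means.T - 0.5 * np.einsum('ij,jk,ik->i', means, Sinv, means)
    pred = groups[np.argmax(scores, axis=1)]
    return np.mean(pred == y)
```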
In this paper, we develop variable selection methods that can be used for both DDA and PDA. First, we present an exact branch-and-bound algorithm for minimizing Wilks’ Λ that incorporates an effective reordering of the variable list prior to initiation of the search process. Placing the variables most important for minimizing Wilks’ Λ earlier in the list allows faster pruning of partial solutions. The proposed algorithm can be applied gainfully for DDA applications where the number of candidate variables is 50 or fewer, and possibly for larger pools in some instances. For PDA applications, the solution obtained by the branch-and-bound algorithm can be used directly, or it can serve as a starting point for a tabu search algorithm designed to maximize the hit ratio. The tabu search heuristic allows the user to input a range (i.e., a minimum and maximum) of possible values for p.
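The pruning logic in such a search rests on the monotonicity of Wilks’ Λ: because Λ cannot increase as variables are added, the Λ of a partial subset joined with all remaining candidates lower-bounds every completion of that branch. The following is our own bare-bones Python sketch of this idea, not the authors’ implementation; the ordering and bounding here are deliberately simplified.

```python
import itertools
import numpy as np

def bb_min_lambda(W, T, p):
    """Branch-and-bound search for the size-p subset minimizing Wilks' Lambda.
    Bound: Lambda of (chosen + all remaining candidates) lower-bounds every
    completion of the branch, because Lambda never increases as variables are added."""
    q = W.shape[0]

    def lam(S):
        idx = np.ix_(S, S)
        return np.linalg.det(W[idx]) / np.linalg.det(T[idx])

    # reorder so variables with the smallest univariate Lambda are tried first
    order = sorted(range(q), key=lambda j: lam([j]))
    best = (np.inf, None)

    def recurse(chosen, start):
        nonlocal best
        if len(chosen) == p:
            v = lam(chosen)
            if v < best[0]:
                best = (v, sorted(chosen))
            return
        remaining = order[start:]
        if len(chosen) + len(remaining) < p:
            return                        # too few variables left to reach size p
        if lam(chosen + remaining) >= best[0]:
            return                        # bound: no completion can beat incumbent
        for i, j in enumerate(remaining):
            recurse(chosen + [j], start + i + 1)

    recurse([], 0)
    return best
```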
In Section 2, we provide a formal presentation of the linear discriminant function, allocation rules, Wilks’ Λ, and hit ratios. Section 3 contains a description of the branch-and-bound procedure for variable selection in DDA. The tabu search heuristic for PDA is presented in Section 4. Computational results for empirical data sets are reported in Section 5. The paper concludes with a brief summary in Section 6.
Section snippets
Variable selection in linear discriminant analysis
Fisher (1936, 1938) proposed a linear discriminant function approach for separating populations to the greatest extent possible. To describe Fisher’s approach, we use notation that is concordant with (but not identical to) notation used in textbook presentations of discriminant analysis (Hand, 1981; Johnson and Wichern, 2007, Ch. 11). We consider a training set of objects (e.g., observations, cases, subjects, etc.) with a corresponding index set…
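As a concrete illustration of Fisher’s approach, the discriminant directions can be obtained (in one standard formulation, not necessarily the paper’s exact notation) as the leading eigenvectors of W⁻¹B, where B and W are the between- and within-groups SSCP matrices. A minimal numpy sketch of our own:

```python
import numpy as np

def fisher_directions(X, y):
    """Discriminant directions as eigenvectors of W^{-1} B, with B and W the
    between- and within-groups SSCP matrices (a standard formulation of
    Fisher's linear discriminant)."""
    grand = X.mean(axis=0)
    q = X.shape[1]
    B = np.zeros((q, q))
    W = np.zeros((q, q))
    for g in np.unique(y):
        Xg = X[y == g]
        mg = Xg.mean(axis=0)
        B += len(Xg) * np.outer(mg - grand, mg - grand)   # between-groups SSCP
        D = Xg - mg
        W += D.T @ D                                      # within-groups SSCP
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]                  # largest eigenvalue first
    return evals.real[order], evecs.real[:, order]
```

With G groups, at most G − 1 eigenvalues are nonzero, so a two-group problem yields a single discriminant direction.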
Discrete optimization problem
We denote S as a subset consisting of p variables for establishing discriminant functions, and Λ(S) as the value of Wilks’ Λ corresponding to subset S. For any given value of p, the goal is to select the subset S, from among all subsets of size p, such that Λ(S) is minimized. More formally, the relevant discrete optimization problem for DDA can be posed as follows: minimize Λ(S) subject to |S| = p. (5) The subset that minimizes (5) is denoted S*.
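For small candidate pools, problem (5) can be solved by brute force over all size-p subsets. The sketch below is our own illustrative exhaustive baseline (the branch-and-bound algorithm of Section 3 avoids this full enumeration):

```python
import itertools
import numpy as np

def best_subset_dda(W, T, p):
    """Exhaustive search over all size-p subsets of the candidate variables,
    returning (Lambda, subset) with minimum Wilks' Lambda. Feasible only for
    small candidate pools, since the number of subsets grows combinatorially."""
    q = W.shape[0]

    def lam(S):
        idx = np.ix_(S, S)
        return np.linalg.det(W[idx]) / np.linalg.det(T[idx])

    return min(((lam(list(S)), list(S))
                for S in itertools.combinations(range(q), p)),
               key=lambda t: t[0])
```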
Discrete optimization problem
Although the subset obtained by the branch-and-bound algorithm for DDA might also perform exceptionally well for PDA, we wanted to develop a program that would allow for further refinement of the subset to improve classification performance. We assume that the DDA subset is obtained as input from the branch-and-bound algorithm, and select parameters pmin and pmax such that pmin ≤ pmax. The goal is to identify a subset that consists of somewhere between pmin and pmax variables…
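A bare-bones sketch of such a tabu search, again our own simplified illustration rather than the algorithm of Section 4: each move adds or drops a single variable while keeping the subset size between the minimum and maximum sizes pmin and pmax, recently moved variables are tabu for a fixed tenure, and an aspiration criterion overrides tabu status when a move beats the best hit ratio found so far.

```python
import numpy as np

def hit_ratio(X, y, subset):
    """Resubstitution hit ratio of the linear allocation rule (equal priors)."""
    Z = X[:, list(subset)]
    groups = np.unique(y)
    n, G = len(y), len(groups)
    means = np.array([Z[y == g].mean(axis=0) for g in groups])
    W = sum((Z[y == g] - m).T @ (Z[y == g] - m) for g, m in zip(groups, means))
    Sinv = np.linalg.inv(W / (n - G))
    scores = Z @ Sinv @ means.T - 0.5 * np.einsum('ij,jk,ik->i', means, Sinv, means)
    return np.mean(groups[np.argmax(scores, axis=1)] == y)

def tabu_search(X, y, start, pmin, pmax, iters=100, tenure=5):
    """Tabu search over variable subsets maximizing the hit ratio. Moves add or
    drop one variable subject to pmin <= |S| <= pmax; a moved variable is tabu
    for `tenure` iterations unless the move beats the incumbent (aspiration)."""
    q = X.shape[1]
    current = set(start)
    best, best_hr = set(current), hit_ratio(X, y, current)
    tabu = {}                                    # variable -> iteration it frees up
    for it in range(iters):
        moves = []
        for j in range(q):
            if j in current and len(current) > pmin:
                moves.append((current - {j}, j))     # drop move
            elif j not in current and len(current) < pmax:
                moves.append((current | {j}, j))     # add move
        best_move = None
        for cand, j in moves:
            hr = hit_ratio(X, y, cand)
            if tabu.get(j, -1) > it and hr <= best_hr:
                continue                             # tabu and fails aspiration
            if best_move is None or hr > best_move[0]:
                best_move = (hr, cand, j)
        if best_move is None:
            break                                    # all moves tabu; stop
        hr, current, j = best_move
        tabu[j] = it + tenure
        if hr > best_hr:
            best, best_hr = set(current), hr
    return sorted(best), best_hr
```

Note that, unlike a greedy stepwise procedure, the search accepts the best admissible move even when it worsens the current hit ratio, which lets it escape local optima while the tabu list prevents immediate cycling.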
Data set #1: the “soil” data set
The first data set that we selected for analysis is the “soil” data set, originally reported by Horton et al. (1968) and subsequently analyzed by McCabe (1975) and Habbema and Hermans (1977). The data consist of soil samples, each measured on a common set of variables, most of which pertain to various mineral contents. The samples are subdivided into 12 groups, each of size 4, based on three levels of topological position and four depths.
The results of McCabe (1975) clearly revealed that stepwise
Summary and extensions
We have presented variable selection methods for DDA and PDA. The method for DDA is an exact branch-and-bound algorithm that finds subsets minimizing Wilks’ Λ. The algorithm was extremely efficient in producing optimal subsets for an empirical data set and is scalable to candidate pools of up to 50 predictors. In light of the fact that all-possible-subsets algorithms for minimizing Wilks’ Λ are typically restricted to candidate pools of roughly 20 to 30 variables (Huberty and
References (58)
- Cadima et al., Computational aspects of algorithms for variable selection in the context of principal components, Computational Statistics and Data Analysis (2004)
- Duarte Silva, Efficient variable screening for multivariate analysis, Journal of Multivariate Analysis (2001)
- A simple method for screening variables before clustering of microarray data, Computational Statistics and Data Analysis (2009)
- Pacheco et al., A variable selection method based on tabu search for logistic regression models, European Journal of Operational Research (2009)
- Pacheco et al., Analysis of new variable selection methods for discriminant analysis, Computational Statistics and Data Analysis (2006)
- Trendafilov and Jolliffe, DALASS: variable selection in discriminant analysis via the lasso, Computational Statistics and Data Analysis (2007)
- Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control (1974)
- Beale et al., The discarding of variables in multivariate analysis, Biometrika (1967)
- Breiman et al., Classification and Regression Trees (1984)
- Brusco and Cradit, A variable-selection heuristic for K-means clustering, Psychometrika (2001)