Exact and approximate algorithms for variable selection in linear discriminant analysis
Introduction
The problem of selecting a subset of variables from a larger candidate pool spans several areas of multivariate statistics (Beale et al., 1967; Duarte Silva, 2001; Fueda et al., 2009). Most notably, variable selection research abounds in areas such as multiple linear regression (Furnival and Wilson, 1974; Miller, 2002; Gatu and Kontoghiorghes, 2006), logistic regression (Pacheco et al., 2009), polynomial regression (Peixoto, 1987; Brusco and Steinley, in press; Brusco et al., 2009), principal component analysis (Jolliffe, 2002, Ch. 6; Cadima et al., 2004; Mori et al., 2007; Brusco et al., 2009), factor analysis (Kano and Harada, 2000; Hogarty et al., 2004), cluster analysis (Brusco and Cradit, 2001; Steinley and Brusco, 2008; Krzanowski and Hand, 2009), and discriminant analysis (McCabe, 1975; McKay and Campbell, 1982a, 1982b; Pacheco et al., 2006).
In this paper, we focus on variable selection within the context of linear discriminant analysis (Fisher, 1936). Linear discriminant analysis software is commonly available in commercial statistical software packages, and remains widely used in business and scientific applications. Our intention is not to imply that variable selection is only important for linear discriminant analysis and not for alternative approaches such as quadratic discriminant analysis (Smith, 1947) or mathematical programming methods (Stam, 1997). Instead, we focus on linear discriminant analysis in light of the well-established variable selection history for this topic.
Although it is perhaps fair to say that the subset selection literature in discriminant analysis is not as extensive as it is in regression, during the past 50 years, there have been many attempts to improve variable selection in discriminant analysis (see, for example, Roy, 1958; Weiner and Dunn, 1966; Horton et al., 1968; Urbakh, 1971; McCabe, 1975; McLachlan, 1976; Habbema and Hermans, 1977; Murray, 1977; McHenry, 1978; Ganeshanandam and Krzanowski, 1989; Snapinn and Knoke, 1989; Seaman and Young, 1990; Duarte Silva, 2001; Wood et al., 2005; Huberty and Olejnik, 2006, Ch. 6; Trendafilov and Jolliffe, 2007). One of the most important distinctions that has emerged from this research is that the goal of variable selection in discriminant analysis depends on whether the analyst’s objective is description or allocation (McKay and Campbell, 1982a). In the case of description, the goal is to obtain a parsimonious subset of variables from a candidate pool that separates the groups almost as well as when all of the variables are used. When allocation is the objective, the goal is to obtain a subset of variables that maximizes the predictive power of the model for classifying new cases. Throughout the remainder of this paper, we adopt the nomenclature of Huberty (1984), who uses the terms ‘descriptive discriminant analysis’ (DDA) and ‘predictive discriminant analysis’ (PDA) to refer to the description and allocation objectives, respectively.
Regardless of whether the context of an application is DDA or PDA, several important issues pertain to the variable selection problem for discriminant analysis. First, it is necessary to establish an objective criterion, such as minimization of Wilks’ Λ (McCabe, 1975) or maximization of the rate of correct classification (Habbema and Hermans, 1977), so as to facilitate the comparison of different subsets. Second, computationally efficient methods must be developed for finding the optimal (or, at least, near-optimal) subset of variables. Third, there must be some mechanism for selecting the best value of the subset size, p. This last issue is especially important when the objective criterion is a monotonically non-increasing (or non-decreasing) function of p (e.g., Wilks’ Λ cannot increase when a variable is added).
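To make the first criterion concrete, Wilks’ Λ for a variable subset is the ratio of determinants of the within-groups and total sums-of-squares-and-cross-products (SSCP) matrices restricted to that subset. The sketch below is our own minimal numpy illustration, not code from the paper; all function and variable names are ours.

```python
import numpy as np

def sscp_matrices(X, y):
    """Within-groups (W) and total (T) SSCP matrices for data X and group labels y."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                       # total SSCP about the grand mean
    W = np.zeros_like(T)
    for g in np.unique(y):
        D = X[y == g] - X[y == g].mean(axis=0)
        W += D.T @ D                    # pooled within-groups SSCP
    return W, T

def wilks_lambda(W, T, subset):
    """Wilks' Lambda = det(W_S) / det(T_S) for the variables indexed by `subset`."""
    idx = np.ix_(subset, subset)
    return np.linalg.det(W[idx]) / np.linalg.det(T[idx])
```

Because T = W + B with B positive semi-definite, Λ always lies in (0, 1], and adding a variable to a subset can never increase Λ.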
Stepwise selection of variables for a discriminant analysis dates back more than 50 years (Kendall, 1957; Roy, 1958). In its most fundamental form, the stepwise approach successively adds variables, one at a time, on the basis of changes in measures of the separation of the groups. These changes are typically measured by F-statistics or analogous measures such as Wilks’ Λ. Stepwise discriminant analysis has been roundly criticized in the literature for its failure to obtain the best subset for either DDA or PDA (see, for example, McKay and Campbell, 1982a; Huberty, 1984). A seminal contribution to the variable selection problem in discriminant analysis was provided by McCabe (1975), who adapted Furnival’s (1971) all-possible-subsets regression approach for discriminant analysis. More specifically, McCabe demonstrated that fast model updates were possible when computing Wilks’ Λ, enabling the evaluation of all possible subsets for modest numbers of candidate variables. Both McKay and Campbell (1982a) and Huberty (1984) advocated the use of McCabe’s all-possible-subsets approach of minimizing Wilks’ Λ in the context of DDA when computationally feasible. More recently, Duarte Silva (2001) demonstrated that Furnival and Wilson’s (1974) algorithm can be applied to obtain subsets that minimize Wilks’ Λ, extending feasibility to larger candidate pools.
Although a variable subset obtained by minimizing Wilks’ Λ might also perform satisfactorily for both DDA and PDA applications, Habbema and Hermans (1977) recommended maximization of the ‘hit ratio’ (i.e., the percentage of correctly classified cases) as a preferable objective criterion for the PDA context. An important advantage of this criterion is that it is not a monotonically non-decreasing function of p and, therefore, the value of p that maximizes the hit ratio can be selected. More recently, Pacheco et al. (2006) evaluated a variety of heuristic procedures for variable selection using maximization of the hit ratio as the objective function. Pacheco et al. reported that maximizing the hit ratio led to better predictive performance in comparison to surrogate criteria such as Wilks’ Λ. They also observed that metaheuristics such as memetic algorithms (Moscato and Cotta, 2003), variable neighborhood search (Hansen and Mladenović, 2003), and tabu search (Glover and Laguna, 1993) led to better classification performance than traditional stepwise methods.
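The hit ratio can be computed by applying the classical linear allocation rule, which assigns each case to the group with the largest linear classification score under a pooled within-groups covariance matrix. The sketch below is our own simplified illustration assuming equal priors and resubstitution (training-sample) accuracy; particular studies may instead use unequal priors or cross-validated estimates.

```python
import numpy as np

def hit_ratio(X, y, subset):
    """Resubstitution hit ratio of the linear allocation rule on a variable subset,
    using the pooled within-groups covariance matrix and equal priors."""
    Z = X[:, list(subset)]
    groups = np.unique(y)
    n, G = len(y), len(groups)
    means = np.array([Z[y == g].mean(axis=0) for g in groups])
    W = sum((Z[y == g] - m).T @ (Z[y == g] - m) for g, m in zip(groups, means))
    Sinv = np.linalg.inv(W / (n - G))   # inverse pooled covariance estimate
    # linear score for group g: x' Sinv mu_g - 0.5 * mu_g' Sinv mu_g
    scores = Z @ Sinv @ means.T - 0.5 * np.einsum('ij,jk,ik->i', means, Sinv, means)
    pred = groups[np.argmax(scores, axis=1)]
    return np.mean(pred == y)
```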
In this paper, we develop variable selection methods that can be used for both DDA and PDA. First, we present an exact branch-and-bound algorithm for minimizing Wilks’ Λ that incorporates an effective reordering of the variable list prior to initiation of the search process. Placing the variables most important for minimizing Wilks’ Λ earlier in the list allows faster pruning of partial solutions. The proposed algorithm can be applied gainfully for DDA applications where the number of candidate variables is 50 or fewer, and possibly for larger pools in some instances. For PDA applications, the solution obtained by the branch-and-bound algorithm can be used directly, or it can serve as a starting point for a tabu search algorithm designed to maximize the hit ratio. The tabu search heuristic allows the user to input a range (i.e., a minimum and maximum) of possible values for p.
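The pruning logic in such a search rests on the monotonicity of Wilks’ Λ: because Λ cannot increase as variables are added, the Λ of a partial subset joined with all remaining candidates lower-bounds every completion of that branch. The following is our own bare-bones Python sketch of this idea, not the authors’ implementation; the ordering and bounding here are deliberately simplified.

```python
import itertools
import numpy as np

def bb_min_lambda(W, T, p):
    """Branch-and-bound search for the size-p subset minimizing Wilks' Lambda.
    Bound: Lambda of (chosen + all remaining candidates) lower-bounds every
    completion of the branch, because Lambda never increases as variables are added."""
    q = W.shape[0]

    def lam(S):
        idx = np.ix_(S, S)
        return np.linalg.det(W[idx]) / np.linalg.det(T[idx])

    # reorder so variables with the smallest univariate Lambda are tried first
    order = sorted(range(q), key=lambda j: lam([j]))
    best = (np.inf, None)

    def recurse(chosen, start):
        nonlocal best
        if len(chosen) == p:
            v = lam(chosen)
            if v < best[0]:
                best = (v, sorted(chosen))
            return
        remaining = order[start:]
        if len(chosen) + len(remaining) < p:
            return                        # too few variables left to reach size p
        if lam(chosen + remaining) >= best[0]:
            return                        # bound: no completion can beat incumbent
        for i, j in enumerate(remaining):
            recurse(chosen + [j], start + i + 1)

    recurse([], 0)
    return best
```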
In Section 2, we provide a formal presentation of the linear discriminant function, allocation rules, Wilks’ Λ, and hit ratios. Section 3 contains a description of the branch-and-bound procedure for variable selection in DDA. The tabu search heuristic for PDA is presented in Section 4. Computational results for empirical data sets are reported in Section 5. The paper concludes with a brief summary in Section 6.
Section snippets
Variable selection in linear discriminant analysis
Fisher (1936, 1938) proposed a linear discriminant function approach for separating populations to the greatest extent possible. To describe Fisher’s approach, we use notation that is concordant with (but not identical to) notation used in textbook presentations of discriminant analysis (Hand, 1981; Johnson and Wichern, 2007, Ch. 11). We consider a training set of objects (e.g., observations, cases, subjects, etc.) with a corresponding index set…
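As a concrete illustration of Fisher’s approach, the discriminant directions can be obtained (in one standard formulation, not necessarily the paper’s exact notation) as the leading eigenvectors of W⁻¹B, where B and W are the between- and within-groups SSCP matrices. A minimal numpy sketch of our own:

```python
import numpy as np

def fisher_directions(X, y):
    """Discriminant directions as eigenvectors of W^{-1} B, with B and W the
    between- and within-groups SSCP matrices (a standard formulation of
    Fisher's linear discriminant)."""
    grand = X.mean(axis=0)
    q = X.shape[1]
    B = np.zeros((q, q))
    W = np.zeros((q, q))
    for g in np.unique(y):
        Xg = X[y == g]
        mg = Xg.mean(axis=0)
        B += len(Xg) * np.outer(mg - grand, mg - grand)   # between-groups SSCP
        D = Xg - mg
        W += D.T @ D                                      # within-groups SSCP
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]                  # largest eigenvalue first
    return evals.real[order], evecs.real[:, order]
```

With G groups, at most G − 1 eigenvalues are nonzero, so a two-group problem yields a single discriminant direction.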
Discrete optimization problem
We denote S as a subset consisting of p variables for establishing discriminant functions, and Λ(S) as the value of Wilks’ Λ corresponding to subset S. For any given value of p, the goal is to select the subset S, from among all subsets of size p, such that Λ(S) is minimized. More formally, the relevant discrete optimization problem for DDA can be posed as follows: minimize Λ(S) subject to |S| = p. (5) The subset that minimizes (5) is denoted S*.
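For small candidate pools, problem (5) can be solved by brute force over all size-p subsets. The sketch below is our own illustrative exhaustive baseline (the branch-and-bound algorithm of Section 3 avoids this full enumeration):

```python
import itertools
import numpy as np

def best_subset_dda(W, T, p):
    """Exhaustive search over all size-p subsets of the candidate variables,
    returning (Lambda, subset) with minimum Wilks' Lambda. Feasible only for
    small candidate pools, since the number of subsets grows combinatorially."""
    q = W.shape[0]

    def lam(S):
        idx = np.ix_(S, S)
        return np.linalg.det(W[idx]) / np.linalg.det(T[idx])

    return min(((lam(list(S)), list(S))
                for S in itertools.combinations(range(q), p)),
               key=lambda t: t[0])
```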
Discrete optimization problem
Although the subset obtained by the branch-and-bound algorithm for DDA might also perform exceptionally well for PDA, we wanted to develop a program that would allow for further refinement of the subset to improve classification performance. We assume that the DDA subset is obtained as input from the branch-and-bound algorithm, and select parameters pmin and pmax such that pmin ≤ pmax. The goal is to identify a subset that consists of somewhere between pmin and pmax variables…
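A bare-bones sketch of such a tabu search, again our own simplified illustration rather than the algorithm of Section 4: each move adds or drops a single variable while keeping the subset size between the minimum and maximum sizes pmin and pmax, recently moved variables are tabu for a fixed tenure, and an aspiration criterion overrides tabu status when a move beats the best hit ratio found so far.

```python
import numpy as np

def hit_ratio(X, y, subset):
    """Resubstitution hit ratio of the linear allocation rule (equal priors)."""
    Z = X[:, list(subset)]
    groups = np.unique(y)
    n, G = len(y), len(groups)
    means = np.array([Z[y == g].mean(axis=0) for g in groups])
    W = sum((Z[y == g] - m).T @ (Z[y == g] - m) for g, m in zip(groups, means))
    Sinv = np.linalg.inv(W / (n - G))
    scores = Z @ Sinv @ means.T - 0.5 * np.einsum('ij,jk,ik->i', means, Sinv, means)
    return np.mean(groups[np.argmax(scores, axis=1)] == y)

def tabu_search(X, y, start, pmin, pmax, iters=100, tenure=5):
    """Tabu search over variable subsets maximizing the hit ratio. Moves add or
    drop one variable subject to pmin <= |S| <= pmax; a moved variable is tabu
    for `tenure` iterations unless the move beats the incumbent (aspiration)."""
    q = X.shape[1]
    current = set(start)
    best, best_hr = set(current), hit_ratio(X, y, current)
    tabu = {}                                    # variable -> iteration it frees up
    for it in range(iters):
        moves = []
        for j in range(q):
            if j in current and len(current) > pmin:
                moves.append((current - {j}, j))     # drop move
            elif j not in current and len(current) < pmax:
                moves.append((current | {j}, j))     # add move
        best_move = None
        for cand, j in moves:
            hr = hit_ratio(X, y, cand)
            if tabu.get(j, -1) > it and hr <= best_hr:
                continue                             # tabu and fails aspiration
            if best_move is None or hr > best_move[0]:
                best_move = (hr, cand, j)
        if best_move is None:
            break                                    # all moves tabu; stop
        hr, current, j = best_move
        tabu[j] = it + tenure
        if hr > best_hr:
            best, best_hr = set(current), hr
    return sorted(best), best_hr
```

Note that, unlike a greedy stepwise procedure, the search accepts the best admissible move even when it worsens the current hit ratio, which lets it escape local optima while the tabu list prevents immediate cycling.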
Data set #1: the “soil” data set
The first data set that we selected for analysis is the “soil” data set, originally reported by Horton et al. (1968) and subsequently analyzed by McCabe (1975) and Habbema and Hermans (1977). The data consist of soil samples, each measured on a common set of variables, most of which pertain to various mineral contents. The samples are subdivided into 12 groups, each of size 4, based on three levels of topological position and four depths.
The results of McCabe (1975) clearly revealed that stepwise
Summary and extensions
We have presented variable selection methods for DDA and PDA. The method for DDA is an exact branch-and-bound algorithm that finds subsets minimizing Wilks’ Λ. The algorithm was extremely efficient in producing optimal subsets for an empirical data set and is scalable to candidate pools of up to 50 predictors. In light of the fact that all-possible-subsets algorithms for minimizing Wilks’ Λ are typically restricted to candidate pools of roughly 20 to 30 variables (Huberty and
References (58)
- Cadima et al., Computational aspects of algorithms for variable selection in the context of principal components, Computational Statistics and Data Analysis (2004)
- Duarte Silva, Efficient variable screening for multivariate analysis, Journal of Multivariate Analysis (2001)
- A simple method for screening variables before clustering of microarray data, Computational Statistics and Data Analysis (2009)
- Pacheco et al., A variable selection method based on tabu search for logistic regression models, European Journal of Operational Research (2009)
- Pacheco et al., Analysis of new variable selection methods for discriminant analysis, Computational Statistics and Data Analysis (2006)
- Trendafilov and Jolliffe, DALASS: variable selection in discriminant analysis via the lasso, Computational Statistics and Data Analysis (2007)
- Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control (1974)
- Beale et al., The discarding of variables in multivariate analysis, Biometrika (1967)
- Breiman et al., Classification and Regression Trees (1984)
- Brusco and Cradit, A variable-selection heuristic for K-means clustering, Psychometrika (2001)