Analysis of new variable selection methods for discriminant analysis

https://doi.org/10.1016/j.csda.2006.04.019

Abstract

Several methods to select variables that are subsequently used in discriminant analysis are proposed and analysed. The aim is to find, from among a set of m variables, a smaller subset which enables an efficient classification of cases. Reducing dimensionality has some advantages, such as reducing the costs of data acquisition, better understanding of the final classification model, and an increase in the efficiency and efficacy of the model itself. The specific problem consists of finding, for a small integer value of p, the size-p subset of original variables that yields the greatest percentage of hits in the discriminant analysis. To solve this problem a series of techniques based on metaheuristic strategies is proposed. After performing some tests, it is found that they obtain significantly better results than the stepwise, backward or forward methods used by classic statistical packages. The way these methods work is illustrated with several examples.

Introduction

The aim in the classification problem is to classify instances that are characterized by attributes or variables; that is, to determine which class every instance belongs to. Based on a set of examples whose class is known, a set of rules is designed and generalised to classify the set of instances with the greatest possible precision.

There are several methodologies for dealing with this problem: classic discriminant analysis, logistic regression, neural networks, decision trees, instance-based learning, etc. Most discriminant analysis methods search for hyperplanes in the variable space that best separate the classes of instances. This translates into searching for linear functions and then using them for classification purposes (Wald, Fisher, etc.). The use of linear functions enables better interpretation of the results (e.g., the importance and/or significance of each variable in instance classification) by analysing the values of the coefficients obtained. Not every classification method is suited to this type of analysis, and in fact some are classified as “black box” models. Thus, classic discriminant analysis continues to be an interesting methodology.

When many variables are involved, only those variables that are really required should be selected before designing a classification method; that is, the first step is to eliminate the less significant variables from the analysis.

Thus, the problem consists of finding a subset of variables that can carry out this classification task in an optimum way. This problem is known as variable selection or feature selection. Research into this issue was started in the early 1960s by Lewis (1962) and Sebestyen (1962). According to Liu and Motoda (1998), feature selection provides some advantages, such as reducing the costs of data acquisition, better understanding of the final classification model, and an increase in the efficiency and efficacy of such a model. Extensive research into variable selection has been carried out over the past four decades. Many studies on variable selection are related to medicine and biology, such as Sierra et al. (2001), Ganster et al. (2001), Inza et al. (2000), Lee et al. (2003), Shy and Suganthan (2003), and Tamoto et al. (2004).

From a computational point of view, variable selection is an NP-hard problem (Kohavi, 1995; Cotta et al., 2004) and therefore there is no guarantee of finding the optimum solution (NP = nondeterministic polynomial time). This means that when the size of the problem is large, finding an optimum solution in practice is unfeasible. Two different methodological approaches have been developed for variable selection problems: (a) the optimal or exact techniques (enumerative techniques), which are able to guarantee an optimal solution but are only applicable to small-sized sets; and (b) the heuristic techniques, which are able to find good solutions (although unable to guarantee the optimum) in a reasonable amount of time. Among the enumerative techniques, the Narendra and Fukunaga (1977) algorithm is one of the best known, but, as pointed out by Jain and Zongker (1997), it is impractical for problems with very large feature sets. Recent references on implicit enumerative feature selection techniques adapted to regression models can be found in Gatu and Kontoghiorghes (2003, 2005, 2006). On the other hand, the quality of heuristic solutions varies strongly depending on the method used. As found in other optimization problems, metaheuristic techniques have proved to be superior methodologies. For example, among the heuristic techniques we find works based on genetic algorithms (see Bala et al., 1996; Jourdan et al., 2001; Oliveira et al., 2003; Inza et al., 2001a, 2001b; Wong and Nandi, 2004) and the recent work by García et al. (2006), who present a method based on scatter search.

These methods search for subsets with greater classification capacity based on different criteria. However, none of them focuses on the posterior use of the selected variables in discriminant analysis. This work proposes some new “ad hoc” methods and compares the different variable selection methods for discriminant analysis. For this specific purpose, the stepwise method (Efroymson, 1960) and all its variants, such as O'Gorman's (2004), as well as the backward and forward methods, can be found in the literature. These are simple selection procedures based on statistical criteria (Wilks' lambda, Fisher's F, etc.) which have been incorporated into some of the best known statistical packages such as SPSS, BMDP, etc. As highlighted by Huberty (1994) and Salvador (2000), these methods are not very efficient, and when there are many original variables the optimum is rarely achieved. The methods proposed in this work yield significantly better results, as shown below.
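For reference, the forward method mentioned above amounts to a simple greedy loop. The following Python sketch is ours, not taken from any package; the score function stands in for whatever statistical criterion is applied (Wilks' lambda, Fisher's F, classification accuracy, etc.):

def forward_selection(score, m, p):
    """Classic forward selection sketch: greedily add, one at a time,
    the variable that most improves the criterion until p are selected."""
    selected = set()
    while len(selected) < p:
        best_v = max((v for v in range(m) if v not in selected),
                     key=lambda v: score(selected | {v}))
        selected.add(best_v)
    return selected

Backward elimination is the mirror image (start from all m variables and greedily drop the least useful one), and stepwise alternates additions with removals.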

The methods designed in this work are based on different metaheuristic techniques, namely GRASP, memetic algorithms, VNS, Tabu search and path relinking. Different tests were used to analyse and compare their efficacy with each other and with previous methods.

The remainder of this paper is organized as follows: the problem is modelled in Section 2; the GRASP procedure is described in Section 3 and the memetic algorithm in Section 4; the variable neighbourhood search procedure (VNS) is described in Section 5, and the Tabu search algorithm in Section 6. In Section 7, a modification for improving the robustness of the strategies is described, and in Section 8 the results of the computational experiments are presented. Finally, in Section 9, the main conclusions are offered.


Modelling the problem

We can formulate the problem of selecting the subset of variables with the best classification performance as follows. Let V be a set of m variables, V = {1, 2, …, m}, and let A be a set of instances (also called the “training” set); for each instance we also know the class it belongs to. Given a predefined value p ∈ ℕ, p < m, we have to find a subset S ⊂ V of size p with the greatest classification capacity for the discriminant analysis, f(S).

To be precise, the function f(S) is defined as the percentage of hits, i.e. the percentage of correctly classified cases obtained by the discriminant analysis when only the variables in S are used.
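For concreteness, f(S) can be evaluated with any off-the-shelf linear discriminant classifier. The following sketch is ours: the helper name is an assumption, and scikit-learn's LinearDiscriminantAnalysis is used merely as a stand-in for the classic discriminant functions of the paper.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def hit_percentage(S, X_train, y_train, X_eval, y_eval):
    """f(S): percentage of correctly classified cases when the
    discriminant classifier only sees the variables indexed by S."""
    cols = sorted(S)                        # S is a set of column indices
    lda = LinearDiscriminantAnalysis().fit(X_train[:, cols], y_train)
    return 100.0 * lda.score(X_eval[:, cols], y_eval)

Here X_train and X_eval are NumPy arrays of cases by variables; each metaheuristic below only needs such a black-box objective f.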

GRASP

Greedy randomised adaptive search procedure (GRASP) is a heuristic that constructs solutions with controlled randomisation and a greedy function. Most GRASP implementations also include a local search that is used to improve the solutions generated with the randomised greedy function. GRASP was originally proposed in the context of a set covering problem (Feo and Resende, 1989). Details of the methodology and a survey of applications can be found in Feo and Resende (1995) and Pitsoulis and
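A minimal sketch of the two GRASP phases for this problem (restricted-candidate-list construction, then a swap local search) might look as follows; the parameter names and the alpha rule are our assumptions, not the paper's exact design:

import random

def swap_local_search(f, S, m):
    """First-improvement local search: exchange one selected variable
    for one unselected variable while the objective improves."""
    S, val = set(S), f(frozenset(S))
    improved = True
    while improved:
        improved = False
        for i in list(S):
            for j in [v for v in range(m) if v not in S]:
                T = (S - {i}) | {j}
                t_val = f(frozenset(T))
                if t_val > val:
                    S, val, improved = T, t_val, True
                    break
            if improved:
                break
    return S, val

def grasp(f, m, p, iters=100, alpha=0.3, seed=0):
    """GRASP sketch: randomised greedy construction plus local search."""
    rng = random.Random(seed)
    best, best_val = None, float("-inf")
    for _ in range(iters):
        S = set()
        while len(S) < p:
            # Rank candidates by marginal value and draw at random from
            # the restricted candidate list (the best alpha-fraction).
            cand = sorted(((f(frozenset(S | {v})), v)
                           for v in range(m) if v not in S), reverse=True)
            rcl = cand[:max(1, int(alpha * len(cand)))]
            S.add(rng.choice(rcl)[1])
        S, val = swap_local_search(f, S, m)
        if val > best_val:
            best, best_val = set(S), val
    return best, best_val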

Memetic Algorithms

Memetic algorithms are also population-based methods and have proved to be faster than genetic algorithms for certain types of problems (Moscato and Laguna, 1996). In brief, they combine local search procedures with crossover and mutation operators; owing to their structure, some researchers have called them hybrid genetic algorithms, parallel genetic algorithms (PGAs) or genetic local search methods. The method is gaining wide acceptance particularly for the well-known problems of
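In outline, the memetic scheme evolves a population of size-p subsets and applies a short local improvement to every offspring. The sketch below is a generic illustration under our own operator choices (tournament selection, union-based crossover, single-swap mutation), not the paper's exact algorithm:

import random

def improve_once(f, S, m):
    """One pass of first-improvement swap local search (kept short here)."""
    S, val = set(S), f(frozenset(S))
    for i in sorted(S):
        for j in range(m):
            if j not in S:
                T = (S - {i}) | {j}
                if f(frozenset(T)) > val:
                    return frozenset(T)
    return frozenset(S)

def memetic(f, m, p, pop_size=20, generations=50, mut_prob=0.2, seed=0):
    """Memetic sketch: a GA whose offspring are locally improved before
    competing for a place in the population."""
    rng = random.Random(seed)
    pop = [frozenset(rng.sample(range(m), p)) for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection of two parents.
        a = max(rng.sample(pop, 3), key=f)
        b = max(rng.sample(pop, 3), key=f)
        # Crossover: keep the common variables, complete from the union.
        child = set(a & b)
        rest = list((a | b) - child)
        rng.shuffle(rest)
        child |= set(rest[:p - len(child)])
        # Mutation: occasionally swap one variable for an outside one.
        if rng.random() < mut_prob:
            child.remove(rng.choice(sorted(child)))
            child.add(rng.choice([v for v in range(m) if v not in child]))
        child = improve_once(f, child, m)
        # Replacement: the offspring displaces the worst member if better.
        worst = min(pop, key=f)
        if f(child) > f(worst):
            pop[pop.index(worst)] = child
    return max(pop, key=f)

(For brevity the sketch re-evaluates f instead of caching fitness values.)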

VNS (Variable neighbourhood search)

VNS is a recent metaheuristic for solving optimization problems. Its basic idea is the systematic change of neighbourhood within a local search (Hansen and Mladenovic, 1998, 1999). Two recent tutorials were published by Hansen and Mladenovic (2002, 2003). More information is available at http://vnsheuristic.ull.es. The VNS procedure is as follows:

Variable_neighbourhood_search_procedure
Read initial solution S
Repeat
  k = 0
  Repeat
    - Do k = k + 1 and S′ = S
    - Randomly choose
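The snippet above is cut short; in full VNS the point randomly chosen in the kth neighbourhood is improved by local search and accepted if it beats the incumbent. A compact reduced-VNS sketch for this problem (shaking only, with our own neighbourhood definition: N_k exchanges k selected variables for k unselected ones) could read:

import random

def vns(f, m, p, k_max=3, max_iter=100, seed=0):
    """Reduced-VNS sketch: shake in growing neighbourhoods, recentre on
    improvement; a full VNS would also run a local search after shaking."""
    rng = random.Random(seed)
    S = set(rng.sample(range(m), p))
    best_val = f(frozenset(S))
    for _ in range(max_iter):
        k = 1
        while k <= k_max:
            # Shaking: random solution in the k-th neighbourhood of S.
            out = rng.sample(sorted(S), k)
            inn = rng.sample([v for v in range(m) if v not in S], k)
            T = (S - set(out)) | set(inn)
            t_val = f(frozenset(T))
            if t_val > best_val:
                S, best_val, k = T, t_val, 1   # recentre, restart at N_1
            else:
                k += 1                         # move to a larger neighbourhood
    return S, best_val

This assumes m - p >= k_max, so the shaking step can always draw k outside variables.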

Description of a basic algorithm

Tabu search (TS) is a strategy proposed by Glover (1989, 1990). “TS is dramatically changing our possibilities of solving a host of combinatorial problems in different areas” (Glover and Laguna, 2002). This procedure explores the solution space beyond the local optimum: once a local optimum is reached, moves that worsen the solution are allowed, while the most recent moves are marked as tabu during the following iterations to avoid cycling. Recent and comprehensive
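For this problem, a basic TS iteration can scan all swap moves, take the best one that is not tabu (accepting it even if it worsens f), and forbid the dropped variable from re-entering for a few iterations. The sketch below is a generic illustration with our own tenure and aspiration choices, not the paper's exact design:

import random
from collections import deque

def tabu_search(f, m, p, tenure=7, max_iter=200, seed=0):
    """Tabu-search sketch: best admissible swap at each iteration; a
    variable that just left the solution may not re-enter for `tenure`
    iterations unless the move beats the best value known (aspiration)."""
    rng = random.Random(seed)
    S = set(rng.sample(range(m), p))
    best, best_val = set(S), f(frozenset(S))
    tabu = deque(maxlen=tenure)              # recently removed variables
    for _ in range(max_iter):
        move, move_val = None, float("-inf")
        for i in sorted(S):
            for j in range(m):
                if j in S:
                    continue
                t_val = f(frozenset((S - {i}) | {j}))
                admissible = j not in tabu or t_val > best_val
                if admissible and t_val > move_val:
                    move, move_val = (i, j), t_val
        if move is None:
            break                            # every move is currently tabu
        i, j = move
        S = (S - {i}) | {j}                  # accept even a worsening move
        tabu.append(i)                       # i is tabu for `tenure` iterations
        if move_val > best_val:
            best, best_val = set(S), move_val
    return best, best_val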

Use of a validation set for improving the robustness of the strategies

It has been observed that the metaheuristic strategies described above (GRASP, memetic algorithms, VNS and TS) focus more on the optimization point of view than on the statistical point of view (i.e., on generalization). Because of this, and with the aim of increasing the robustness of the strategies, a new set of instances (the “validation” set) is taken and used in the following way: a solution is only admitted as the new best solution if the number of hits on the validation set does not get worse.
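Read this way, the acceptance test combines both sets. A short sketch of the rule (function names are ours, and the exact condition is our reading of the text above):

def accept_as_new_best(candidate, incumbent, f_train, f_valid):
    """Acceptance-rule sketch: the candidate replaces the incumbent only
    if it improves the training objective without losing hits on the
    validation set."""
    return (f_train(candidate) > f_train(incumbent)
            and f_valid(candidate) >= f_valid(incumbent))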

Computational results

To check and compare the efficacy of the different methods, a series of experiments was run with different test problems. We selected data sets with enough instances to build large training sets (at least 10 cases per degree of freedom), a validation set and 10 test sets from every data set. Using large training sets is recommended to obtain a trade-off between “optimization” and “generalization”. Six data sets were used. These data sets can be found in the well-known data repository

Conclusions

This work approaches the problem of variable selection for discriminant analysis. Although there are many references in the literature regarding selecting variables for their use in classification, there are very few key references on the selection of variables for their use in discriminant analysis. In fact, the most well-known statistical packages continue to use classic selection methods. In this work we propose as an alternative new methods based on various metaheuristic strategies. All of

Acknowledgements

The authors are grateful for financial support from the Spanish Ministry of Education and Science (National Plan of R&D, Projects SEJ2005-08923/ECON and SEJ2004-08176-02-01/ECON).

References

  • H. Ganster et al., Automated melanoma recognition, IEEE Trans. Med. Imaging (2001)
  • F.C. García et al., Solving feature selection problem by a parallel scatter search, European J. Oper. Res. (2006)
  • C. Gatu et al., Efficient strategies for deriving the subset VAR models, Comput. Manag. Sci. (2005)
  • C. Gatu et al., Branch-and-bound algorithms for computing the best-subset regression models, J. Comput. Graph. Stat. (2006)
  • F. Glover, Tabu search. Part I, ORSA J. Comput. (1989)
  • F. Glover, Tabu search. Part II, ORSA J. Comput. (1990)
  • F. Glover et al., Tabu Search (1997)
  • F. Glover et al., Tabu search (2002)
  • F. Glover et al., Fundamentals of scatter search and path relinking, Control Cybernet. (2000)
  • P. Hansen et al., An introduction to variable neighborhood search
  • P. Hansen, N. Mladenovic, First improvement may be better than best improvement: an empirical study, Les Cahiers... (1999)