Analysis of new variable selection methods for discriminant analysis
Introduction
The aim in the classification problem is to classify instances that are characterized by attributes or variables; that is, to determine which class each instance belongs to. Based on a set of examples (whose class is known), a set of rules is designed and generalised to classify the set of instances with the greatest possible precision.
There are several methodologies for dealing with this problem: classic discriminant analysis, logistic regression, neural networks, decision trees, instance-based learning, etc. Most discriminant analysis methods search for hyperplanes in variable space that best separate the classes of instances. This translates into searching for linear functions and then using them for classification purposes (Wald, Fisher, etc.). The use of linear functions enables better interpretation of the results (e.g., the importance and/or significance of each variable in instance classification) by analysing the value of the coefficients obtained. Not every classification method is suited to this type of analysis, and in fact some are classified as “black box” models. Thus, classic discriminant analysis continues to be an interesting methodology.
When many variables are involved, only those variables that are really required should be selected before designing a classification method; that is, the first step is to eliminate the less significant variables from the analysis.
Thus, the problem consists of finding a subset of variables that can carry out this classification task in an optimum way. This problem is known as variable selection or feature selection. Research into this issue was started in the early 1960s by Lewis (1962) and Sebestyen (1962). According to Liu and Motoda (1998), feature selection provides advantages such as reducing the costs of data acquisition, better understanding of the final classification model, and an increase in the efficiency and efficacy of such a model. Extensive research into variable selection has been carried out over the past four decades. Many studies on variable selection are related to medicine and biology, such as Sierra et al. (2001), Ganster et al. (2001), Inza et al. (2000), Lee et al. (2003), Shy and Suganthan (2003), and Tamoto et al. (2004).
From a computational point of view, variable selection is an NP-hard problem (Kohavi, 1995, Cotta et al., 2004) and therefore there is no guarantee of finding the optimum solution. This means that when the size of the problem is large, finding an optimum solution in practice is unfeasible. Two different methodological approaches have been developed for variable selection problems: (a) the optimal or exact techniques (enumerative techniques), which are able to guarantee an optimal solution but are only applicable to small-sized sets; and (b) the heuristic techniques, which are able to find good solutions (although unable to guarantee the optimum) in a reasonable amount of time. Among the enumerative techniques, the Narendra and Fukunaga (1977) algorithm is one of the best known but, as pointed out by Jain and Zongker (1997), the algorithm is impractical for problems with very large feature sets. Recent references on implicit enumerative feature selection techniques adapted to regression models can be found in Gatu and Kontoghiorghes, 2003, Gatu and Kontoghiorghes, 2005, Gatu and Kontoghiorghes, 2006. On the other hand, the quality of ‘heuristic’ solutions varies strongly depending on the methods used. As found in other optimization problems, metaheuristic techniques have proved to be superior methodologies. For example, among the heuristic techniques we find the works based on genetic algorithms (see Bala et al., 1996, Jourdan et al., 2001, Oliveira et al., 2003; Inza et al., 2001a, Inza et al., 2001b; Wong and Nandi, 2004) and the recent work by García et al. (2006), who present a method based on Scatter Search.
These methods search for subsets with greater classification capacity based on different criteria. However, none of them focuses on the subsequent use of the selected variables in discriminant analysis. This work proposes some new “ad hoc” methods and compares the different variable selection methods for discriminant analysis. For this specific purpose the stepwise method (Efroymson, 1960) and all its variants, such as O’Gorman's (2004), as well as the backward and forward methods, can be found in the literature. These are simple selection procedures based on statistical criteria (Wilks' lambda, Fisher's F, etc.) which have been incorporated into some of the best-known statistical packages such as SPSS, BMDP, etc. As highlighted by Huberty (1994) and Salvador (2000), these methods are not very efficient, and when there are many original variables the optimum is rarely achieved. The methods proposed in this work yield significantly better results, as shown below.
The methods designed in this work are based on different metaheuristic techniques: GRASP, memetic algorithms, VNS, Tabu search and path relinking. Different tests were used to analyse and compare their efficacy with each other and with previous methods.
The remainder of this paper is organized as follows: the problem is modelled in Section 2; the GRASP procedure is described in Section 3 and the memetic algorithm in Section 4; the variable neighbourhood search procedure (VNS) is described in Section 5, and the Tabu search algorithm in Section 6. In Section 7, a modification for improving the robustness of the strategies is described and in Section 8 the results of the computational experiments are presented. Finally in Section 9 the main conclusions are offered.
Section snippets
Modelling the problem
We can formulate the problem of selecting the subset of variables with superior classification performance as follows: let V be a set of m variables, V = {x_1, ..., x_m}, and let A be a set of instances (also named the “training” set). For each case we also know the class it belongs to. Given a predefined value p < m, we have to find a subset S ⊂ V, with a size p, with the greatest classification capacity f(S) for the discriminant analysis.
To be precise, the function f(S) is defined as a percentage
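As a concrete illustration, a minimal stand-in for such a capacity function can be sketched as follows. This is an assumption for illustration only: it scores a subset S by the percentage of training instances correctly classified by a simple nearest-centroid rule restricted to the variables in S, not by the paper's actual discriminant function.

```python
def classification_capacity(S, X, y):
    """Toy f(S): percentage of training cases correctly classified by a
    nearest-centroid rule that only looks at the variables in S.
    X is a list of instances (lists of variable values), y their classes."""
    classes = sorted(set(y))
    # Per-class centroids restricted to the selected variables
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in S]
    hits = 0
    for x, label in zip(X, y):
        proj = [x[j] for j in S]
        pred = min(classes,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(proj, centroids[c])))
        hits += pred == label
    return 100.0 * hits / len(X)
```

On a toy training set where one variable separates the classes and another is noise, selecting only the informative variable yields a capacity of 100.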
GRASP
Greedy randomised adaptive search procedure (GRASP) is a heuristic that constructs solutions with controlled randomisation and a greedy function. Most GRASP implementations also include a local search that is used to improve the solutions generated with the randomised greedy function. GRASP was originally proposed in the context of a set covering problem (Feo and Resende, 1989). Details of the methodology and a survey of applications can be found in Feo and Resende (1995) and Pitsoulis and
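The two GRASP phases (randomised greedy construction from a restricted candidate list, followed by local search) can be sketched on a toy additive objective. All names here are illustrative; the additive weights stand in for each variable's contribution to the real classification capacity f(S), which is not additive in the paper.

```python
import random

def grasp(weights, p, alpha=0.3, iters=20, seed=0):
    """Illustrative GRASP for picking p of m variables; `weights` is a
    hypothetical stand-in for each variable's contribution to f(S)."""
    rng = random.Random(seed)
    m = len(weights)
    f = lambda S: sum(weights[j] for j in S)  # toy additive objective
    best = None
    for _ in range(iters):
        # Phase 1 - greedy randomised construction: pick at random from a
        # restricted candidate list (RCL) of the best remaining variables.
        S = set()
        while len(S) < p:
            cand = sorted((j for j in range(m) if j not in S),
                          key=lambda j: weights[j], reverse=True)
            rcl = cand[:max(1, int(alpha * len(cand)))]
            S.add(rng.choice(rcl))
        # Phase 2 - local search: swap a selected variable for a better one
        improved = True
        while improved:
            improved = False
            for i in list(S):
                for j in range(m):
                    if j not in S and weights[j] > weights[i]:
                        S.remove(i); S.add(j)
                        improved = True
                        break
                if improved:
                    break
        if best is None or f(S) > f(best):
            best = S
    return best, f(best)
```

With an additive objective the local search always reaches the top-p subset; with a real discriminant-based f(S) the swap moves would re-evaluate the classifier instead.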
Memetic Algorithms
Memetic algorithms are also population-based methods and have proved to be faster than genetic algorithms for certain types of problems (Moscato and Laguna, 1996). In brief, they combine local search procedures with crossover or mutation operators; due to their structure, some researchers have called them hybrid genetic algorithms, parallel genetic algorithms (PGAs) or genetic local search methods. The method is gaining wide acceptance particularly for the well-known problems of
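The defining combination (a genetic loop whose offspring are improved by local search) can be sketched as follows. This is a minimal illustration on the same toy additive objective, not the paper's actual operators; crossover here keeps the parents' common variables and fills the rest from their union.

```python
import random

def memetic(weights, p, pop_size=8, gens=15, seed=1):
    """Sketch of a memetic algorithm: crossover + mutation, with every
    offspring improved by local search. The additive objective is a toy
    stand-in for classification capacity f(S)."""
    rng = random.Random(seed)
    m = len(weights)
    f = lambda S: sum(weights[j] for j in S)

    def local_search(S):
        S = set(S)
        improved = True
        while improved:
            improved = False
            worst = min(S, key=lambda j: weights[j])
            better = [j for j in range(m)
                      if j not in S and weights[j] > weights[worst]]
            if better:
                S.remove(worst)
                S.add(max(better, key=lambda j: weights[j]))
                improved = True
        return S

    def crossover(a, b):
        # Keep common variables, complete the child from the parents' union
        child = set(a & b)
        pool = list((a | b) - child)
        rng.shuffle(pool)
        while len(child) < p:
            child.add(pool.pop())
        return child

    pop = [local_search(set(rng.sample(range(m), p))) for _ in range(pop_size)]
    for _ in range(gens):
        a, b = rng.sample(pop, 2)
        child = crossover(a, b)
        if rng.random() < 0.2:  # mutation: swap a random variable out
            j_in = rng.choice(sorted(child))
            j_out = rng.choice([j for j in range(m) if j not in child])
            child.remove(j_in); child.add(j_out)
        child = local_search(child)
        pop.sort(key=f)          # replace the worst member if the child is better
        if f(child) > f(pop[0]):
            pop[0] = child
    best = max(pop, key=f)
    return best, f(best)
```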
VNS (Variable neighbourhood search)
VNS is a recent metaheuristic for solving optimization problems. Its basic idea is the systematic change of neighbourhood within a local search (Hansen and Mladenovic, 1998, Hansen and Mladenovic, 1999). Two recent tutorials were published by Hansen and Mladenovic, 2002, Hansen and Mladenovic, 2003. More information is available at: http://vnsheuristic.ull.es. The VNS procedure is as follows:
  Read initial solution S
  Repeat
    Repeat
      - Randomly choose
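The systematic change of neighbourhood within a local search might be sketched like this: shake the incumbent in a size-k neighbourhood (replace k selected variables at random), improve the result by local search, then move to it and reset k on improvement, or widen the neighbourhood otherwise. The additive objective and all names are illustrative stand-ins.

```python
import random

def vns(weights, p, k_max=3, iters=20, seed=2):
    """Illustrative VNS for selecting p of m variables on a toy
    additive objective (a stand-in for the real f(S))."""
    rng = random.Random(seed)
    m = len(weights)
    f = lambda S: sum(weights[j] for j in S)

    def shake(S, k):
        # Neighbourhood N_k: replace k selected variables at random
        k = min(k, len(S), m - len(S))
        S = set(S)
        removed = rng.sample(sorted(S), k)
        pool = [j for j in range(m) if j not in S]
        added = rng.sample(pool, k)
        return (S - set(removed)) | set(added)

    def local_search(S):
        S = set(S)
        improved = True
        while improved:
            improved = False
            for i in sorted(S):
                better = [j for j in range(m)
                          if j not in S and weights[j] > weights[i]]
                if better:
                    S.remove(i)
                    S.add(max(better, key=lambda j: weights[j]))
                    improved = True
                    break
        return S

    S = local_search(set(rng.sample(range(m), p)))
    for _ in range(iters):
        k = 1
        while k <= k_max:
            S2 = local_search(shake(S, k))
            if f(S2) > f(S):
                S, k = S2, 1   # move and restart from the smallest neighbourhood
            else:
                k += 1         # widen the neighbourhood
    return S, f(S)
```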
Description of a basic algorithm
Tabu search (TS) is a strategy proposed by Glover, 1989, Glover, 1990. “TS is dramatically changing our possibilities of solving a host of combinatorial problems in different areas” (Glover and Laguna, 2002). This procedure explores the solution space beyond the local optimum. Once a local optimum is reached, upward moves and moves that worsen the solution are allowed. Simultaneously, the last moves are marked as tabu during the following iterations to avoid cycling. Recent and comprehensive
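A basic TS loop along these lines can be sketched as follows: always take the best swap move, even when it worsens the objective, while forbidding the re-entry of recently removed variables; an aspiration criterion overrides the tabu status of any move that would beat the best solution found so far. The additive objective and every name here are illustrative stand-ins, not the paper's implementation.

```python
from collections import deque
import random

def tabu_search(weights, p, tenure=2, iters=30, seed=3):
    """Illustrative tabu search for selecting p of m variables on a toy
    additive objective (a stand-in for the real f(S))."""
    rng = random.Random(seed)
    m = len(weights)
    f = lambda S: sum(weights[j] for j in S)
    S = set(rng.sample(range(m), p))
    best, best_val = set(S), f(S)
    tabu = deque(maxlen=tenure)  # variables that may not re-enter yet
    for _ in range(iters):
        moves = [(i, j) for i in S for j in range(m) if j not in S]
        move_val = lambda mv: f(S) - weights[mv[0]] + weights[mv[1]]
        # Aspiration: a tabu move is allowed if it beats the best so far
        allowed = [mv for mv in moves
                   if mv[1] not in tabu or move_val(mv) > best_val]
        if not allowed:
            continue
        i, j = max(allowed, key=move_val)  # best move, improving or not
        S.remove(i); S.add(j)
        tabu.append(i)                     # forbid bringing i straight back
        if f(S) > best_val:
            best, best_val = set(S), f(S)
    return best, best_val
```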
Use of a validation set for improving the robustness in the strategies
It has been observed that the metaheuristic strategies described before (GRASP, memetic algorithms, VNS and TS) focus more on the optimization point of view than on the statistical point of view (i.e., on generalization). Because of this, with the aim of increasing the robustness of the strategies, a new set of instances (the “validation” set) is taken and used in the following way: a solution is only admitted as the new best solution if the number of fits in the validation set does not get
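The acceptance rule described above could be expressed as a small predicate. Since the snippet is truncated, the exact condition is an assumption: here a candidate replaces the incumbent only if it improves the training fit without losing fits on the held-out validation set, and all names are illustrative.

```python
def accept_as_best(cand_train_fit, cand_val_fit, best_train_fit, best_val_fit):
    """Hypothetical validation-gated acceptance: the candidate must beat
    the incumbent on the training set AND not do worse on the validation
    set (fit counts or percentages)."""
    return (cand_train_fit > best_train_fit
            and cand_val_fit >= best_val_fit)
```

Any of the metaheuristics above would call such a check before updating its best solution, which trades a little optimization pressure for robustness against overfitting the training set.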
Computational results
To check and compare the efficacy of the different methods, a series of experiments was run with different test problems. We selected data sets with enough instances for building large training sets (at least 10 cases for every degree of freedom), a validation set and 10 test sets from every data set. Using large training sets is recommended to obtain a trade-off between “optimization” and “generalization”. Six data sets were used. These data sets can be found in the well-known data repository
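A split along the lines described could be sketched as follows. The sizing rule is an assumption for illustration (training size taken as 10 cases per variable as a proxy for "10 cases per degree of freedom"), and all names are hypothetical.

```python
import random

def make_splits(n_instances, n_variables, n_test_sets=10, seed=4):
    """Illustrative partition of a data set's indices into one training
    set, one validation set, and n_test_sets disjoint test sets."""
    rng = random.Random(seed)
    idx = list(range(n_instances))
    rng.shuffle(idx)
    n_train = 10 * n_variables          # assumed sizing rule
    train = idx[:n_train]
    n_val = (n_instances - n_train) // (n_test_sets + 1)
    val = idx[n_train:n_train + n_val]
    rest = idx[n_train + n_val:]
    tests = [rest[i::n_test_sets] for i in range(n_test_sets)]
    return train, val, tests
```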
Conclusions
This work approaches the problem of variable selection for discriminant analysis. Although there are many references in the literature regarding selecting variables for their use in classification, there are very few key references on the selection of variables for their use in discriminant analysis. In fact, the most well-known statistical packages continue to use classic selection methods. In this work we propose as an alternative new methods based on various metaheuristic strategies. All of
Acknowledgements
The authors are grateful for financial support from the Spanish Ministry of Education and Science (National Plan of R&D - Projects SEJ2005-08923/ECON and SEJ2004-08176-02-01/ECON).
References (43)
- Bala, J., et al., 1996. Using learning to facilitate the evolution of features for recognizing visual concepts. Evol. Comput.
- Cotta, C., Sloper, C., Moscato, P., 2004. Evolutionary search of thresholds for robust feature set selection:...
- Efroymson, M.A., 1960. Multiple regression analysis.
- Feo, T.A., Resende, M.G.C., 1989. A probabilistic heuristic for a computationally difficult set covering problem. Oper. Res. Lett.
- Feo, T.A., Resende, M.G.C., 1995. Greedy randomized adaptive search procedures. J. Global Optim.
- Ganster, H., et al., 2001. Automated melanoma recognition. IEEE Trans. Med. Imaging.
- García, F.C., et al., 2006. Solving feature selection problem by a parallel scatter search. European J. Oper. Res.
- Gatu, C., Kontoghiorghes, E.J., 2003. Parallel algorithms for computing all possible subset regression models using the QR decomposition. Parallel Comput.
- Gatu, C., Kontoghiorghes, E.J., 2005. Efficient strategies for deriving the subset VAR models. Comput. Manag. Sci.
- Gatu, C., Kontoghiorghes, E.J., 2006. Branch-and-bound algorithms for computing the best-subset regression models. J. Comput. Graph. Stat.
- Glover, F., 1989. Tabu search. Part I. ORSA J. Comput.
- Glover, F., 1990. Tabu search. Part II. ORSA J. Comput.
- Glover, F., Laguna, M., 1997. Tabu Search.
- Glover, F., Laguna, M., 2002. Tabu search.
- Glover, F., Laguna, M., Martí, R., 2000. Fundamentals of scatter search and path relinking. Control Cybernet.
- Hansen, P., Mladenovic, N., 1999. An introduction to variable neighborhood search.
- Inza, I., et al., 2000. Feature subset selection by Bayesian networks based optimization. Artif. Intell.
- Inza, I., et al., 2001. Feature subset selection by genetic algorithms and estimation of distribution algorithms: a case study in the survival of cirrhotic patients treated with TIPS. Artif. Intell. Med.
- Inza, I., et al., 2001. Feature subset selection by Bayesian networks: a comparison with genetic and sequential algorithms. Int. J. Approx. Reason.
- Wong, M.L.D., Nandi, A.K., 2004. Automatic digital modulation recognition using artificial neural network and genetic algorithm. Signal Process.