A new local search based hybrid genetic algorithm for feature selection
Introduction
The recent trend of high-dimensional data collection and problem representation demands the use of feature selection (FS) in many machine learning tasks. Real-world datasets generally contain a large number of irrelevant and/or redundant features that may significantly degrade the accuracy of learned models and reduce the learning speed. FS addresses this by finding a subset of salient features that improves predictive accuracy and by removing the useless features. The learning model built from only the selected salient features thus receives a concise structure without sacrificing predictive accuracy. FS is therefore an active research area in machine learning. It also provides other benefits, such as data visualization and data understanding, and reduced measurement and storage requirements [15].
Traditional FS techniques can be broadly categorized into three approaches: filter, wrapper, and hybrid [32]. The filter approach solves the FS task through statistical analysis of the feature set alone, without utilizing any learning model [10]. The wrapper approach involves a predetermined learning model and selects features by measuring the learning performance of that model [15]. The hybrid approach attempts to take advantage of both the filter and wrapper approaches [18], [51]. It is often found that a hybrid technique is capable of locating a good solution, while a single technique often becomes trapped in an immature solution.
In addition, a new category of FS approach has recently been proposed: the embedded approach, in which the FS process is integrated with classifier construction (e.g., SIMBA [61], SVM-RFE [17], L1-regularized learning [54]). The performance of this approach is similar to that of the wrapper approach, since its main concern is the interaction between feature selection and classification.
The success of the different approaches depends mainly on adopting a fruitful search strategy in the FS process. Accordingly, different approaches use different ways to generate subsets and drive the search. One way is to start the search with an empty set and successively add features (e.g., [14], [43]); this is called sequential forward search (SFS). Another is to start with the full set and successively remove features [1], [12], [19], [49], [52]; this is called sequential backward search (SBS). This sequential strategy is simple to implement and fast, but it suffers from the nesting effect [44]: once a feature is added (or deleted), it cannot be deleted (or added) later. To overcome this effect, the floating search strategy [44] modifies the sequential search strategy. Alternatively, one can start the search from a randomly selected subset (e.g., [33], [53]) combined with a sequential strategy, which is called the random search strategy. A good solution can also be found using a complete search strategy [10], [32], since it covers the whole feature space. However, such a strategy is not feasible for larger feature sets, since it requires too much time.
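To make the greedy sequential strategy (and its nesting effect) concrete, the following is a minimal sketch of SFS under our own assumptions: a user-supplied `evaluate(subset)` scores a candidate subset (e.g., cross-validated classifier accuracy); the toy score below is purely illustrative.

```python
def sfs(n_features, evaluate, k):
    """Greedily grow a subset from empty up to k features (forward search).

    Note the nesting effect: once a feature is added it is never removed.
    """
    selected = []
    remaining = set(range(n_features))
    while len(selected) < k and remaining:
        # Add the single feature that most improves the score.
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: reward features in a hypothetical "relevant" set,
# with a small penalty per selected feature.
relevant = {1, 3, 4}
score = lambda subset: len(set(subset) & relevant) - 0.01 * len(subset)
print(sorted(sfs(6, score, 3)))  # the three relevant features: [1, 3, 4]
```

SBS is the mirror image: start from the full set and greedily drop the feature whose removal hurts the score least.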
Most of the search strategies discussed above, however, tend to find solutions that range between sub-optimal and near-optimal, since they search locally rather than globally. Reaching an optimal or near-optimal solution is quite difficult for these algorithms, because they either search only part of the solution space or suffer from high computational complexity. Therefore, the recent trend of research has shifted towards global search algorithms (or meta-heuristics). These find solutions in the full search space through their global search capability while making suitable use of local search. Global search algorithms work on the basis of the activity of multiple agents, which ultimately helps to find very high-quality solutions within a reasonable time [9]. Among global search algorithms such as ant colony optimization (ACO) [3], [27], [51] and particle swarm optimization (PSO) [58], the genetic algorithm (GA) [8], [18], [40], [41], [56], [61], [62] is one that can successfully solve FS tasks.
In this paper, we propose a new hybrid genetic algorithm (HGA) for feature selection, called HGAFS. The proposed idea hybridizes the GA by integrating new local search operations (LSOs). Embedding such operations in the GA fine-tunes the search process for FS in an organized fashion. HGAFS combines FS with determining the size of the subset in a reduced form. It uses a bounded random selection scheme involving correlation information in the LSOs for selecting salient features. The idea implemented here is an extension of our earlier work [24]. Our algorithm, HGAFS, differs from previous works (e.g., [8], [18], [40], [62]) on selecting salient features from a given dataset in two aspects.
First, HGAFS emphasizes not only selecting a number of salient features, but also attaining a reduced number. HGAFS selects a reduced number of salient features using a subset size determination scheme. This scheme works within a bounded region and tries to keep the subset size small. Thus, the distribution of the number of 1-bits in the individual strings of a population in HGAFS is maintained according to the subset size determined here. This approach is quite different from existing works (e.g., [8], [18], [35], [40], [61], [62]), where the most common practice is to choose the 1-bits using an unbounded random function and then select relevant features using the GA. Although finding relevant features using a GA is a good step, the unbounded random function hampers the FS process, because it may yield either too low or too high a value. If the value is too high, the search space becomes larger, computational time increases greatly, and ultimately the least significant features might be selected. If the number of 1-bits is too low, on the other hand, the search process cannot be completed properly. Thus, selecting a subset of salient features with a reduced size provides a novel approach to FS using GAs.
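The bounded 1-bit idea can be sketched as follows. This is our own illustrative reading, not the paper's exact scheme: each chromosome is a 0/1 mask over the features, and the number of 1-bits is drawn from a bounded range `[lo, hi]` instead of an unbounded random function; the names and bounds are hypothetical.

```python
import random

def init_population(pop_size, n_features, lo, hi, rng):
    """Create chromosomes whose 1-bit counts stay inside [lo, hi]."""
    population = []
    for _ in range(pop_size):
        k = rng.randint(lo, hi)                 # bounded subset size
        ones = rng.sample(range(n_features), k) # which features are on
        chrom = [0] * n_features
        for i in ones:
            chrom[i] = 1
        population.append(chrom)
    return population

pop = init_population(pop_size=20, n_features=30, lo=3, hi=8,
                      rng=random.Random(0))
# Every candidate subset stays small, so the search space stays tractable.
assert all(3 <= sum(c) <= 8 for c in pop)
```

An unbounded alternative (each bit set with probability 0.5) would instead concentrate 1-bit counts around 15 of 30 features, which is exactly the "too high" regime described above.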
Second, HGAFS uses correlation information in conjunction with the bounded scheme to select a subset of relevant features. The aim of using the correlation information of features is to guide the search process of the GA in such a way that relatively less correlated (distinct) features are injected into consecutive generations in a higher proportion than more correlated (similar) features. Note that the correlation information guides the search process only within the GA, while the neural networks (NNs) assist in carrying out the genetic process of the GA. The existing FS approaches (e.g., [8], [18], [40], [62]) do not use correlation information to guide the search process; consequently, redundant information might increase due to the selection of correlated features in their solutions.
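A rough sketch of correlation-guided injection, under our own assumptions (the weighting `1 - redundancy` and all names are hypothetical, not the paper's exact operator): candidates that are weakly correlated with the already-selected features are drawn with higher probability.

```python
import numpy as np

def pick_distinct(X, selected, candidates, rng):
    """Prefer candidates weakly correlated with the current subset.

    X: (samples, features) data matrix; selected/candidates: feature indices.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    if not selected:
        return rng.choice(candidates)
    # Average absolute correlation with the selected subset; lower is better.
    redundancy = np.array([corr[c, selected].mean() for c in candidates])
    weights = np.clip(1.0 - redundancy, 1e-6, None)
    return rng.choice(candidates, p=weights / weights.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)  # feature 1 ~ copy of 0
# With feature 0 selected, feature 2 (distinct) is drawn far more often
# than feature 1 (redundant).
picks = [pick_distinct(X, [0], [1, 2], rng) for _ in range(200)]
```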
In addition, there are approaches such as RELIEF [29], I-RELIEF [55], and SIMBA [5] that assign a weight to each feature and select a reduced number of relevant features. On close observation, these methods are in fact filter methods, except for SIMBA, while our proposed HGAFS is a wrapper-based feature selection algorithm using a GA and NNs. It is well known that wrapper methods generally outperform filter methods for feature selection [10], [32]. SIMBA, on the other hand, involves the classifier in selecting salient features but fails to provide a global search.
Recently, a constructive approach for FS, CAFS, has been proposed [25]. This approach automatically selects the relevant features using a correlation information-based sequential search strategy and determines appropriate architectures for the NNs during training. One major disadvantage of this approach is that CAFS may suffer from the nesting effect, since it uses a sequential approach to select a set of salient features. One efficient way to avoid this effect is to incorporate a global search strategy such as a GA [25]. We utilize such a correlation-based search strategy in the GA as a local search operation, which ultimately produces a significant performance gain for FS in HGAFS.
The rest of this paper is organized as follows. Section 2 reviews the literature on existing FS work. Detailed discussions of HGAFS, including its computational complexity, can be found in Section 3. Section 4 presents the results of our experimental studies, including the experimental methodology, experimental results, and comparisons with other existing FS algorithms. Finally, Section 5 discusses our algorithm along with future directions, and Section 6 concludes the paper with a brief summary.
Section snippets
A review of literature
The performance of any FS task is greatly dependent on the search technique in finding the salient features from a given dataset [35]. Among numerous FS algorithms, most are involved with either sequential search [1], [7], [12], [14], [17], [19], [44], [45], [50], [52], [57], [59] or global search technique [3], [8], [18], [27], [31], [35], [40], [41], [51], [56], [58], [61], [62]. In contrast, guiding the search strategies and evaluating the generated subset, the existing FS algorithms can be
Proposed HGAFS
A GA provides genetic search, which belongs to the global search strategies for finding an optimal solution to a given problem. In FS tasks, GAs provide better solutions but are affected by two shortcomings: premature convergence and weakness in fine-tuning near local optimum points [18], [40]. To overcome these weaknesses, hybridizing the GA, i.e., incorporating domain-specific knowledge into the GA, is an active research topic nowadays.
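As a rough illustration of such hybridization, a memetic-GA skeleton might look like the following. All operators here (truncation selection, one-point crossover, bit-flip mutation) are generic placeholders of our own choosing, not the paper's exact LSOs; the demo problem is the standard "onemax" toy task.

```python
import random

def hybrid_ga(fitness, local_search, init, generations=50,
              pop_size=20, rng=random.Random(0)):
    """Generic GA loop with a local-search step applied to each offspring."""
    pop = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))
            child = a[:cut] + b[cut:]          # one-point crossover
            if rng.random() < 0.1:             # occasional bit-flip mutation
                i = rng.randrange(len(child))
                child = child[:i] + [1 - child[i]] + child[i + 1:]
            children.append(local_search(child))  # fine-tune the offspring
        pop = parents + children
    return max(pop, key=fitness)

# Toy demo: maximize the number of 1-bits; the local search greedily
# flips the first 0-bit, mimicking a fine-tuning step near an optimum.
def flip_first_zero(chrom):
    chrom = list(chrom)
    if 0 in chrom:
        chrom[chrom.index(0)] = 1
    return chrom

best = hybrid_ga(sum, flip_first_zero,
                 lambda rng: [rng.randint(0, 1) for _ in range(10)])
```

The local search is what counteracts the GA's weakness in fine-tuning near local optima; in HGAFS this role is played by the correlation-guided LSOs rather than a simple bit-flip rule.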
Our proposed HGAFS uses an HGA technique combining a bounded scheme,
Experimental studies
This section presents HGAFS's performance on several well-known real-world benchmark and gene expression classification datasets, including the diabetes, breast cancer, glass, vehicle, hepatitis, horse, sonar, splice, colon cancer, lymphoma, and leukemia datasets. These datasets have been the subject of many studies in NNs and machine learning and cover examples of small, medium, large, and very large dimensional datasets. The characteristics of these datasets are shown in Table 1, which show a
Discussions
This section briefly explains why the performance of HGAFS is better than that of the other FS algorithms. There are three major differences that might contribute to the better performance of HGAFS.
First, HGAFS is guided in selecting the salient features by the subset size determination scheme. This scheme encourages HGAFS to generate subsets in reduced form, while other approaches (e.g., [8], [18], [35], [40], [62]) use a random function instead. Thus, subsets of larger
Conclusions
A method that integrates two quite different new techniques into a GA for feature selection has been proposed: it restricts the number of 1-bits in the individual strings, and it applies a new local search operation. The fitness function, combining the performance of the NN with the correlation information of the features, assists the LSOs in HGAFS in finding the most salient features with less redundancy of information.
In HGAFS, neither the computation of a training-based classifier nor the computation of mutual information nor
Acknowledgments
Supported by grants to KM from the Japan Society for the Promotion of Science and the University of Fukui.
References
- et al., Test feature selection using ant colony optimization, Expert Systems with Applications (2009)
- et al., A new orthogonal array based crossover, with analysis of gene interactions, for evolutionary algorithms and its application to car door design, Expert Systems with Applications (2010)
- et al., Feature selection for classification, Intelligent Data Analysis (1997)
- A filter model for feature subset selection based on genetic algorithm, Knowledge-Based Systems (2009)
- et al., Eliminating redundancy and irrelevance using a new MLP-based feature selection method, Pattern Recognition (2006)
- et al., A hybrid genetic algorithm for feature selection wrapper based on mutual information, Pattern Recognition Letters (2007)
- et al., Comparison of algorithms that select features for pattern classifiers, Pattern Recognition (2000)
- et al., An efficient ant colony optimization approach to attribute reduction in rough set theory, Pattern Recognition Letters (2008)
- et al., Prediction of colon cancer using an evolutionary neural network, Neurocomputing (2004)
- et al., Random subspace method for multivariate feature selection, Pattern Recognition Letters (2006)
- An effective feature selection method for hyperspectral image classification based on genetic algorithm and support vector machine, Knowledge-Based Systems
- A novel ACO-GA hybrid algorithm for feature selection in protein function prediction, Expert Systems with Applications
- A quantitative study of experimental evaluations of neural network learning algorithms, Neural Networks
- Floating search methods in feature selection, Pattern Recognition Letters
- A hybrid approach for feature subset selection using neural networks and ant colony optimization, Expert Systems with Applications
- Feature selection with neural networks, Pattern Recognition Letters
- Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters
- Feature selection using genetic algorithm and cluster validation, Expert Systems with Applications
- Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognition
- Modified backward feature selection by cross validation, Proceedings of the European Symposium on Artificial Neural Networks
- Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature
- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences of the USA
- Selecting inputs for modeling using normalized higher order statistics and independent component analysis, IEEE Transactions on Neural Networks
- A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification, IEEE Transactions on Neural Networks
- Ant Colony Optimization
- Genetic Algorithms in Search, Optimization and Machine Learning
- An incremental approach to contribution-based feature selection, Journal of Intelligence Systems
- An introduction to variable and feature selection, Journal of Machine Learning Research
- Molecular classification of cancer: class discovery and class prediction by gene expression, Science
- Gene selection for cancer classification using support vector machines, Machine Learning
Md. Monirul Kabir received the B.E. degree in Electrical and Electronic Engineering from Bangladesh Institute of Technology (BIT), Khulna, now Khulna University of Engineering and Technology (KUET), Bangladesh in 1999. He received a master of engineering degree in the department of Human and Artificial Intelligent Systems from University of Fukui, Japan in 2008. He obtained a doctor of engineering degree in the System Design Engineering from University of Fukui in March 2011. He was an assistant programmer from 2002 to 2005 at the Dhaka University of Engineering and Technology (DUET), Bangladesh. His research interest includes data mining, artificial neural networks, evolutionary approaches, swarm intelligence, and mobile ad hoc network.
Md. Shahjahan is an Associate Professor at the Department of Electrical and Electronic Engineering, Khulna University of Engineering and Technology (KUET), Khulna, Bangladesh. He received the B.E. from Bangladesh Institute of Technology (BIT) in January 1996, the M.E. in Information Science from the University of Fukui, Japan, in 2003, and the D.E. from the Department of System Design Engineering of the University of Fukui in 2006. He joined the Department of Electrical and Electronic Engineering, KUET, as a Lecturer in September 1996 and became an Assistant Professor in 2006. He received the best student award from IEICE, Hokuriku part, Japan, in 2003. He is a member of the Institute of Engineers Bangladesh (IEB). He has published a number of papers in international conferences and journals around the world.
Kazuyuki Murase has been a Professor at the Department of Human and Artificial Intelligence Systems, Graduate School of Engineering, University of Fukui, Fukui, Japan, since 1999. He received the M.E. in Electrical Engineering from Nagoya University in 1978 and the Ph.D. in Biomedical Engineering from Iowa State University in 1983. He joined the Department of Information Science of Toyohashi University of Technology as a Research Associate in 1984, became an Associate Professor at the Department of Information Science of Fukui University in 1988, and became a Professor in 1992. He is a member of the Institute of Electronics, Information and Communication Engineers (IEICE), the Japanese Society for Medical and Biological Engineering (JSMBE), the Japan Neuroscience Society (JSN), the International Neural Network Society (INNS), and the Society for Neuroscience (SFN). He serves as a Councilor of the Physiological Society of Japan (PSJ) and the Japanese Association for the Study of Pain (JASP).