Research Article
Improved intelligent water drop-based hybrid feature selection method for microarray data processing
Introduction
Medical data can be classified into two main categories. The first is structured data, known as microarray data; the second is unstructured data, which can be two-dimensional, as in medical imaging, or one-dimensional, as in biomedical signal processing. Each type of data typically contains irrelevant or redundant features that must be removed; therefore, the feature selection (FS) process is indispensable in medical dataset analysis for purposes such as diagnosis, screening, and treatment (Remeseiro and Bolon-Canedo, 2019).
Microarray datasets contain many noisy features that must be removed to improve classification performance (Li et al., 2017). FS is a dimensionality reduction approach that finds a minimal, near-optimal subset of the features in a high-dimensional dataset by removing irrelevant and redundant features, with the aim of increasing classification or regression accuracy in supervised learning problems (Mafarja and Mirjalili, 2019).
Many FS methods have been developed in the literature for finding a more compact and optimal subset of features (Diao and Shen, 2015). Based on the availability of supervision, such as a class label for the target variable, they can be classified into three categories: supervised, semi-supervised, and unsupervised. Supervised FS methods are devoted to classification or regression problems, depending on whether the target variable is categorical or numerical, respectively. In contrast, unsupervised FS methods are usually used for clustering problems (Li et al., 2017).
Another classification is based on the selection strategy: filter, wrapper, embedded, and hybrid FS methods. The filter approach uses a statistical technique to rank each feature and then selects the top-ranked features according to a specific threshold. In contrast, the wrapper approach selects a subset of features using a specific search algorithm and then evaluates it using a specific classifier and fitness function (Li et al., 2017); see Fig. 1, parts (a) and (b).
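The difference between the two strategies can be illustrated with a minimal sketch. It is not taken from the paper: the scoring rule (squared gap between class means over within-class variance), the leave-one-out 1-nearest-neighbour evaluator, the greedy forward search, and the toy data are all assumptions chosen for brevity, and the function names `filter_select` and `wrapper_select` are hypothetical.

```python
from statistics import mean, pvariance

def filter_select(X, y, k):
    """Filter: score every feature on its own, then keep the k best."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        # Assumed score: squared gap between class means over within-class spread.
        scores.append((mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b) + 1e-12))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `subset`."""
    correct = 0
    for i in range(len(X)):
        best_d, best_label = None, None
        for m in range(len(X)):
            if m == i:
                continue
            d = sum((X[i][j] - X[m][j]) ** 2 for j in subset)
            if best_d is None or d < best_d:
                best_d, best_label = d, y[m]
        correct += best_label == y[i]
    return correct / len(X)

def wrapper_select(X, y, max_features):
    """Wrapper: greedy forward search, each candidate subset judged by a classifier."""
    selected, best_acc = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(len(X[0])) if j not in selected]
        if not candidates:
            break
        scored = [(loo_accuracy(X, y, selected + [j]), j) for j in candidates]
        acc, j = max(scored, key=lambda t: t[0])
        if acc <= best_acc:  # stop when no candidate improves the fitness
            break
        selected.append(j)
        best_acc = acc
    return selected, best_acc

# Toy data (hypothetical): feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.1, 5.0, 1.0], [0.2, 1.0, 2.0], [0.0, 3.0, 1.5],
     [1.0, 4.9, 1.2], [0.9, 1.1, 2.1], [1.1, 3.1, 1.4]]
y = [0, 0, 0, 1, 1, 1]

top_by_filter = filter_select(X, y, k=1)
subset_by_wrapper, wrapper_acc = wrapper_select(X, y, max_features=2)
```

The filter never calls a classifier, so it is cheap but blind to feature interactions. The wrapper calls the classifier once per candidate subset, which is where its higher cost on high-dimensional data comes from.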
In embedded FS methods, selection is accomplished as part of the classification algorithm, as in random forest, where the classifier automatically selects the features that maximize accuracy, or during the training of the classifier, as when weights are specified in a neural network (Ang et al., 2015); see Fig. 1, part (c).
Hybrid FS methods, on the other hand, select a subset of features by combining two or more FS methods from different selection strategies in order to exploit their advantages simultaneously. In general, the wrapper approach provides more accurate results than the filter approach, whereas the filter and embedded approaches need less computation time than wrapper-based FS methods (Manikandan and Abirami, 2018).
Some studies (Saeys et al., 2008; Abeel et al., 2010; Haury et al., 2011; Manikandan and Abirami, 2018; Rouhi and Nezamabadi-Pour, 2020) discuss an additional feature selection strategy, called ensemble FS, which aims to solve the instability and perturbation issues of many individual FS methods by running a particular FS method on several sub-samples and merging the obtained features to form a more stable subset, thereby tackling the over-fitting problem in high-dimensional datasets (Ang et al., 2015).
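As a rough illustration of the sub-sampling idea, the sketch below runs one simple filter on several stratified sub-samples and merges the results by vote count. The scoring rule, the 70% sub-sample fraction, the voting scheme, the toy data, and the name `ensemble_select` are assumptions for illustration only and are not the aggregation used in any of the cited studies.

```python
import random
from collections import Counter
from statistics import mean, pvariance

def score_features(X, y):
    """Assumed filter score: squared class-mean gap over within-class variance."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b) + 1e-12))
    return scores

def ensemble_select(X, y, k, n_rounds=20, frac=0.7, seed=0):
    """Run the filter on stratified sub-samples and keep the most-voted features."""
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    # Assumes each class has at least two samples.
    size0 = min(len(idx0), max(2, round(frac * len(idx0))))
    size1 = min(len(idx1), max(2, round(frac * len(idx1))))
    votes = Counter()
    for _ in range(n_rounds):
        sub = rng.sample(idx0, size0) + rng.sample(idx1, size1)
        scores = score_features([X[i] for i in sub], [y[i] for i in sub])
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        votes.update(top)  # one vote per round for each top-k feature
    return [j for j, _ in votes.most_common(k)], votes

# Toy data (hypothetical): feature 0 is informative, features 1 and 2 are noise.
X = [[0.1, 5.0, 1.0], [0.2, 1.0, 2.0], [0.0, 3.0, 1.5],
     [1.0, 4.9, 1.2], [0.9, 1.1, 2.1], [1.1, 3.1, 1.4]]
y = [0, 0, 0, 1, 1, 1]

selected, votes = ensemble_select(X, y, k=1)
```

A feature that is ranked highly on nearly every sub-sample collects nearly every vote, which is the sense in which the merged subset is more stable than the output of a single run.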
Most microarray datasets in the medical field are high-dimensional, as they consist of a large number of features compared with a small number of samples of a particular disease. This may lead to over-fitting and inaccurate results when using wrapper or embedded FS methods, because the classifier is called repeatedly to evaluate each subset. Therefore, most traditional FS approaches, especially for microarray data, have used filter methods. Recently, there has been a tendency to use hybrid (filter-wrapper) or ensemble methods for FS over medical datasets, first applying an ensemble filter to reduce dimensionality by removing some features and then using a wrapper for fine-tuning to obtain more accurate results (Bolón-Canedo et al., 2015b).
FS methods can also be classified from a data perspective: FS with conventional data, FS with structured features, FS with heterogeneous data, and FS with streaming data (Li et al., 2017).
FS methods for conventional data include most existing FS methods, which ignore the inherent feature structures and assume that all features are independent of each other. These methods assess the importance of each feature in one of four ways: feature similarity, as in “Fisher Score”; heuristic filter criteria, such as “Mutual Information”; sparse regularization terms, as in “Multi-Cluster”; or statistical measures, such as “Low Variance” and “T-score”. In contrast, some FS methods take the structure of the features (spatial or temporal, groups, trees, or graphs) into account in the FS process and thereby improve the learning task.
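For reference, two of the named criteria are commonly written as follows. This is the usual textbook form rather than a formula quoted from this paper, and the notation is an assumption: feature $j$, classes $c = 1, \dots, C$ with $n_c$ samples each, class mean $\mu_{j,c}$, class variance $\sigma_{j,c}^2$, and overall mean $\mu_j$.

```latex
% Fisher score of feature j: between-class scatter over within-class scatter
F_j = \frac{\sum_{c=1}^{C} n_c \left( \mu_{j,c} - \mu_j \right)^2}
           {\sum_{c=1}^{C} n_c \, \sigma_{j,c}^2}

% T-score of feature j for a two-class problem
t_j = \frac{\left| \mu_{j,1} - \mu_{j,2} \right|}
           {\sqrt{\dfrac{\sigma_{j,1}^2}{n_1} + \dfrac{\sigma_{j,2}^2}{n_2}}}
```

In both cases a larger value indicates a feature whose class means are well separated relative to its within-class spread, so features are ranked in descending order of the score.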
Some FS methods, however, are applicable to heterogeneous data, which includes data from multiple sources, data from multiple views, and linked data such as social media data. Finally, some FS methods are suitable for streaming features or streaming data, where one dimension (either the number of instances or the number of candidate features) is constant, while the other is unknown or infinite and arrives one item at a time, as in the “Unsupervised Streaming” FS method for social media. Alhenawi et al. (2022) categorized the research on gene expression classification conducted during the last seven years into nine directions based on purpose.
In this paper, we propose a hybrid feature selection method for microarray data classification that balances exploration and exploitation in selecting a subset of features by combining an ensemble filter method with the Intelligent Water Drop (IWD) algorithm developed by Shah-Hosseini (2009), with two improvements:
- One improvement targets the exploration capability of IWD by using three different iterative Local Search Algorithms (ILSA).
- The other improvement targets the selection of the next feature added to the IWD solution list for each drop in the original IWD algorithm. It uses the correlation coefficient (cc) between the features currently in the IWD list and all other unvisited features as the HUD value for updating the soil carried by the drop itself and the soil on the selected path for each IWD agent, in order to avoid selecting redundant features. The fast correlation coefficient filter is a multivariate FS method that finds features that are strongly correlated with a specific class and have the lowest correlation with other features (Djellali et al., 2017); of two strongly correlated features, the one with the lower correlation with the class is considered redundant and is not selected as the next feature in the IWD list.
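The redundancy rule described in the second improvement can be sketched in isolation. This is a simplified illustration, not the authors' HUD or soil update: it only shows how a Pearson correlation coefficient can flag a candidate as redundant with respect to already chosen features, and how the less class-relevant member of a strongly correlated pair is dropped. The 0.9 threshold, the toy data, and the names `redundancy` and `drop_redundant` are assumptions.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(va * vb)

def redundancy(columns, selected, candidate):
    """Largest |cc| between the candidate and any already selected feature."""
    return max((abs(pearson(columns[candidate], columns[s])) for s in selected),
               default=0.0)

def drop_redundant(columns, y, threshold=0.9):
    """Visit features by class relevance; skip any that duplicates a kept one."""
    relevance = [abs(pearson(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: relevance[j], reverse=True)
    kept = []
    for j in order:
        if redundancy(columns, kept, j) < threshold:
            kept.append(j)
    return sorted(kept)

# Toy data (hypothetical), stored column-wise: feature 1 nearly duplicates feature 0.
columns = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 6, 8, 10, 12.5],
    [3, 1, 2, 3, 1, 2],
]
y = [0, 0, 0, 1, 1, 1]

kept = drop_redundant(columns, y)
```

In a full IWD wrapper, a value such as `redundancy(columns, selected, candidate)` would play the role of a penalty on the path to the candidate feature; how that penalty enters the soil and velocity updates is specified in the paper itself and is not reproduced here.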
The main contributions of this paper can be summarized as follows:
1. Summarizing the most recent state-of-the-art works related to FS methods based on IWD.
2. Proposing a hybrid feature selection algorithm for medical applications based on an ensemble filter and an improved IWD as a wrapper.
3. Improving the exploitation capability of IWD by applying one of three LS algorithms (TS, NLSA, HC) after each IWD iteration to overcome the local optima problem.
4. Evaluating the performance of the proposed feature selection algorithm against some of the most recent FS algorithms from the literature.
The rest of this paper is organized as follows: Section 2 presents a brief review of the latest works on FS using IWD. Section 3 presents the inspiration and mathematical equations of IWD, while Section 4 is devoted to presenting the proposed hybrid FS method. In Section 5, the experimental setups are explained. Experimental results are presented and discussed in Section 6. Finally, Section 7 presents the conclusion and future work.
Section snippets
IWD based feature selection related work
In the literature, there is some prior work on deploying IWD for the FS problem in different applications, such as intrusion detection, spam email detection, sentiment analysis, web page classification, rough set FS, and gene selection for cancer prediction in the medical field, as detailed in this section and summarized in Tables 1 and 2.
Hendrawan and Murase (2011) developed four embedded feature selection methods to find the most significant set of textural features for an irrigation
IWD algorithm
IWD was developed by Shah-Hosseini (2009). It is inspired by the way water drops flow intelligently in natural rivers, where each drop works as an independent agent that starts from the source and moves randomly with a specific velocity and a specific initial amount of soil. During movement, each drop carries an amount of soil from the bed of the path proportional to its velocity. The soil in the path will be decreased as the shortest and best path to encourage other drops to
The proposed hybrid FS method
In most works that develop hybrid FS methods, the authors select a specific filter to reduce the number of features passed to the wrapper stage, where the wrapper must select features from the filtered subset to optimize the accuracy of a particular classifier on the training set. However, this approach relies on a single filter for selecting the features that are passed to the wrapper, which means that the probability of removing some relevant features before
Experimental setups
This section is devoted to describing the main setups of the experiments conducted in this paper, including information about the datasets, data preprocessing steps, parameter tuning, the classifier, and evaluation metrics.
Experimental results and discussion
In this section, we present and discuss the results obtained from the experiments, which were conducted in two stages.
Conclusion and future work
In this paper, a hybrid FS method based on an ensemble filter and an improved Intelligent Water Drop (IWD) algorithm as a wrapper is proposed. Initially, an ensemble filter is used to decrease the number of features that are passed to the next stage, where an improved IWD-based wrapper FS method is applied. The original IWD is improved in two steps: first, a correlation coefficient filter is used as the HUD to select the next feature in each iteration; then one LS algorithm from
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We would like to thank everyone who provided technical help and assisted in reviewing and editing the language of the manuscript.
References (51)
- et al. Text feature selection using ant colony optimization. Expert Syst. Appl. (2009)
- et al. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput. Biol. Med. (2022)
- et al. An ensemble of intelligent water drop algorithm for feature selection optimization problem. Appl. Soft Comput. (2018)
- et al. A modified intelligent water drops algorithm and its application to optimization problems. Expert Syst. Appl. (2014)
- et al. An ensemble of intelligent water drop algorithms and its application to optimization problems. Inform. Sci. (2015)
- et al. Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl. Soft Comput. (2016)
- et al. Distributed feature selection: An application to microarray data classification. Appl. Soft Comput. (2015)
- et al. Hybrid method based on information gain and support vector machine for gene selection in cancer classification. Genom. Proteom. Bioinform. (2017)
- et al. Neural-intelligent water drops algorithm to select relevant textural features for developing precision irrigation system using machine vision. Comput. Electron. Agric. (2011)
- et al. A dynamic framework for tuning SVM hyper parameters based on Moth-Flame Optimization and knowledge-based-search. Expert Syst. Appl. (2021)
- General local search methods. European J. Oper. Res.
- A review of feature selection methods in medical applications. Comput. Biol. Med.
- Classification of human cancer diseases by gene expression profiles. Appl. Soft Comput.
- An approach to continuous optimization by the intelligent water drops algorithm. Proc.-Soc. Behav. Sci.
- A hybrid gene selection method for microarray recognition. Biocybern. Biomed. Eng.
- Improved Salp Swarm Algorithm based on opposition based learning and novel local search algorithm for feature selection. Expert Syst. Appl.
- Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics
- An IWD-based feature selection method for intrusion detection system. Soft Comput.
- Robustification of Naïve Bayes classifier and its application for microarray gene expression data analysis. BioMed Res. Int.
- A hybrid job scheduling algorithm based on Tabu and Harmony search algorithms. J. Supercomput.
- Intelligent water drops algorithm for rough set feature selection
- Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE/ACM Trans. Comput. Biol. Bioinform.
- Feature selection for web page classification using the intelligent water drops algorithm. Glob. J. Technol.
- A new optimal gene selection approach for cancer classification using enhanced Jaya-based forest optimization algorithm. Neural Comput. Appl.
- Feature selection in DNA microarray classification
Cited by (10)
- Optimizing microarray cancer gene selection using swarm intelligence: Recent developments and an exploratory study. Egyptian Informatics Journal (2023)
- Optimizing fetal health prediction: Ensemble modeling with fusion of feature selection and extraction techniques for cardiotocography data. Computational Biology and Chemistry (2023)
- Solving Traveling Salesman Problem Using Parallel River Formation Dynamics Optimization Algorithm on Multi-core Architecture Using Apache Spark. International Journal of Computational Intelligence Systems (2024)
- A Gene Selection Method Considering Measurement Errors. Journal of Computational Biology (2024)
- A bio-medical snake optimizer system driven by logarithmic surviving global search for optimizing feature selection and its application for disorder recognition. Journal of Computational Design and Engineering (2023)
- A novel feature selection algorithm for identifying hub genes in lung cancer. Scientific Reports (2023)