Feature selection methods on gene expression microarray data for cancer classification: A systematic review
Introduction
Classifying microarray datasets, a form of large-scale biological data analysis, is a popular and attractive area of study for many researchers, as microarray technology is one of the most important tools in molecular biology for cancer detection [1]. The task depends on developing effective classification models that, once trained on a specific training dataset, can classify unseen microarray data. Detecting and classifying cancer using microarray gene expression data poses a major challenge for researchers in computer science, because these datasets contain a small number of examples but a huge number of genes. Many of these genes are irrelevant or redundant and must be removed by an efficient feature selection (FS) method to improve classification performance. Researchers have therefore put much effort into devising more effective FS techniques that increase classification accuracy and decrease computation time while using fewer genes in the diagnostic and prognostic prediction of tumors [2].
FS is the process of selecting the most relevant and informative features to improve classification performance on high-dimensional datasets [3]. The main types of FS methods are filter, wrapper, embedded, and hybrid methods, to which a more recently developed type, the ensemble method, has been added [4].
Filters use statistical measures to evaluate features against a class label. There are two main categories of filters: ranking-based (univariate) and search-space-based (multivariate). The first category ranks each feature according to its relationship with the class label and keeps the features whose rank passes a specific threshold, which removes the most irrelevant features. In contrast, the second category also considers the relationships among features, so it can remove redundant features in addition to irrelevant ones [5].
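The ranking-based (univariate) category can be illustrated with a minimal sketch. The scoring function (an absolute signal-to-noise ratio between two classes), the function names, and the toy expression matrix are assumptions chosen for illustration; they are not taken from any of the reviewed papers.

```python
from statistics import mean, pstdev

def snr_score(values, labels):
    """Absolute signal-to-noise ratio of one gene between classes 0 and 1."""
    a = [v for v, t in zip(values, labels) if t == 0]
    b = [v for v, t in zip(values, labels) if t == 1]
    spread = pstdev(a) + pstdev(b)
    gap = abs(mean(a) - mean(b))
    if spread == 0:
        # Constant within both classes: perfectly separating or useless.
        return float("inf") if gap > 0 else 0.0
    return gap / spread

def rank_filter(X, y, k):
    """Score every gene independently and keep the indices of the top k."""
    n_genes = len(X[0])
    scores = [snr_score([row[j] for row in X], y) for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy data (hypothetical): 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.3],
    [2.9, 4.8, 0.7],
]
y = [0, 0, 0, 1, 1, 1]

print(rank_filter(X, y, k=2))
```

Because each gene is scored in isolation, this kind of filter cannot detect that two highly ranked genes carry the same information, which is the gap that multivariate filters address.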
Wrappers rely on classifier evaluation to select features: they search for the feature subset that gives the best results according to a fitness value computed for a classifier. A wrapper therefore consists of three parts: a search algorithm, a classifier, and a fitness function [6]. Embedded methods, in contrast, select the features that improve classification performance automatically as part of the learning stage [2]. See Fig. 1.
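The three parts of a wrapper can be sketched as follows. This is only an illustration under stated assumptions: greedy forward selection stands in for the search algorithm, a nearest-centroid rule for the classifier, and leave-one-out accuracy for the fitness function. All names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x, features):
    """Classifier: assign x to the class with the closest centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]

    def sq_dist(center):
        return sum((x[f] - center[i]) ** 2 for i, f in enumerate(features))

    # sorted() makes tie-breaking between classes deterministic.
    return min(sorted(centroids), key=lambda label: sq_dist(centroids[label]))

def loo_accuracy(X, y, features):
    """Fitness function: leave-one-out accuracy on the chosen features."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i], features) == y[i]:
            hits += 1
    return hits / len(X)

def forward_selection(X, y, max_features):
    """Search algorithm: add the gene that most improves fitness, stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Ties on fitness are broken in favour of the lowest gene index.
        score, _, gene = max((loo_accuracy(X, y, selected + [f]), -f, f)
                             for f in remaining)
        if score <= best:
            break
        selected.append(gene)
        remaining.remove(gene)
        best = score
    return selected, best

# Toy data (hypothetical): 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.3],
    [2.9, 4.8, 0.7],
]
y = [0, 0, 0, 1, 1, 1]

selected, fitness = forward_selection(X, y, max_features=2)
print(selected, fitness)
```

Every candidate subset requires retraining and re-evaluating the classifier, which is why wrappers tend to cost far more computation than filters on datasets with thousands of genes.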
Ensemble methods have appeared recently to provide the stability that many FS methods lack. This is achieved by aggregating the results of different feature subsets, generated either by applying the same FS method to various training datasets (the homogeneous ensemble FS approach) or by applying various FS methods to the same training data (the heterogeneous ensemble FS approach) [7]. See Fig. 1 (d and e).
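Both ensemble approaches can be sketched with a simple vote-based aggregation. The base selector (ranking genes by the gap between class means), the leave-one-out resampling used to create different training sets, and majority voting as the aggregation rule are all assumptions made for illustration; the reviewed studies use a variety of other choices.

```python
from collections import Counter

def mean_gap_selector(X, y, k):
    """Hypothetical base filter: rank genes by |mean(class 1) - mean(class 0)|."""
    def gap(j):
        a = [r[j] for r, t in zip(X, y) if t == 0]
        b = [r[j] for r, t in zip(X, y) if t == 1]
        return abs(sum(b) / len(b) - sum(a) / len(a))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def aggregate_by_votes(subsets, k):
    """Keep the k genes selected most often; ties go to the lower index."""
    votes = Counter(f for subset in subsets for f in subset)
    return sorted(votes, key=lambda f: (-votes[f], f))[:k]

def homogeneous_ensemble(X, y, selector, k):
    """Same FS method applied to different training sets (one sample left out each time)."""
    subsets = [selector(X[:i] + X[i + 1:], y[:i] + y[i + 1:], k)
               for i in range(len(X))]
    return aggregate_by_votes(subsets, k)

def heterogeneous_ensemble(X, y, selectors, k):
    """Different FS methods applied to the same training set."""
    return aggregate_by_votes([select(X, y, k) for select in selectors], k)

# Toy data (hypothetical): 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.3],
    [2.9, 4.8, 0.7],
]
y = [0, 0, 0, 1, 1, 1]

print(homogeneous_ensemble(X, y, mean_gap_selector, k=1))
```

A gene that survives the vote was chosen consistently across perturbed training sets or across different selectors, which is the sense in which aggregation improves stability.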
Each approach has its strengths and weaknesses. For example, filter-based FS methods have a lower computational cost and are less prone to over-fitting than wrapper and embedded FS methods, whereas wrapper and embedded FS methods provide better accuracy than filters. Hybrid methods combine the advantages of wrappers and filters: they achieve better accuracy than filters at a lower computational cost than wrappers. Ensemble methods are the most flexible FS methods for high-dimensional data and are the least prone to over-fitting [2].
In the literature, many reviews address FS for microarray data, such as [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], but to the best of our knowledge, no review has categorized these studies based on the main purpose of the reviewed papers.
In this systematic review, we categorize recent studies on FS for microarray data processing into nine directions based on the main objective of each study. Some of these studies developed new FS methods (filter, wrapper, embedded, ensemble, distributed, or parallel), others selected FS methods from the literature and compared their performance in the same environment, and others conducted surveys on FS.
Each study in each direction was reviewed based on the methods, datasets, classifiers, performance metrics, range of dataset dimensions used, and results obtained. Finally, we summarize the general observations we made, along with our conclusions.
The main contributions of this systematic review can be summarized as follows:
- It analyses 132 research papers on FS in microarray data processing, published by Elsevier, Springer, and IEEE over the last seven years, according to their main objectives, and categorizes these studies into nine main directions.
- It identifies the FS directions that received the most, moderate, and the least research attention over the last seven years, to guide researchers who plan to work on FS for microarray data processing in choosing a research direction.
- In each direction, each paper is reviewed based on the methods, datasets, classifiers, performance metrics, range of dataset dimensions used, and results obtained. This aims to give researchers who have already chosen a research direction in FS for microarray data classification valuable information about the related work in that direction, as displayed in Fig. 2.
The rest of this paper is organized as follows: Section 2 presents the background. Section 3 describes the methodology. Section 4 presents and discusses the nine main directions of FS research on microarray data over the last seven years that we propose in this survey. Section 5 traces the development of FS publications over the same period. Section 6 identifies the main observations, and Section 7 presents the conclusion and future work.
Gene expression microarray data
Gene expression microarray data are structured medical data in which each instance represents a patient sample and each feature represents the expression coefficient of a gene in that sample. Microarray datasets are usually high-dimensional, as they contain a huge number of features but a small number of samples [24].
Importance of FS to microarray datasets analysis
Detecting cancer-infected genes and normal healthy genes from the microarray dataset is challenging in high dimensional microarray datasets which contain many redundant and irrelevant
Methodology
This survey was conducted by applying five main steps as summarized in Fig. 2:
- First, we collected 132 papers on FS in microarray data processing, published by three popular publishers (Elsevier, Springer, and IEEE) during the last seven years.
- These papers were analyzed and distributed into nine main directions based on their main objectives: proposing new FS methods that do not exist in the literature, conducting surveys, or comparing the performance of existing FS methods.
- Each
Main directions of FS research in microarray data
In the literature, there are many studies on FS for microarray data processing, which we categorize into nine main research directions as illustrated in Fig. 3. In each direction, papers were grouped and connected on two levels. At the first level, the papers were ordered according to specific criteria that vary from one direction to another (for example, D3 and D5 connected papers that used the same meta-heuristic algorithm
Development of feature selection publications over the recent seven years
Tracking the research published over the last seven years across all the directions discussed in this survey shows an increase in the number of papers published in the last four years, especially in 2018 and 2019 compared with the years before 2018, and this holds for most directions.
We noticed that there are more works published in relation to D1, D2, D3, D5, D6, and D8 in the recent three years (2018–2021) than works
Observations and analyses
In this systematic review, we investigated 132 research papers on FS for microarray data processing published by three well-known publishers (Elsevier, Springer, and IEEE) during the last seven years. We observed that researchers focused on the fifth direction (D5), “Hybrid FSM”, which accounted for 34.9% of the studies examined. We believe that this high percentage may be because hybrid methods generally improve classification accuracy without causing
Conclusion and future work
We examined papers concerned with FS for microarray data processing that were published by three well-known publishers (Elsevier, Springer, and IEEE) during the last seven years. We found that 38% of these papers were published by Springer. The reviewed papers were categorized into nine directions based on their main purposes, and then summarized according to which directions received the most, moderate, and the least research attention across all 132 papers that were reviewed in this
CRediT author statement
Esra'a Alhenawi: Methodology, Writing- Original draft preparation, validation. Rizik Al-Sayyed: Supervision, Writing - review & editing, Visualization, Validation. Amjad Hudaib: Supervision, Writing - review, Validation. Seyedali Mirjalili: Supervision, Review & editing, Visualization, Validation.
Declaration of competing interest
The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in
Acknowledgements
We would like to thank everyone who provided technical help, assisted in reviewing and editing the language of the manuscript, or offered general support and useful comments regarding this manuscript.
References (165)
- et al. A review of feature selection methods in medical applications. Comput. Biol. Med. (2019)
- et al. A survey on feature selection methods. Comput. Electr. Eng. (2014)
- et al. Distributed feature selection: an application to microarray data classification. Appl. Soft Comput. (2015)
- et al. Ensemble feature selection: homogeneous and heterogeneous approaches. Knowl. Base Syst. (2017)
- et al. Feature selection of gene expression data for cancer classification: a review. Procedia Computer Science (2015)
- et al. An ensemble of filters and classifiers for microarray data classification. Pattern Recogn. (2012)
- et al. Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps. Artif. Intell. Med. (2010)
- et al. Locality sensitive semi-supervised feature selection. Neurocomputing (2008)
- et al. An improved minimum redundancy maximum relevance approach for feature selection in gene expression data. Procedia Technology (2013)
- et al. Feature subset selection in large dimensionality domains. Pattern Recogn. (2010)
- Gene selection for microarray data classification using a novel ant colony optimization. Neurocomputing
- A survey of neural network-based cancer prediction models from microarray data. Artif. Intell. Med.
- Feature selection and tumor classification for microarray data using relaxed lasso and generalized multi-class support vector machine. J. Theor. Biol.
- Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for svm classification. Appl. Soft Comput.
- Svm-bt-rfe: an improved gene selection framework using bayesian t-test embedded in support vector machine (recursive feature elimination) algorithm. Karbala Int. J. Modern Sci.
- A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy. Appl. Soft Comput.
- Applying particle swarm optimization-based decision tree classifier for cancer classification on gene expression data. Appl. Soft Comput.
- Efficient feature selection method using real-valued grasshopper optimization algorithm. Expert Syst. Appl.
- Recursive memetic algorithm for gene selection in microarray data. Expert Syst. Appl.
- A hybrid feature selection algorithm for gene expression data classification. Neurocomputing
- Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics
- Classification of human cancer diseases by gene expression profiles. Appl. Soft Comput.
- A two-stage gene selection method for biomarker discovery from microarray data for cancer classification. Chemometr. Intell. Lab. Syst.
- A hybrid gene selection method for microarray recognition. Biocybernet. Biomed. Eng.
- Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords. J. Biomed. Inf.
- A novel effective diagnosis model based on optimized least squares support machine for gene microarray. Appl. Soft Comput.
- Optimization based tumor classification from microarray gene expression data. PLoS One
- A survey on feature selection and extraction techniques for high-dimensional microarray datasets
- Swarm intelligence based feature selection for high dimensional classification: a literature survey. Int. J. Comput.
- Feature selection is important: state-of-the-art methods and application domains of feature selection on high-dimensional data
- Review on feature selection methods for gene expression data classification
- A Study on Metaheuristics Approaches for Gene Selection in Microarray Data: Algorithms, Applications and Open Challenges
- Feature selection applied to microarray data
- Classification of microarray data
- Review on the usage of swarm intelligence in gene expression data
- A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access
- Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE ACM Trans. Comput. Biol. Bioinf
- Challenges and future trends for microarray analysis
- A review on feature selection techniques for gene expression data
- A survey on gene selection for microarray cancer classification based on soft computing techniques
- A review of feature selection methods with applications
- A comparative study of various feature selection techniques in high-dimensional data set to improve classification accuracy
- Feature selection in dna microarray classification
- A meta-review of feature selection techniques in the context of microarray data
- Feature selection: a data perspective. ACM Comput. Surv.
- Unsupervised feature selection for multi-cluster data
- Supervised feature selection: a tutorial. Artif. Intell. Res.
- Condition monitoring for the roller bearings of wind turbines under variable working conditions based on the Fisher score and permutation entropy. Energies
- Feature selection based on mutual information
- Laplacian score for feature selection. Adv. Neural Inf. Process. Syst.
Esra'a Alhenawi is currently a PhD candidate at The University of Jordan, King Abdullah II School for Information Technology, Department of Computer Science.
Rizik Al-Sayyed is a professor in computer networks, cloud computing, database systems, and simulation at The University of Jordan, King Abdullah II School for Information Technology, Department of Information Technology.
Amjad Hudaib is a professor in Software Engineering at The University of Jordan, King Abdullah II School for Information Technology, Department of Computer Information Systems.
Seyedali Mirjalili is currently an Associate Professor and the director of the Centre for Artificial Intelligence Research and Optimization at Torrens University Australia. He is internationally well recognized in Swarm Intelligence and Optimization.