Feature selection methods on gene expression microarray data for cancer classification: A systematic review

https://doi.org/10.1016/j.compbiomed.2021.105051

Abstract

This systematic review provides researchers interested in feature selection (FS) for microarray data with comprehensive information about the main research directions in gene expression classification over the past seven years. A set of 132 papers published by three different publishers is reviewed. The papers are categorized into nine directions based on their objectives, and the directions are then summarized according to the level of research attention each received. The review revealed that ‘propose hybrid FS methods’ was the most active research direction, accounting for 34.9% of the papers, while the other directions ranged from 13.6% down to 3%. This can help researchers choose the most competitive research direction. Papers in each category are thoroughly reviewed from six perspectives: method(s), classifier(s), dataset(s), dataset dimension range, performance metric(s), and result(s) achieved.

Introduction

Classifying microarray datasets, a form of large-scale biological data analysis, is a popular and attractive area of study for many researchers, as microarray technology is one of the most important applications in molecular biology for cancer detection [1]. It depends on developing more effective classification models that, once trained on a specific training dataset, can classify unseen microarray data. Detecting and classifying cancer using microarray gene expression data poses a huge challenge for researchers in computer science, as this kind of dataset contains a small number of examples versus a huge number of genes. Many of these genes are irrelevant or redundant, and they must be removed with an efficient FS method to improve classification performance. Therefore, researchers have invested much effort in developing more effective FS techniques that increase classification accuracy and decrease computation time by using a smaller number of genes in the diagnostic and prognostic prediction of cancer [2].

FS is the process of selecting the most relevant features to improve classification performance in high-dimensional datasets [3]. Filter, wrapper, embedded, and hybrid methods are the main types of FS methods, and a newer type, called ensemble, has recently been developed [4].

Filters use statistical measures to evaluate features against a class label. There are two categories of filters: ranking-based (univariate) and search-space-based (multivariate). The first category ranks each feature by its relationship with the class label and keeps the features whose rank passes a specific threshold, which removes the most irrelevant features. In contrast, the second category also considers the relationships among features. Therefore, it can remove redundant features in addition to irrelevant ones [5].
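
As a rough illustration of the ranking-based (univariate) category, the sketch below scores each gene with a Fisher-style ratio of between-class to within-class variance and keeps the genes whose score passes a threshold. The function names, the toy expression matrix, and the threshold value are illustrative assumptions and are not taken from any of the reviewed papers.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Return one Fisher-style score per feature (column) of X.

    X is a list of samples, each a list of gene expression values;
    y holds the class label of each sample.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = mean(column)
        between = 0.0  # spread of the class means around the overall mean
        within = 0.0   # spread of the samples inside each class
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            between += len(values) * (mean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        scores.append(between / within if within > 0 else 0.0)
    return scores


def select_by_threshold(scores, threshold):
    """Keep the indices of the features whose score reaches the threshold."""
    return [j for j, s in enumerate(scores) if s >= threshold]


# Toy data (assumed for illustration): 6 samples x 4 genes.
# Only gene 0 separates the two classes.
X = [
    [5.1, 0.2, 3.0, 1.0],
    [4.9, 0.8, 3.1, 1.1],
    [5.0, 0.5, 2.9, 0.9],
    [1.0, 0.4, 3.0, 1.0],
    [1.2, 0.7, 3.1, 1.1],
    [0.9, 0.3, 2.9, 0.9],
]
y = [1, 1, 1, 0, 0, 0]

scores = fisher_scores(X, y)
selected = select_by_threshold(scores, threshold=1.0)
print(selected)
```

Because each gene is scored on its own, a filter of this kind cannot detect that two highly ranked genes carry the same information, which is the gap the multivariate category addresses.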

Wrappers rely on classifier evaluation to select features: they select the feature subset that gives the best results according to a fitness value computed for a classifier. This kind of method consists of three parts: a search algorithm, a classifier, and a fitness function [6]. Embedded methods, in contrast, select the features that improve classification performance automatically, as part of the learning stage [2]. See Fig. 1.
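
The three parts of a wrapper can be sketched as follows. This is a minimal example under stated assumptions: greedy forward selection stands in for the search algorithm, a nearest-centroid rule for the classifier, and leave-one-out accuracy for the fitness function. These choices, the function names, and the toy data are illustrative only; the reviewed papers use a wide range of search algorithms and classifiers.

```python
def nearest_centroid_predict(train_X, train_y, sample, features):
    """Classifier: assign the label of the closest class centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        centroid = [sum(r[f] for r in rows) / len(rows) for f in features]
        dist = sum((sample[f] - c) ** 2 for f, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def fitness(X, y, features):
    """Fitness function: leave-one-out accuracy on the given feature subset."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)


def forward_selection(X, y, max_features):
    """Search algorithm: greedily add the feature that most improves fitness."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(fitness(X, y, selected + [f]), f) for f in remaining]
        # Highest fitness wins; ties go to the lowest feature index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no candidate improves the current subset
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data (assumed for illustration): 6 samples x 4 genes.
X = [
    [5.1, 0.2, 3.0, 1.0],
    [4.9, 0.8, 3.1, 1.1],
    [5.0, 0.5, 2.9, 0.9],
    [1.0, 0.4, 3.0, 1.0],
    [1.2, 0.7, 3.1, 1.1],
    [0.9, 0.3, 2.9, 0.9],
]
y = [1, 1, 1, 0, 0, 0]

selected, score = forward_selection(X, y, max_features=2)
print(selected, score)
```

Every candidate subset requires retraining and re-evaluating the classifier, which is why wrappers tend to cost more computation than filters on datasets with thousands of genes.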

Ensemble methods have recently appeared to provide stability, which many FS methods lack. This is achieved by aggregating the results of different feature subsets generated either by applying the same FS method to various training data (homogeneous ensemble FS approach) or by applying various FS methods to the same training data (heterogeneous ensemble FS approach) [7]. See Fig. 1 (d and e).
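
A homogeneous ensemble can be sketched as below, assuming a simple mean-difference filter, stratified bootstrap resampling to produce the varied training data, and mean-rank aggregation. All three choices, along with the function names and the toy data, are assumptions made for illustration; other base selectors and aggregation rules are equally possible.

```python
import random


def mean_difference_scores(X, y):
    """Stand-in filter: absolute difference between the two class means.

    Assumes binary labels encoded as 0 and 1.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def ranks_from_scores(scores):
    """Convert scores to ranks; rank 0 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def stratified_bootstrap(X, y, rng):
    """Resample with replacement inside each class so no class disappears."""
    sample_X, sample_y = [], []
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        for i in rng.choices(idx, k=len(idx)):
            sample_X.append(X[i])
            sample_y.append(label)
    return sample_X, sample_y


def homogeneous_ensemble(X, y, n_rounds, top_k, seed=0):
    """Apply the same filter to several resamples and aggregate by mean rank."""
    rng = random.Random(seed)
    total = [0] * len(X[0])
    for _ in range(n_rounds):
        bx, by = stratified_bootstrap(X, y, rng)
        for j, r in enumerate(ranks_from_scores(mean_difference_scores(bx, by))):
            total[j] += r
    mean_ranks = [t / n_rounds for t in total]
    return sorted(range(len(mean_ranks)), key=lambda j: mean_ranks[j])[:top_k]


# Toy data (assumed for illustration): 6 samples x 4 genes.
X = [
    [5.1, 0.2, 3.0, 1.0],
    [4.9, 0.8, 3.1, 1.1],
    [5.0, 0.5, 2.9, 0.9],
    [1.0, 0.4, 3.0, 1.0],
    [1.2, 0.7, 3.1, 1.1],
    [0.9, 0.3, 2.9, 0.9],
]
y = [1, 1, 1, 0, 0, 0]

stable_genes = homogeneous_ensemble(X, y, n_rounds=20, top_k=1)
print(stable_genes)
```

A heterogeneous ensemble would follow the same aggregation step but replace the resampling loop with a loop over different FS methods applied to the same training data.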

Each approach has its advantages and disadvantages. For example, filter-based FS methods have lower computational cost and are less prone to over-fitting than wrapper-based and embedded FS methods. In contrast, wrapper-based and embedded FS methods provide better accuracy than filters. Hybrid methods combine the advantages of wrappers and filters: they achieve better accuracy than filters and need less computational cost than wrappers. Ensemble methods are the most flexible FS methods for high-dimensional data and are the least prone to over-fitting [2].

In the literature, there are many reviews of FS for microarray data, such as [8–23], but to the best of our knowledge, no review categorizes these studies based on the main purpose of the reviewed papers.

In this systematic review, we categorize recent studies on FS for microarray data processing into nine directions based on the main objective of each study. Some of these studies developed new FS methods (filter, wrapper, embedded, ensemble, distributed, or parallel), while others selected FS methods (FSM) from the literature and compared their performance in the same environment. Other researchers conducted surveys on FS.

Each study in each direction was reviewed based on the methods, datasets, classifiers, performance metrics, dataset dimension range, and results obtained. Finally, we summarize our general observations along with our conclusions.

The main contributions of this systematic review can be summarized as follows:

  • It analyzes 132 research papers on FS in microarray data processing published by Elsevier, Springer, and IEEE in the last seven years according to their main objectives, and categorizes them into nine main directions.

  • It summarizes the FS directions that received the highest, the middle, and the lowest research attention in the last seven years, to guide researchers who plan to pursue research on FS for microarray data processing in choosing their research direction.

  • In each direction, each research paper is reviewed based on the methods, datasets, classifiers, performance metrics, dataset dimension range, and results obtained. This aims to provide researchers who have already chosen a research direction in FS for microarray data classification with valuable information about the related work in that direction, as displayed in Fig. 2.

The rest of this paper is organized as follows: Section 2 presents the background. Section 3 describes the methodology. Section 4 presents and discusses the nine main directions of FS research on microarray data over the last seven years that we propose in this survey. The development of FS publications over the last seven years is presented in Section 5. Section 6 identifies the main observations, and Section 7 presents the conclusion and future work.

Section snippets

Gene expression microarray data

Gene expression microarray data are structured medical data in which each instance represents a patient and each feature represents the expression coefficient of a gene in that patient's sample. Microarray datasets are usually high-dimensional, as they contain a huge number of features versus a small number of samples [24].

Importance of FS to microarray datasets analysis

Detecting cancer-infected genes and normal healthy genes from the microarray dataset is challenging in high dimensional microarray datasets which contain many redundant and irrelevant

Methodology

This survey was conducted by applying five main steps as summarized in Fig. 2:

  • First, we collected 132 papers on FS in microarray data processing published by three popular publishers (Elsevier, Springer, and IEEE) during the last seven years.

  • These papers were analyzed and distributed into nine main directions based on their main objectives, such as proposing new FS methods that do not exist in the literature, conducting surveys, or comparing the performance of existing FS methods.

  • Each

FS in microarray data researches main directions

In the literature, there are many studies on FS for microarray data processing, which we categorize into nine main research directions as illustrated in Fig. 3. In each direction, papers were grouped and connected on two levels. At the first level, the papers were presented in an order determined by specific criteria that vary from one direction to another (for example, D3 and D5 connected papers that used the same meta-heuristic algorithm

Development of feature selection publications over the recent seven years

Tracking the research published over the last seven years across all the directions discussed in this survey shows an increase in the number of papers published in the last four years, especially in 2018 and 2019, compared with the years before 2018. This holds for most directions.

We noticed that there are more works published in relation to D1, D2, D3, D5, D6, and D8 in the recent three years (2018–2021) than works

Observations and analyses

In this systematic review, we investigated 132 research papers on FS for microarray data processing published by three well-known publishers (Elsevier, Springer, and IEEE) during the last seven years. We observed that researchers focused on the fifth direction (D5), “Hybrid FSM”, which constituted 34.9% of the papers examined. We believe that the reason for this high percentage could be that hybrid methods generally improve classification accuracy without causing

Conclusion and future work

We examined all papers concerning FS for microarray data processing published during the last seven years by three well-known publishers (Elsevier, Springer, and IEEE). We found that 38% of these papers were published by Springer. The reviewed papers were categorized by their main purposes into nine directions and then summarized according to which directions received the most, the middle, and the least research attention in all 132 papers that were reviewed in this

CRediT author statement

Esra'a Alhenawi: Methodology, Writing - original draft preparation, Validation. Rizik Al-Sayyed: Supervision, Writing - review & editing, Visualization, Validation. Amjad Hudaib: Supervision, Writing - review, Validation. Seyedali Mirjalili: Supervision, Review & editing, Visualization, Validation.

Declaration of competing interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in

Acknowledgements

We would like to thank everyone who provided technical help, assisted in reviewing and editing the language of the manuscript, or offered general support and useful comments regarding this manuscript.

References (165)

  • S. Tabakhi et al., Gene selection for microarray data classification using a novel ant colony optimization, Neurocomputing (2015)
  • M. Daoud et al., A survey of neural network-based cancer prediction models from microarray data, Artif. Intell. Med. (2019)
  • C. Kang et al., Feature selection and tumor classification for microarray data using relaxed lasso and generalized multi-class support vector machine, J. Theor. Biol. (2019)
  • S. Maldonado et al., Dealing with high-dimensional class-imbalanced datasets: embedded feature selection for SVM classification, Appl. Soft Comput. (2018)
  • S. Mishra et al., SVM-BT-RFE: an improved gene selection framework using Bayesian t-test embedded in support vector machine (recursive feature elimination) algorithm, Karbala Int. J. Modern Sci. (2015)
  • P. Moradi et al., A hybrid particle swarm optimization for feature subset selection by integrating a novel local search strategy, Appl. Soft Comput. (2016)
  • K.-H. Chen et al., Applying particle swarm optimization-based decision tree classifier for cancer classification on gene expression data, Appl. Soft Comput. (2014)
  • A. Zakeri et al., Efficient feature selection method using real-valued grasshopper optimization algorithm, Expert Syst. Appl. (2019)
  • M. Ghosh et al., Recursive memetic algorithm for gene selection in microarray data, Expert Syst. Appl. (2019)
  • H. Lu et al., A hybrid feature selection algorithm for gene expression data classification, Neurocomputing (2017)
  • M. Dashtban et al., Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts, Genomics (2017)
  • H. Salem et al., Classification of human cancer diseases by gene expression profiles, Appl. Soft Comput. (2017)
  • A.K. Shukla et al., A two-stage gene selection method for biomarker discovery from microarray data for cancer classification, Chemometr. Intell. Lab. Syst. (2018)
  • A.K. Shukla et al., A hybrid gene selection method for microarray recognition, Biocybernet. Biomed. Eng. (2018)
  • R. Luque-Baena et al., Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords, J. Biomed. Inf. (2014)
  • X. Gao et al., A novel effective diagnosis model based on optimized least squares support machine for gene microarray, Appl. Soft Comput. (2018)
  • O. Dagliyan et al., Optimization based tumor classification from microarray gene expression data, PLoS One (2011)
  • G. Manikandan et al., A survey on feature selection and extraction techniques for high-dimensional microarray datasets
  • T. Saw et al., Swarm intelligence based feature selection for high dimensional classification: a literature survey, Int. J. Comput. (2019)
  • G. Manikandan et al., Feature selection is important: state-of-the-art methods and application domains of feature selection on high-dimensional data
  • T. Almutiri et al., Review on feature selection methods for gene expression data classification
  • A.K. Shukla et al., A Study on Metaheuristics Approaches for Gene Selection in Microarray Data: Algorithms, Applications and Open Challenges (2019)
  • A. Alonso-Betanzos et al., Feature selection applied to microarray data
  • N. Sánchez-Maroño et al., Classification of microarray data
  • N.A. Zamri et al., Review on the usage of swarm intelligence in gene expression data
  • N. Almugren et al., A survey on hybrid feature selection methods in microarray gene expression data for cancer classification, IEEE Access (2019)
  • J.C. Ang et al., Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection, IEEE ACM Trans. Comput. Biol. Bioinf. (2015)
  • V. Bolón-Canedo et al., Challenges and future trends for microarray analysis
  • S. Vanjimalar et al., A review on feature selection techniques for gene expression data
  • S.D. Bharathi et al., A survey on gene selection for microarray cancer classification based on soft computing techniques
  • A. Jović et al., A review of feature selection methods with applications
  • K.P. Shroff et al., A comparative study of various feature selection techniques in high-dimensional data set to improve classification accuracy
  • V. Bolón-Canedo et al., Feature selection in DNA microarray classification
  • Z. Mungloo-Dilmohamud et al., A meta-review of feature selection techniques in the context of microarray data
  • J. Li et al., Feature selection: a data perspective, ACM Comput. Surv. (2017)
  • D. Cai et al., Unsupervised feature selection for multi-cluster data
  • S. Huang, Supervised feature selection: a tutorial, Artif. Intell. Res. (2015)
  • L. Fu et al., Condition monitoring for the roller bearings of wind turbines under variable working conditions based on the Fisher score and permutation entropy, Energies (2019)
  • M.A. Sulaiman et al., Feature selection based on mutual information
  • X. He et al., Laplacian score for feature selection, Adv. Neural Inf. Process. Syst. (2005)

Esra'a Alhenawi is currently a PhD candidate at The University of Jordan, King Abdullah II School for Information Technology, Department of Computer Science.

Rizik Al-Sayyed is a professor in computer networks, cloud computing, database systems, and simulation at The University of Jordan, King Abdullah II School for Information Technology, Department of Information Technology.

Amjad Hudaib is a professor in Software Engineering at The University of Jordan, King Abdullah II School for Information Technology, Department of Computer Information Systems.

Seyedali Mirjalili is currently an Associate Professor and the director of the Centre for Artificial Intelligence Research and Optimization at Torrens University Australia. He is internationally well recognized in Swarm Intelligence and Optimization.