Review
Gene reduction and machine learning algorithms for cancer classification based on microarray gene expression data: A comprehensive review

https://doi.org/10.1016/j.eswa.2022.118946

Highlights

  • A comprehensive review on microarray gene expression reduction is proposed.

  • A novel taxonomy of up-to-date gene reduction algorithms is presented.

  • The strengths and limitations of each type of gene reduction algorithm are provided.

  • An outline of the frequently used types of machine learning algorithms is provided.

  • Analyze the performance of the state-of-the-art algorithms.

Abstract

Disease diagnosis and prediction methods in biotechnology and medicine have advanced significantly over time. Consequently, analyzing raw gene expression is crucial for identifying diseases such as cancer. Microarrays are a tool that records gene expression from deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). This technique generates high-dimensional data with a small sample size. With such datasets, however, a classification model is prone to overfitting. This limitation can be overcome by reducing the dimensions of the microarray datasets to a reasonable number. Machine learning (ML)-based data reduction has recently received considerable attention in genomic research. Therefore, this review examines recent studies that present state-of-the-art data reduction and classification algorithms for microarray gene expression data to diagnose tumors and analyzes their performance. To the best of our knowledge, this is the first review that provides a comprehensive view of data preprocessing; dimensionality reduction, including feature (i.e., gene) selection, feature extraction, and their hybrid; and ML algorithms. The paper is structured as follows. First, this review summarizes several data preprocessing methods applied to gene expression datasets. Then, various ML-based feature selection algorithms, including filter, wrapper, embedded, ensemble, and hybrid algorithms, are reviewed in detail. These algorithms are examined under three main classes: supervised, unsupervised, and semisupervised ML. Next, feature extraction algorithms and hybrids of feature extraction and selection algorithms are thoroughly reviewed. Furthermore, a detailed review of ML algorithms broadly applied to simplify tumor and nontumor classification using microarray datasets is presented. Finally, the challenges and open questions related to gene expression datasets for accurate cancer classification and detection are highlighted.

Introduction

In recent years, data analytics has gained considerable attention in biomedical research, with a strong focus on the most critical clinical conditions, such as cancer (Momeni, Hassanzadeh, Abadeh, & Bellazzi, 2020). For example, DNA microarray techniques have been adopted to measure the expression of thousands of genes and generate microarray gene expression data (Sayed, Nassef, Badr, & Farag, 2019). Gene expression profiling captures the current physiological state of a cell and contains valuable information about gene activity. In the past few years, these data have been used effectively for early diagnosis and prognosis of cancer types (Bolón-Canedo & Alonso-Betanzos, 2019).

A significant drawback of such data is that they comprise a very large number of genes (the curse of dimensionality) but only a few observations (the curse of data sparsity). High-dimensional data that include irrelevant, redundant, and noisy genes can lead to incorrect diagnoses because only a few genes are cancer-causing (Almugren & Alshamlan, 2019c). Such datasets are prone to overfitting, produce non-reproducible results, and are affected by high variance (Momeni et al., 2020). Analyzing a dataset that consists of a limited number of observations and a huge number of genes is therefore a daunting task.

Dimensionality reduction algorithms, including feature extraction and feature selection algorithms, are commonly used to solve this problem. Feature (gene) extraction algorithms are statistical methods that transform the existing input space into a new one by generating new features from the existing ones, thereby reducing the input feature representation (Momeni et al., 2020). Because they alter the input feature space during the transformation, the final prediction model built on the new feature space lacks interpretability. Feature selection algorithms, by contrast, are frequently preferred to feature extraction algorithms for reducing the dimensionality of microarray gene expression data (Momeni et al., 2020). These algorithms have been applied to gene expression datasets to select ideal discriminating genes, called biomarkers (Zawbaa, Emary, Grosan, & Snasel, 2018). Accordingly, a successful gene selection algorithm should select the fewest genes that achieve the best performance based on specific criteria, such as model accuracy, sensitivity, and specificity (Chen, Meng, & Su, 2020).
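As a minimal sketch of the gene selection idea described above, the following Python snippet ranks genes with a simple two-class signal-to-noise filter score and keeps the top k. The score, the function names (`snr_score`, `select_top_genes`), and the toy data are assumptions made for illustration; they are not a method taken from the reviewed studies.

```python
from statistics import mean, pstdev

def snr_score(values, labels):
    """Signal-to-noise score of one gene for a two-class problem (labels 0/1)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    diff = abs(mean(a) - mean(b))
    spread = pstdev(a) + pstdev(b)
    if spread == 0:
        # No within-class variation: perfectly separating if the means differ.
        return float("inf") if diff > 0 else 0.0
    return diff / spread

def select_top_genes(X, labels, k):
    """Return the indices of the k highest-scoring genes.

    X is a list of samples; each sample is a list of expression values.
    """
    n_genes = len(X[0])
    scores = [snr_score([row[g] for row in X], labels) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k]

# Hypothetical toy data: 4 samples x 3 genes; only gene 1 separates the classes.
X = [
    [5.0, 1.0, 3.0],
    [5.1, 1.2, 2.9],
    [5.0, 9.0, 3.1],
    [4.9, 9.1, 3.0],
]
labels = [0, 0, 1, 1]
selected = select_top_genes(X, labels, k=1)
print(selected)  # [1]
```

Because the selected features are original genes rather than transformed components, the result stays interpretable: the returned indices point directly at candidate biomarkers.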

Fig. 1 illustrates the main stages in microarray data analysis: data acquisition, data exploration, data preprocessing, dimensionality reduction, machine learning (ML) algorithm, and evaluation. As shown in the diagram, data acquisition is the process of obtaining or gathering the data to work with. Then, data exploration is used to understand the nature of the acquired data by comprehending its features, format, and quality. Next, data preprocessing is used to convert raw data into a readable and high-quality format. Afterward, dimensionality reduction algorithms can be used to identify the relevant genes. Then, an ML algorithm is used to create a cancer classification model. Finally, the ML model is evaluated.
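The stages above can be sketched end to end in a few lines of Python. This is only an illustrative pipeline under stated assumptions: z-score scaling stands in for preprocessing, a class-mean gap filter for dimensionality reduction, a nearest-centroid classifier for the ML algorithm, and leave-one-out accuracy for evaluation. All function names and the toy expression matrix are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(X):
    """Preprocessing: scale each gene (column) to zero mean and unit variance."""
    stats = [(mean(c), pstdev(c) or 1.0) for c in zip(*X)]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in X]

def top_k_genes(X, y, k):
    """Dimensionality reduction: keep the k genes whose class means differ most."""
    classes = sorted(set(y))

    def gap(g):
        means = [mean(r[g] for r, lab in zip(X, y) if lab == c) for c in classes]
        return max(means) - min(means)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def fit_centroids(X, y):
    """ML algorithm: nearest-centroid classifier (one mean profile per class)."""
    return {
        c: [mean(col) for col in zip(*[r for r, lab in zip(X, y) if lab == c])]
        for c in set(y)
    }

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
    return min(centroids, key=dist)

def leave_one_out_accuracy(X, y, k):
    """Evaluation: hold out each sample once; select genes on training data only."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_k_genes(X_train, y_train, k)
        reduced = [[r[g] for g in genes] for r in X_train]
        model = fit_centroids(reduced, y_train)
        hits += predict(model, [X[i][g] for g in genes]) == y[i]
    return hits / len(X)

# Data acquisition (hypothetical): 6 samples x 3 genes; gene 1 is informative.
raw = [
    [2.0, 10.0, 0.5], [2.1, 11.0, 0.4], [1.9, 10.5, 0.6],
    [2.0, 20.0, 0.5], [2.2, 21.0, 0.6], [1.8, 20.5, 0.4],
]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

X = zscore_columns(raw)
accuracy = leave_one_out_accuracy(X, labels, k=1)
print(accuracy)  # 1.0
```

Gene selection is repeated inside each leave-one-out fold so that the held-out sample never influences which genes are chosen; this is a design choice of the sketch, not a procedure prescribed by the text.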

Surveys that address numerous feature reduction algorithms for gene analysis have been reported in the literature. However, to the best of our knowledge, none provides a comprehensive, systematic, and broad view of data preprocessing methods; data reduction algorithms, including feature selection, feature extraction, and their hybrid, from an ML perspective; and the ML algorithms employed. For example, a previous survey (Kumar, Sooraj, & Ramakrishnan, 2017) focused only on gene selection algorithms based on supervised learning, whereas another survey (Almugren & Alshamlan, 2019c) concentrated only on hybrid gene selection algorithms. Furthermore, a previous review (Mahendran, Vincent, Srinivasan, & Chang, 2020) focused only on gene selection algorithms from an ML perspective, classifying them into supervised, unsupervised, and semisupervised feature selection algorithms. Finally, a survey published in 2015 (Hira & Gillies, 2015) focused only on feature selection and extraction algorithms for DNA microarray datasets.

The present review bridges the gap identified above by examining recent studies in genomics in detail and by investigating the severe difficulty of dealing with microarray datasets, particularly gene expression datasets, along with several proposed solutions. A summary of various data preprocessing methods for gene expression datasets is presented. A novel taxonomy of data reduction algorithms is proposed, classifying the reviewed research into three groups: research on feature selection, feature extraction, and their hybrid. Feature selection studies include filter, wrapper, embedded, ensemble, and hybrid algorithms. Each of these classes is further classified as supervised, unsupervised, or semisupervised based on the applied ML algorithm (Fig. 2). Finally, this review examines frequently used ML algorithms and summarizes the deductions and new trends in disease diagnosis and prognosis techniques based on microarray gene expression data.

Key contributions of this review are as follows:

  • Examine up-to-date data reduction algorithms for microarray gene expression datasets under three classes: feature selection, feature extraction, and hybrid of feature extraction and selection.

  • Examine recent feature selection algorithms for microarray gene expression datasets under five classes: filter, wrapper, embedded, ensemble, and hybrid. They are further classified on the basis of the ML algorithm under three classes: supervised, unsupervised, and semisupervised.

  • Examine ML algorithms used to diagnose cancer based on a DNA microarray.

  • Provide a summary of datasets used to develop cancer classification models based on the microarray.

  • Analyze the performance of developed algorithms and provide an outline of strengths and limitations of each type of data reduction algorithm.

  • List and briefly explain outstanding challenges of coping with microarray datasets and future work, which might help draw a roadmap for researchers on recent trends and potential future research directions.

The rest of the paper is structured as follows. Section 2 presents the methodology and strategies used to obtain study details such as research questions, keywords, and study selection criteria. The data preprocessing overview is thoroughly explained in Section 3. Sections 4, 5, and 6 present the reviews of feature selection algorithms, feature extraction algorithms and the hybrid of feature extraction and selection algorithms, and ML algorithms, respectively. The main observations are presented in Section 7. Finally, Section 8 discusses the challenges and future trends, and Section 9 presents the conclusion of this review.

Section snippets

Review methodology

This section highlights the methods and protocols used to produce this review, the search strategy used, and the research questions. The general article selection process as well as inclusion and exclusion metrics are also provided.

Data preprocessing overview

ML and deep learning (DL) algorithms are extremely sensitive to low-quality input data. Therefore, examining and improving data quality is a critical step in analyzing gene expressions. Several factors should be considered to ensure data quality, including existence, validity, accuracy, timeliness, believability, interpretability, completeness, relevance, and consistency. Simply put, it is a series of steps that function in tandem. The first step is ensuring the data is accessible and recorded
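One of the quality factors listed above, completeness, can be checked and repaired in a few lines. The following Python sketch measures the fraction of missing expression values and fills each gap with the mean of the observed values for that gene. Mean imputation, the use of `None` to mark a missing value, and the function names are assumptions chosen for illustration; the text does not state which imputation method the review covers.

```python
from statistics import mean

def missing_fraction(X):
    """Fraction of entries in the expression matrix that are missing (None)."""
    total = sum(len(row) for row in X)
    missing = sum(v is None for row in X for v in row)
    return missing / total

def impute_gene_means(X):
    """Replace each missing value with the mean of that gene's observed values."""
    fill = []
    for column in zip(*X):
        observed = [v for v in column if v is not None]
        # A gene with no observed values at all is filled with 0.0 (assumption).
        fill.append(mean(observed) if observed else 0.0)
    return [[fill[j] if v is None else v for j, v in enumerate(row)] for row in X]

# Hypothetical toy matrix: 3 samples x 2 genes with two missing entries.
X = [
    [1.0, None],
    [3.0, 4.0],
    [None, 6.0],
]
print(missing_fraction(X))
completed = impute_gene_means(X)
print(completed)  # [[1.0, 5.0], [3.0, 4.0], [2.0, 6.0]]
```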

Feature selection algorithms

The human body comprises thousands of genes, although only a few are implicated in cancer diagnosis and complications in the human body. Therefore, identifying the associated genes in a typical cancer type, such as leukemia, may facilitate cancer condition analysis. Precisely, $2^n$ possible subsets of genes are available for a dataset comprising $n$ features (Pirgazi, Alimoradi, Abharian, & Olyaee, 2019). Therefore, a successful gene selection algorithm should select the minimum number of genes
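To make the exponential number of gene subsets concrete, the short Python sketch below enumerates every subset for a tiny gene list and counts the subsets for larger n. The function names and the gene labels are hypothetical, and n = 2000 is an arbitrary illustrative size rather than a figure from the review.

```python
from itertools import combinations

def count_gene_subsets(n):
    """Number of possible gene subsets (including the empty set) for n genes."""
    return 2 ** n

def enumerate_subsets(genes):
    """Yield every subset of a small gene list; feasible only for tiny n."""
    for size in range(len(genes) + 1):
        yield from combinations(genes, size)

subsets = list(enumerate_subsets(["g1", "g2", "g3"]))
print(len(subsets))            # 8
print(count_gene_subsets(20))  # 1048576

# For an illustrative n = 2000, print how many decimal digits the count has.
print(len(str(count_gene_subsets(2000))))
```

Exhaustive search is therefore out of reach for realistic gene counts, which is why selection algorithms rely on rankings or heuristic search instead of testing every subset.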

Review of feature extraction algorithms

Feature extraction transforms input data with several features into a reduced representation of a feature set. For example, given $X=[x_1,x_2,\ldots,x_l]^T$, determine a new set of $n$ features by transforming $X$ of $l$ to a new set of $g$ independent variables known as components, where $n < l$. Feature extraction algorithms are categorized as linear and nonlinear algorithms. Linear feature extraction copes with linear separable features; the high-dimensional space is mapped to a new low-dimensional linear subspace
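As a hedged illustration of linear feature extraction, the Python sketch below projects samples onto their first principal component using power iteration. Principal component analysis is chosen here as an assumption because it is a widely used linear method; the text above does not name it. The function name `first_component` and the toy data are hypothetical.

```python
import math

def first_component(X, iterations=50):
    """Return (w, z): a unit direction w and the 1-D projections z of X onto it.

    X is a list of samples, each a list of l feature values.
    """
    n_samples, n_features = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n_samples for j in range(n_features)]
    centered = [[v - m for v, m in zip(row, means)] for row in X]

    w = [1.0] * n_features  # arbitrary starting direction
    for _ in range(iterations):
        # One power-iteration step: w <- X^T (X w), then renormalise.
        scores = [sum(a * b for a, b in zip(row, w)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(n_features)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0:
            raise ValueError("data have no variance along the starting direction")
        w = [v / norm for v in w]

    z = [sum(a * b for a, b in zip(row, w)) for row in centered]
    return w, z

# Hypothetical toy data: 3 samples with l = 3 features, reduced to 1 component.
X = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0]]
w, z = first_component(X)
print(w)  # direction mixing features 0 and 1; feature 2 is constant
print(z)  # one extracted value per sample
```

Each extracted value is a weighted combination of all original features, which is exactly why models built on such components are harder to interpret than models built on selected genes.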

Classification problem definition

In a classification problem, a given input vector is an observation of $f$ components called features. In this review, each observation represents a gene expression profile: the features are gene expression coefficients, and the observations correspond to patients. A training set of observations is denoted by $X=[x_1,x_2,\ldots,x_j]^T$, and the class labels are denoted by $Y=[y_1,y_2,\ldots,y_j]^T$, $y_j \in \{1,2,\ldots,n\}$, where $j$ is the number of observations and $n$ is the number of classes. Decision functions that are the

Analysis and discussion

This section analyzes the results of the 155 selected studies connected with the previously defined research questions. Furthermore, the reviewed results of (1) commonly used microarray gene expression datasets, (2) extensively applied gene reduction algorithms, (3) broadly developed gene selection algorithms, and (4) frequently applied ML algorithms are presented and discussed in the following subsections.

Challenges and future directions

This section outlines future research directions that can be followed to address the recent limitations of microarray gene expression datasets, enhance the efficiency of ML algorithms and classify and detect cancer types with great precision. Despite the positive effects of the reviewed literature, some limitations and challenges must be overcome. Therefore, this review highlighted essential difficulties and inherent trends, directions for future research, and challenges in the following

Conclusion

Microarray gene expression datasets have an excessive number of dimensions with few samples. Our review concludes that data dimensionality significantly influences a diagnostic system’s accuracy in many circumstances. Therefore, effective algorithms are required to handle these datasets by decreasing redundancy and dependency and preserving informative genes. This review examined studies on gene expression microarray datasets from 2010 to 2022 and data reduction algorithms used to eliminate

CRediT authorship contribution statement

Sarah Osama: Conceptualization, Methodology, Formal analysis, Resources, Writing – original draft, Writing – review & editing. Hassan Shaban: Supervision, Writing – review & editing. Abdelmgeid A. Ali: Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (203)

  • Chuang, L.-Y. et al. (2011). A hybrid feature selection method for DNA microarray data. Computers in Biology and Medicine.
  • Dabba, A. et al. (2021). Gene selection and classification of microarray data method based on mutual information and moth flame algorithm. Expert Systems with Applications.
  • Das, A.K. et al. (2017). Ensemble feature selection using bi-objective genetic algorithm. Knowledge-Based Systems.
  • Dash, R. (2021). An adaptive harmony search approach for gene selection and classification of high dimensional medical data. Journal of King Saud University-Computer and Information Sciences.
  • Dash, R. et al. (2016). Pipelining the ranking techniques for microarray data classification: A case study. Applied Soft Computing.
  • Dashtban, M. et al. (2017). Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics.
  • Dashtban, M. et al. (2018). Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics.
  • Ebrahimpour, M.K. et al. (2017). Ensemble of feature selection methods: A hesitant fuzzy sets approach. Applied Soft Computing.
  • Elyasigomari, V. et al. (2015). Cancer classification using a novel gene selection approach by means of shuffling based on data clustering with optimization. Applied Soft Computing.
  • Engreitz, J.M. et al. (2010). Independent component analysis: mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics.
  • Gangavarapu, T. et al. (2019). A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Applied Soft Computing.
  • Gao, X. et al. (2018). A novel effective diagnosis model based on optimized least squares support machine for gene microarray. Applied Soft Computing.
  • Ghosh, M. et al. (2019). Recursive memetic algorithm for gene selection in microarray data. Expert Systems with Applications.
  • Grisci, B.I. et al. (2019). Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics.
  • Guo, S. et al. (2017). A L1-regularized feature selection method for local dimension reduction on microarray data. Computational Biology and Chemistry.
  • Han, X. et al. (2014). Feature subset selection by gravitational search algorithm optimization. Information Sciences.
  • Han, X.H. et al. (2019). A hybrid cancer classification model based recursive binary gravitational search algorithm in microarray data. Procedia Computer Science.
  • Jain, I. et al. (2018). Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Applied Soft Computing.
  • Kanehisa, M. (1997). A database for post-genome analysis. Trends in Genetics: TIG.
  • Kang, C. et al. (2019). Feature selection and tumor classification for microarray data using relaxed lasso and generalized multi-class support vector machine. Journal of Theoretical Biology.
  • Abinash, M. et al. A study on wrapper-based feature selection algorithm for leukemia dataset.
  • Al-Baity, H.H. et al. (2021). A new optimized wrapper gene selection method for breast cancer prediction. CMC-Computers Materials & Continua.
  • Al-Obeidat, F. et al. (2020). Gene encoder: A feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Computing and Applications.
  • Algamal, Z.Y. et al. (2019). A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Advances in Data Analysis and Classification.
  • Alkuhlani, A. et al. (2017). Multistage feature selection approach for high-dimensional cancer data. Soft Computing.
  • Allam, M. et al. (2018). Optimal feature selection using binary teaching learning based optimization algorithm. Journal of King Saud University-Computer and Information Sciences.
  • Almugren, N. et al. FF-SVM: new firefly-based gene selection algorithm for microarray cancer classification.
  • Almugren, N. et al. (2019). New bio-marker gene discovery algorithms for cancer gene expression profile. IEEE Access.
  • Almugren, N. et al. (2019). A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access.
  • Alomari, O.A. et al. A hybrid filter-wrapper gene selection method for cancer classification.
  • Alomari, O.A. et al. (2018). A novel gene selection method using modified MRMR and hybrid bat-inspired algorithm with β-hill climbing. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies.
  • Alon, U. et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences.
  • Alrefai, N. et al. (2022). Optimized feature selection method using particle swarm intelligence with ensemble learning for cancer classification based on microarray datasets. Neural Computing and Applications.
  • Alshamlan, H. et al. (2015). mRMR-ABC: A hybrid gene selection algorithm for cancer classification using microarray gene expression profiling. Biomed Research International.
  • Alweshah, M. et al. (2020). The monarch butterfly optimization algorithm for solving feature selection problems. Neural Computing and Applications.
  • Alzubi, R. et al. (2017). A hybrid feature selection method for complex diseases SNPs. IEEE Access.
  • Annavarapu, C.S.R. et al. (2021). Clustering-based hybrid feature selection approach for high dimensional microarray data. Chemometrics and Intelligent Laboratory Systems.
  • Aziz, R. et al. (2017). Artificial neural network classification of microarray data using new hybrid gene selection method. International Journal of Data Mining and Bioinformatics.
  • Aziz, R. et al. (2018). Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction. Annals of Data Science.
  • Badaoui, F. et al. (2017). Dimensionality reduction and class prediction algorithm with application to microarray big data. Journal of Big Data.