Review
Gene reduction and machine learning algorithms for cancer classification based on microarray gene expression data: A comprehensive review
Introduction
In recent years, data analytics has gained considerable attention in biomedical research, with a strong focus on the most critical clinical conditions, such as cancer (Momeni, Hassanzadeh, Abadeh, & Bellazzi, 2020). For example, DNA microarray techniques have been adopted to represent and describe the expression of thousands of genes and generate microarray gene expression data (Sayed, Nassef, Badr, & Farag, 2019). Gene expression profiling captures the current physiological state of a cell and contains valuable information about gene activity. In the past few years, these data have been used effectively for the early diagnosis and prognosis of cancer types (Bolón-Canedo & Alonso-Betanzos, 2019).
A significant drawback of such data is that they comprise a very large number of genes (the curse of dimensionality) but only a few observations (the curse of data sparsity). High-dimensional data that include irrelevant, redundant, and noisy genes can lead to incorrect diagnoses because only a few genes are cancer-causing (Almugren & Alshamlan, 2019c). Such datasets are prone to overfitting, produce non-reproducible results, and are affected by high variance (Momeni et al., 2020). Analyzing a dataset that consists of a limited number of observations and a huge number of genes is therefore a daunting task.
Dimensionality reduction algorithms, including feature extraction and feature selection algorithms, are commonly used to solve this problem. Feature (gene) extraction algorithms are statistical methods that transform the existing input feature space into a new one by generating new features from the existing ones, thereby reducing the representation of the input features (Momeni et al., 2020). Because they alter the input feature space during the transformation, the final prediction model built on the new feature space lacks interpretability. Consequently, feature selection algorithms are frequently preferred to feature extraction algorithms for reducing the dimensionality of microarray gene expression data (Momeni et al., 2020). These algorithms have been applied to gene expression datasets to select ideal discriminating genes called biomarkers (Zawbaa, Emary, Grosan, & Snasel, 2018). Accordingly, a successful gene selection algorithm should select the fewest genes that achieve the best performance based on specific criteria, such as model accuracy, sensitivity, and specificity (Chen, Meng, & Su, 2020).
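To make the idea of feature (gene) selection concrete, the following is a minimal sketch of a filter-style selection step. It assumes a small binary-labelled expression matrix; the signal-to-noise score, the function names, and the toy data are illustrative assumptions and are not taken from any of the studies cited here. Note that the selected features remain original genes, which is the interpretability advantage described above.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Absolute signal-to-noise score of one gene between two classes.

    Assumes each class has at least two samples (required by stdev).
    """
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / spread


def select_top_genes(X, y, k):
    """Return the indices of the k highest-scoring genes.

    X: list of samples, each a list of expression values (one per gene).
    y: list of binary class labels (0 or 1), one per sample.
    """
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((signal_to_noise(a, b), g))
    # Highest score first; ties are broken by gene index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:k]]


# Toy data (hypothetical): 6 samples x 4 genes.
# Gene 2 separates the classes strongly, gene 0 only weakly.
X = [
    [1.0, 5.0, 0.1, 3.0],
    [1.2, 4.0, 0.2, 3.1],
    [0.9, 6.0, 0.0, 2.9],
    [1.1, 5.5, 2.0, 3.0],
    [1.3, 4.5, 2.1, 3.2],
    [1.0, 5.0, 1.9, 2.8],
]
y = [0, 0, 0, 1, 1, 1]

selected = select_top_genes(X, y, k=2)
print(selected)
```

A filter of this kind scores each gene independently of any classifier; wrapper and embedded approaches would instead involve the learning algorithm in the search.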
Fig. 1 illustrates the main stages in microarray data analysis. These stages include data acquisition, data exploration, data preprocessing, dimensionality reduction, machine learning (ML) algorithm, and evaluation. As shown in the diagram, data acquisition is the process of obtaining or gathering data to work with. Data exploration is then used to understand the nature of the acquired data by examining its features, format, and quality. Next, data preprocessing converts the raw data into a readable and high-quality format. Afterward, dimensionality reduction algorithms can be used to identify the relevant genes. An ML algorithm is then used to create a cancer classification model. Finally, the ML model is evaluated.
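The stages described above can be sketched end to end in a few lines. This is only an illustrative pipeline under stated assumptions: the toy matrix stands in for acquired data, z-scoring stands in for preprocessing, a class-mean-difference ranking stands in for dimensionality reduction, a nearest-centroid rule stands in for the ML algorithm, and leave-one-out cross-validation stands in for evaluation. None of these specific choices or names comes from Fig. 1 or from the reviewed studies.

```python
from statistics import mean, pstdev


def zscore_columns(X):
    """Preprocessing: standardize each gene to zero mean and unit variance."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant genes
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in X]


def top_k_genes(X, y, k):
    """Dimensionality reduction: rank genes by absolute class-mean difference."""
    def score(g):
        a = [r[g] for r, c in zip(X, y) if c == 0]
        b = [r[g] for r, c in zip(X, y) if c == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=lambda g: (-score(g), g))[:k]


def fit_centroids(X, y):
    """ML algorithm: nearest-centroid model, one mean vector per class."""
    model = {}
    for c in set(y):
        rows = [r for r, label in zip(X, y) if label == c]
        model[c] = [mean(col) for col in zip(*rows)]
    return model


def predict(model, row):
    """Assign the class whose centroid is closest in squared distance."""
    def dist(c):
        return sum((v - m) ** 2 for v, m in zip(row, model[c]))
    return min(sorted(model), key=dist)


def loocv_accuracy(X, y, k):
    """Evaluation: leave-one-out CV with genes re-selected inside each fold."""
    X = zscore_columns(X)  # uses all samples but no labels
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_k_genes(X_tr, y_tr, k)
        train = [[r[g] for g in genes] for r in X_tr]
        test_row = [X[i][g] for g in genes]
        hits += predict(fit_centroids(train, y_tr), test_row) == y[i]
    return hits / len(X)


# Data acquisition stand-in: hypothetical 6 samples x 4 genes, binary labels.
X = [
    [1.0, 5.0, 0.1, 3.0],
    [1.2, 4.0, 0.2, 3.1],
    [0.9, 6.0, 0.0, 2.9],
    [1.1, 5.5, 2.0, 3.0],
    [1.3, 4.5, 2.1, 3.2],
    [1.0, 5.0, 1.9, 2.8],
]
y = [0, 0, 0, 1, 1, 1]

accuracy = loocv_accuracy(X, y, k=1)
print(accuracy)
```

Selecting genes inside each fold, rather than once on the full dataset, keeps the held-out sample from influencing which genes are chosen, which matters when observations are as scarce as they are in microarray studies.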
Surveys that address numerous feature reduction algorithms for gene analysis have been reported in the literature. However, to the best of our knowledge, none provides a comprehensive, systematic, and broad view of data preprocessing methods, data reduction algorithms (feature selection, feature extraction, and their hybrid) from an ML perspective, and the ML algorithms employed. For example, a previous survey (Kumar, Sooraj, & Ramakrishnan, 2017) focused only on gene selection algorithms based on supervised learning algorithms, whereas another (Almugren & Alshamlan, 2019c) concentrated only on hybrid gene selection algorithms. Furthermore, a previous review (Mahendran, Vincent, Srinivasan, & Chang, 2020) focused only on gene selection algorithms from an ML perspective, classifying them into supervised, unsupervised, and semisupervised feature selection algorithms. Finally, an earlier survey (Hira & Gillies, 2015) focused only on feature selection and extraction algorithms for DNA microarray datasets.
The present review bridges the gap identified above while examining various recent studies in genomics in detail, investigating the severe difficulty of dealing with microarray datasets, particularly gene expression datasets, and several proposed solutions. A summary of various data preprocessing methods for gene expression datasets is presented. A novel taxonomy of data reduction algorithms is proposed, classifying the reviewed research into three groups: research on feature selection, feature extraction, and their hybrid. Feature selection studies include filter, wrapper, embedded, ensemble, and hybrid algorithms. Therein, each class is further classified as supervised, unsupervised, or semisupervised based on the applied ML algorithm (Fig. 2). Finally, this review examines frequently used ML algorithms and summarizes the deductions and new trends in disease diagnosis and prognosis techniques based on microarray gene expression data.
Key contributions of this review are as follows:
- Examine up-to-date data reduction algorithms for microarray gene expression datasets under three classes: feature selection, feature extraction, and hybrid of feature extraction and selection.
- Examine recent feature selection algorithms for microarray gene expression datasets under five classes: filter, wrapper, embedded, ensemble, and hybrid. They are further classified on the basis of ML algorithms under three classes: supervised, unsupervised, and semisupervised.
- Examine ML algorithms used to diagnose cancer based on a DNA microarray.
- Provide a summary of datasets used to develop cancer classification models based on the microarray.
- Analyze the performance of developed algorithms and provide an outline of strengths and limitations of each type of data reduction algorithm.
- List and briefly explain outstanding challenges of coping with microarray datasets and future work, which might help draw a roadmap for researchers on recent trends and potential future research directions.
The rest of the paper is structured as follows. Section 2 presents the methodology and strategies used to obtain study details such as research questions, keywords, and study selection criteria. The data preprocessing overview is thoroughly explained in Section 3. Sections 4, 5, and 6 present the reviews of feature selection algorithms, feature extraction algorithms and the hybrid of feature extraction and selection algorithms, and ML algorithms, respectively. The main observations are presented in Section 7. Finally, Section 8 discusses the challenges and future trends, and Section 9 presents the conclusion of this review.
Section snippets
Review methodology
This section highlights the methods and protocols used to produce this review, the search strategy used, and the research questions. The general article selection process as well as inclusion and exclusion metrics are also provided.
Data preprocessing overview
ML and deep learning (DL) algorithms are extremely sensitive to low-quality input data. Therefore, examining and improving data quality is a critical step in analyzing gene expressions. Several factors should be considered to ensure data quality, including existence, validity, accuracy, timeliness, believability, interpretability, completeness, relevance, and consistency. Simply put, it is a series of steps that function in tandem. The first step is ensuring the data is accessible and recorded
Feature selection algorithms
The human body comprises thousands of genes, although only a few are implicated in cancer diagnosis and complications. Therefore, identifying the associated genes in a typical cancer type, such as leukemia, may facilitate cancer condition analysis. Precisely, $2^n$ possible subsets of genes are available for a dataset comprising n features (Pirgazi, Alimoradi, Abharian, & Olyaee, 2019). Therefore, a successful gene selection algorithm should select the minimum number of genes
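The size of the gene-subset search space can be illustrated with a short sketch. A set of n genes has 2^n subsets (counting the empty set), so exhaustive evaluation is only feasible for very small n. The gene names and function names below are hypothetical and serve only to show the growth.

```python
from itertools import combinations


def count_gene_subsets(n):
    """Number of distinct subsets (including the empty set) of n genes."""
    return 2 ** n


def enumerate_subsets(genes):
    """Yield every subset of the given genes, smallest subsets first."""
    for size in range(len(genes) + 1):
        yield from combinations(genes, size)


# Exhaustive enumeration is only practical for a handful of genes.
subsets = list(enumerate_subsets(["g1", "g2", "g3"]))
print(len(subsets), count_gene_subsets(3))

# The count doubles with every additional gene.
for n in (10, 20, 30):
    print(n, count_gene_subsets(n))
```

Because the count doubles with each added gene, practical gene selection relies on ranking, heuristic, or metaheuristic search rather than on evaluating every subset.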
Review of feature extraction algorithms
Feature extraction transforms input data with several features into a reduced representation of a feature set. For example, given , determine a new set of n features by transforming X of l to a new set of g independent variables known as components, where . Feature extraction algorithms are categorized as linear and nonlinear algorithms. Linear feature extraction copes with linear separable features; the high-dimensional space is mapped to a new low-dimensional linear subspace
Classification problem definition
In a classification problem, a given input vector is the observation of components called features. However, each observation represents a gene expression profile, the features are gene expression coefficients, and observations correspond to patients in this review. A training set of observations is denoted by , and the class labels is denoted by , , where j is number of observations and n is the number of classes. Decision functions that are the
Analysis and discussion
This section analyzes the results of the 155 selected studies connected with the previously defined research questions. Furthermore, the reviewed results of (1) commonly used microarray gene expression datasets, (2) extensively applied gene reduction algorithms, (3) broadly developed gene selection algorithms, and (4) frequently applied ML algorithms are presented and discussed in the following subsections.
Challenges and future directions
This section outlines future research directions that can be followed to address the recent limitations of microarray gene expression datasets, enhance the efficiency of ML algorithms and classify and detect cancer types with great precision. Despite the positive effects of the reviewed literature, some limitations and challenges must be overcome. Therefore, this review highlighted essential difficulties and inherent trends, directions for future research, and challenges in the following
Conclusion
Microarray gene expression datasets have an excessive number of dimensions and few samples. Our review concludes that data dimensionality significantly influences a diagnostic system’s accuracy in many circumstances. Therefore, effective algorithms are required to handle these datasets by decreasing redundancy and dependency and preserving informative genes. This review examined studies on gene expression microarray datasets from 2010 to 2022 and data reduction algorithms used to eliminate
CRediT authorship contribution statement
Sarah Osama: Conceptualization, Methodology, Formal analysis, Resources, Writing – original draft, Writing – review & editing. Hassan Shaban: Supervision, Writing – review & editing. Abdelmgeid A. Ali: Supervision, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (203)
- et al. G-forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artificial Intelligence in Medicine (2020).
- et al. Gene selection and classification of microarray gene expression data based on a new adaptive L1-norm elastic net penalty. Informatics in Medicine Unlocked (2021).
- Co-ABC: Correlation artificial bee colony algorithm for biomarker gene discovery using gene expression profile. Saudi Journal of Biological Sciences (2018).
- et al. Genetic bee colony (GBC) algorithm: A new gene selection method for microarray cancer classification. Computational Biology and Chemistry (2015).
- et al. Memory based cuckoo search algorithm for feature selection of gene expression dataset. Informatics in Medicine Unlocked (2021).
- et al. A novel approach for dimension reduction of microarray. Computational Biology and Chemistry (2017).
- et al. An ensemble of filters and classifiers for microarray data classification. Pattern Recognition (2012).
- et al. Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics (2006).
- et al. Applying particle swarm optimization-based decision tree classifier for cancer classification on gene expression data. Applied Soft Computing (2014).
- et al. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood (2004).
- A hybrid feature selection method for DNA microarray data. Computers in Biology and Medicine.
- Gene selection and classification of microarray data method based on mutual information and moth flame algorithm. Expert Systems with Applications.
- Ensemble feature selection using bi-objective genetic algorithm. Knowledge-Based Systems.
- An adaptive harmony search approach for gene selection and classification of high dimensional medical data. Journal of King Saud University-Computer and Information Sciences.
- Pipelining the ranking techniques for microarray data classification: A case study. Applied Soft Computing.
- Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics.
- Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics.
- Ensemble of feature selection methods: A hesitant fuzzy sets approach. Applied Soft Computing.
- Cancer classification using a novel gene selection approach by means of shuffling based on data clustering with optimization. Applied Soft Computing.
- Independent component analysis: mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics.
- A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Applied Soft Computing.
- A novel effective diagnosis model based on optimized least squares support machine for gene microarray. Applied Soft Computing.
- Recursive memetic algorithm for gene selection in microarray data. Expert Systems with Applications.
- Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics.
- A L1-regularized feature selection method for local dimension reduction on microarray data. Computational Biology and Chemistry.
- Feature subset selection by gravitational search algorithm optimization. Information Sciences.
- A hybrid cancer classification model based recursive binary gravitational search algorithm in microarray data. Procedia Computer Science.
- Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Applied Soft Computing.
- A database for post-genome analysis. Trends in Genetics: TIG.
- Feature selection and tumor classification for microarray data using relaxed lasso and generalized multi-class support vector machine. Journal of Theoretical Biology.
- A study on wrapper-based feature selection algorithm for leukemia dataset.
- A new optimized wrapper gene selection method for breast cancer prediction. CMC-Computers Materials & Continua.
- Gene encoder: A feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Computing and Applications.
- A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Advances in Data Analysis and Classification.
- Multistage feature selection approach for high-dimensional cancer data. Soft Computing.
- Optimal feature selection using binary teaching learning based optimization algorithm. Journal of King Saud University-Computer and Information Sciences.
- FF-SVM: new firefly-based gene selection algorithm for microarray cancer classification.
- New bio-marker gene discovery algorithms for cancer gene expression profile. IEEE Access.
- A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access.
- A hybrid filter-wrapper gene selection method for cancer classification.
- A novel gene selection method using modified MRMR and hybrid bat-inspired algorithm with β-hill climbing. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies.
- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences.
- Optimized feature selection method using particle swarm intelligence with ensemble learning for cancer classification based on microarray datasets. Neural Computing and Applications.
- mRMR-ABC: A hybrid gene selection algorithm for cancer classification using microarray gene expression profiling. Biomed Research International.
- The monarch butterfly optimization algorithm for solving feature selection problems. Neural Computing and Applications.
- A hybrid feature selection method for complex diseases SNPs. IEEE Access.
- Clustering-based hybrid feature selection approach for high dimensional microarray data. Chemometrics and Intelligent Laboratory Systems.
- Artificial neural network classification of microarray data using new hybrid gene selection method. International Journal of Data Mining and Bioinformatics.
- Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction. Annals of Data Science.
- Dimensionality reduction and class prediction algorithm with application to microarray big data. Journal of Big Data.