Gene reduction and machine learning algorithms for cancer classification based on microarray gene expression data: A comprehensive review

doi:10.1016/j.eswa.2022.118946

Expert Systems with Applications

Volume 213, Part A, 1 March 2023, 118946

https://doi.org/10.1016/j.eswa.2022.118946

Highlights

• A comprehensive review of microarray gene expression reduction is presented.
• A novel taxonomy of up-to-date gene reduction algorithms is presented.
• The strengths and limitations of each type of gene reduction algorithm are provided.
• An outline of the frequently used types of machine learning algorithms is provided.
• The performance of state-of-the-art algorithms is analyzed.

Abstract

Disease diagnosis and prediction methods in biotechnology and medicine have advanced significantly over time. Consequently, analyzing raw gene expression is crucial for identifying diseases such as cancer. Microarrays are a tool that records gene expression from deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). This technique generates high-dimensional data with a small sample size, and a classification model trained on such datasets is prone to overfitting. This limitation can be overcome by reducing the dimensions of the microarray datasets to a reasonable number. Machine learning (ML)-based data reduction has recently attracted considerable attention in genomic research. Therefore, this review examines recent studies that present state-of-the-art data reduction and classification algorithms for microarray gene expression data to diagnose tumors, and analyzes their performance. To the best of our knowledge, this is the first review that provides a comprehensive view of data preprocessing; dimensionality reduction, including feature (i.e., gene) selection, feature extraction, and their hybrid; and ML algorithms. The paper is structured as follows. First, this review summarizes several data preprocessing methods applied to gene expression datasets. Then, various ML-based feature selection algorithms, including filter, wrapper, embedded, ensemble, and hybrid algorithms, are reviewed in detail. These algorithms are examined under three main classes: supervised, unsupervised, and semisupervised ML. Next, feature extraction algorithms and hybrids of feature extraction and selection algorithms are thoroughly reviewed. Furthermore, a detailed review of the ML algorithms broadly applied to tumor and nontumor classification using microarray datasets is presented. Finally, the challenges and open questions related to gene expression datasets for accurate cancer classification and detection are highlighted.

Introduction

In recent years, data analytics has gained considerable attention in biomedical research, with a strong focus on the most critical clinical conditions, such as cancer (Momeni, Hassanzadeh, Abadeh, & Bellazzi, 2020). For example, DNA microarray techniques have been adopted to measure the expression of thousands of genes and generate microarray gene expression data (Sayed, Nassef, Badr, & Farag, 2019). Gene expression profiling reflects the current physiological state of a cell and contains valuable information about gene activity. In the past few years, these data have been used effectively for early diagnosis and prognosis of cancer types (Bolón-Canedo & Alonso-Betanzos, 2019).

A significant drawback of such data is that they comprise a very large number of genes (the curse of dimensionality) but only a few observations (the curse of data sparsity). Moreover, because only a few genes are cancer-causing, high-dimensional data that include irrelevant, redundant, and noisy genes can lead to incorrect diagnoses (Almugren & Alshamlan, 2019c). Such datasets are prone to overfitting, produce non-reproducible results, and are affected by high variance (Momeni et al., 2020). Analyzing a dataset that consists of a limited number of observations and a huge number of genes is therefore a daunting task.

Dimensionality reduction algorithms, including feature extraction and feature selection algorithms, are commonly used to solve this problem. Feature (gene) extraction algorithms are statistical methods that transform the existing input space into a new one by generating new features from the existing ones, thereby reducing the representation of the input features (Momeni et al., 2020). Because they alter the input feature space during the transformation, the final prediction model built on the new feature space lacks interpretability. Feature selection algorithms are therefore frequently preferred to feature extraction algorithms for microarray gene expression dimensionality reduction (Momeni et al., 2020). These algorithms have been applied to gene expression datasets to select the most discriminating genes, called biomarkers (Zawbaa, Emary, Grosan, & Snasel, 2018). Accordingly, a successful gene selection algorithm should select the fewest genes that achieve the best performance based on specific criteria, such as model accuracy, sensitivity, and specificity (Chen, Meng, & Su, 2020).

Fig. 1 illustrates the main stages in microarray data analysis. These stages include data acquisition, data exploration, data preprocessing, dimensionality reduction, machine learning (ML) algorithm, and evaluation. As shown in the diagram, data acquisition is the process of obtaining or gathering the data to work with. Then, data exploration is used to understand the nature of the acquired data by comprehending its features, format, and quality. Next, data preprocessing is used to convert raw data into a readable and high-quality format. Afterward, dimensionality reduction algorithms can be used to identify the relevant genes. Then, an ML algorithm is used to create a cancer classification model. Finally, the ML model is evaluated.
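The stages described above can be sketched end to end on synthetic data. The following is only a minimal illustration, not a method from any reviewed study: the toy data generator, the mean-difference gene score, the nearest-centroid classifier, and all function names are assumptions introduced here. Preprocessing and gene selection are refit inside each leave-one-out fold so that the held-out sample never influences which genes are chosen.

```python
import random
import statistics

def make_toy_data(n_samples=20, n_genes=200, n_informative=5, seed=0):
    """Data acquisition stand-in: synthetic expression matrix (samples x genes)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):  # only the first genes carry class signal
            row[g] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

def standardize(train, test):
    """Preprocessing: z-score each gene using training statistics only."""
    n_genes = len(train[0])
    means = [statistics.fmean(r[g] for r in train) for g in range(n_genes)]
    stds = [statistics.pstdev([r[g] for r in train]) or 1.0 for g in range(n_genes)]
    scale = lambda rows: [[(r[g] - means[g]) / stds[g] for g in range(n_genes)] for r in rows]
    return scale(train), scale(test)

def select_genes(X, y, k):
    """Dimensionality reduction (filter): rank genes by |difference of class means|."""
    def score(g):
        a = [r[g] for r, lab in zip(X, y) if lab == 0]
        b = [r[g] for r, lab in zip(X, y) if lab == 1]
        return abs(statistics.fmean(a) - statistics.fmean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def nearest_centroid_predict(X_train, y_train, x, genes):
    """ML algorithm: assign x to the class whose centroid is closest on the selected genes."""
    best, best_d = None, float("inf")
    for lab in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        centroid = [statistics.fmean(r[g] for r in rows) for g in genes]
        d = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if d < best_d:
            best, best_d = lab, d
    return best

def loocv_accuracy(X, y, k=5):
    """Evaluation: leave-one-out cross-validation with per-fold preprocessing and selection."""
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        X_tr, X_te = standardize(X_tr, [X[i]])
        genes = select_genes(X_tr, y_tr, k)
        correct += nearest_centroid_predict(X_tr, y_tr, X_te[0], genes) == y[i]
    return correct / len(X)

X, y = make_toy_data()
print(len(X), "samples x", len(X[0]), "genes")  # data exploration: inspect the shape
accuracy = loocv_accuracy(X, y, k=5)
print(f"LOOCV accuracy with 5 selected genes: {accuracy:.2f}")
```

The sketch keeps the few-samples, many-genes shape typical of microarray data (20 samples, 200 genes); any real study would replace each stage with the methods appropriate to its dataset.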

Surveys that address numerous feature reduction algorithms for gene analysis have been reported in the literature. However, to the best of our knowledge, none provides a comprehensive, systematic, and broad view of data preprocessing methods, data reduction algorithms (including feature selection, feature extraction, and their hybrid) from an ML perspective, and the ML algorithms employed. For example, a previous survey (Kumar, Sooraj, & Ramakrishnan, 2017) focused only on various gene selection algorithms based on supervised learning algorithms, whereas another survey (Almugren & Alshamlan, 2019c) concentrated only on hybrid gene selection algorithms. Furthermore, a previous review (Mahendran, Vincent, Srinivasan, & Chang, 2020) focused only on gene selection algorithms from an ML perspective, classifying them into supervised, unsupervised, and semisupervised feature selection algorithms. Finally, an earlier survey (Hira & Gillies, 2015) focused only on feature selection and extraction algorithms in DNA microarray datasets.

The present review bridges the gap identified above by examining recent studies in genomics in detail and investigating the severe difficulty of dealing with microarray datasets, particularly gene expression datasets, along with several proposed solutions. A summary of various data preprocessing methods for gene expression datasets is presented. A novel taxonomy of data reduction algorithms is proposed, classifying the reviewed research into three groups: research on feature selection, feature extraction, and their hybrid. Feature selection studies include filter, wrapper, embedded, ensemble, and hybrid algorithms. Therein, each class is further classified as supervised, unsupervised, or semisupervised based on the applied ML algorithm (Fig. 2). Finally, this review examines frequently used ML algorithms and summarizes the deductions and new trends in disease diagnosis and prognosis techniques based on microarray gene expression data.

Key contributions of this review are as follows:

• Examine up-to-date data reduction algorithms for microarray gene expression datasets under three classes: feature selection, feature extraction, and hybrid of feature extraction and selection.
• Examine recent feature selection algorithms for microarray gene expression datasets under five classes: filter, wrapper, embedded, ensemble, and hybrid. They are further classified on the basis of ML algorithms under three classes: supervised, unsupervised, and semisupervised.
• Examine ML algorithms used to diagnose cancer based on DNA microarrays.
• Provide a summary of datasets used to develop microarray-based cancer classification models.
• Analyze the performance of developed algorithms and provide an outline of the strengths and limitations of each type of data reduction algorithm.
• List and briefly explain outstanding challenges of coping with microarray datasets and future work, which might help draw a roadmap for researchers on recent trends and potential future research directions.

The rest of the paper is structured as follows. Section 2 presents the methodology and strategies used to obtain study details such as research questions, keywords, and study selection criteria. The data preprocessing overview is thoroughly explained in Section 3. Sections 4, 5, and 6 present the reviews of feature selection algorithms, feature extraction algorithms and the hybrid of feature extraction and selection algorithms, and ML algorithms, respectively. The main observations are presented in Section 7. Finally, Section 8 discusses the challenges and future trends, and Section 9 presents the conclusion of this review.

Section snippets

Review methodology

This section highlights the methods and protocols used to produce this review, the search strategy used, and the research questions. The general article selection process as well as inclusion and exclusion metrics are also provided.

Data preprocessing overview

ML and deep learning (DL) algorithms are extremely sensitive to low-quality input data. Therefore, examining and improving data quality is a critical step in analyzing gene expressions. Several factors should be considered to ensure data quality, including existence, validity, accuracy, timeliness, believability, interpretability, completeness, relevance, and consistency. Simply put, it is a series of steps that function in tandem. The first step is ensuring the data is accessible and recorded

Feature selection algorithms

The human body comprises thousands of genes, although only a few are implicated in cancer diagnosis and complications in the human body. Therefore, identifying the associated genes in a typical cancer type, such as leukemia, may facilitate cancer condition analysis. Precisely, $2^{n}$ possible subsets of genes are available for a dataset comprising n features (Pirgazi, Alimoradi, Abharian, & Olyaee, 2019). Therefore, a successful gene selection algorithm should select the minimum number of genes
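To make the $2^{n}$ figure concrete, the short sketch below counts and enumerates gene subsets. It is an illustration only; the function names and the three placeholder gene labels are assumptions. Enumeration is feasible for a handful of genes, but the count for a few thousand genes is far too large to search exhaustively, which is why selection algorithms rely on ranking or heuristic search.

```python
from itertools import combinations

def count_gene_subsets(n):
    """Number of possible gene subsets (including the empty set) for n genes."""
    return 2 ** n

def enumerate_subsets(genes):
    """Yield every subset of a small gene list, from size 0 up to len(genes)."""
    for size in range(len(genes) + 1):
        yield from combinations(genes, size)

# Exhaustive enumeration is only practical for very small n.
subsets = list(enumerate_subsets(["g1", "g2", "g3"]))
print(len(subsets), "subsets for 3 genes")

# For a microarray-sized feature space the count has hundreds of digits.
digits = len(str(count_gene_subsets(2000)))
print("2^2000 has", digits, "decimal digits")
```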

Review of feature extraction algorithms

Feature extraction transforms input data with many features into a reduced feature set. For example, given $X = [x_{1}, x_{2}, \dots, x_{l}]^{T}$, feature extraction maps the $l$ original features of $X$ to a new set of $n$ independent variables known as components, where $n \ll l$. Feature extraction algorithms are categorized as linear and nonlinear algorithms. Linear feature extraction copes with linearly separable features; the high-dimensional space is mapped to a new low-dimensional linear subspace
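As one possible illustration of linear feature extraction, the sketch below projects $l$ original features onto $n$ components using principal component analysis computed by power iteration with deflation. PCA is chosen here only as a familiar example of a linear method; the helper names, the toy data, and the iteration count are assumptions, and a numerical library would normally be used instead of this pure-Python version.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(X):
    """Subtract the per-feature mean so components describe variance around the mean."""
    n_rows, l = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n_rows for j in range(l)]
    return [[r[j] - means[j] for j in range(l)] for r in X]

def leading_direction(R, n_iter=200, seed=0):
    """Power iteration on R^T R: returns a unit vector along the dominant direction of R."""
    rng = random.Random(seed)
    l = len(R[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(l)]
    for _ in range(n_iter):
        scores = [dot(r, w) for r in R]                                     # R w
        w = [sum(s * r[j] for s, r in zip(scores, R)) for j in range(l)]    # R^T (R w)
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
    return w

def pca(X, n_components):
    """Map l original features to n_components new features (linear combinations)."""
    Xc = center(X)
    residual = [row[:] for row in Xc]
    components = []
    for _ in range(n_components):
        w = leading_direction(residual)
        components.append(w)
        # Deflation: remove the part of each row already explained by w.
        residual = [[r[j] - dot(r, w) * w[j] for j in range(len(w))] for r in residual]
    Z = [[dot(r, w) for w in components] for r in Xc]
    return Z, components

# Toy matrix: 8 observations with l = 6 features, reduced to n = 2 components.
rng = random.Random(1)
X_toy = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(8)]
Z, components = pca(X_toy, n_components=2)
print(len(Z), "observations x", len(Z[0]), "components")
```

Each extracted component mixes all original features, which illustrates why models built on extracted features are harder to interpret in terms of individual genes.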

Classification problem definition

In a classification problem, a given input vector is an observation of $f$ components called features. In this review, each observation represents the gene expression profile of a patient, and the features are gene expression coefficients. A training set of observations is denoted by $X = [x_{1}, x_{2}, \dots, x_{j}]^{T}$, and the class labels are denoted by $Y = [y_{1}, y_{2}, \dots, y_{j}]^{T}$, with $y_{i} \in \{1, 2, \dots, n\}$ for $i = 1, \dots, j$, where $j$ is the number of observations and $n$ is the number of classes. Decision functions that are the
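The notation above can be mirrored in a small sketch: a training set of $j$ observations with $f$ features each, labels drawn from $\{1, \dots, n\}$, and a function that maps a new observation to one of the $n$ classes. The k-nearest-neighbor rule used here is only one example of such a function and is not taken from the text; the function name and the toy values ($j = 9$, $f = 2$, $n = 3$) are assumptions.

```python
import math
from collections import Counter

def knn_decision_function(X, Y, k=3):
    """Build a function mapping an observation to a class label by majority vote of its k nearest training observations."""
    if len(X) != len(Y):
        raise ValueError("X and Y must contain the same number of observations")

    def decide(x):
        # Indices of the k training observations closest to x (Euclidean distance).
        nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))[:k]
        votes = Counter(Y[i] for i in nearest)
        return votes.most_common(1)[0][0]

    return decide

# Toy training set: j = 9 observations, f = 2 features, n = 3 classes.
X = [[0.1, 0.2], [0.0, 0.4], [0.2, 0.1],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.0],
     [4.0, 0.1], [4.1, 0.0], [3.9, 0.2]]
Y = [1, 1, 1, 2, 2, 2, 3, 3, 3]

decide = knn_decision_function(X, Y, k=3)
predictions = [decide([0.1, 0.3]), decide([2.1, 2.0]), decide([4.0, 0.0])]
print(predictions)
```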

Analysis and discussion

This section analyzes the results of the 155 selected studies connected with the previously defined research questions. Furthermore, the reviewed results of (1) commonly used microarray gene expression datasets, (2) extensively applied gene reduction algorithms, (3) broadly developed gene selection algorithms, and (4) frequently applied ML algorithms are presented and discussed in the following subsections.

Challenges and future directions

This section outlines future research directions that can be followed to address the current limitations of microarray gene expression datasets, enhance the efficiency of ML algorithms, and classify and detect cancer types with great precision. Despite the positive results reported in the reviewed literature, some limitations and challenges must be overcome. Therefore, this review highlights essential difficulties and inherent trends, directions for future research, and challenges in the following

Conclusion

Microarray gene expression datasets have an excessive number of dimensions with few samples. Our review concludes that data dimensionality significantly influences a diagnostic system’s accuracy in many circumstances. Therefore, effective algorithms are required to handle these datasets by decreasing redundancy and dependency and preserving informative genes. This review examined studies on gene expression microarray datasets from 2010 to 2022 and data reduction algorithms used to eliminate

CRediT authorship contribution statement

Sarah Osama: Conceptualization, Methodology, Formal analysis, Resources, Writing – original draft, Writing – review & editing. Hassan Shaban: Supervision, Writing – review & editing. Abdelmgeid A. Ali: Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (203)

Abdulla, M., et al. (2020). G-forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artificial Intelligence in Medicine.
Alharthi, A.M., et al. (2021). Gene selection and classification of microarray gene expression data based on a new adaptive L1-norm elastic net penalty. Informatics in Medicine Unlocked.
Alshamlan, H.M. (2018). Co-ABC: Correlation artificial bee colony algorithm for biomarker gene discovery using gene expression profile. Saudi Journal of Biological Sciences.
Alshamlan, H.M., et al. (2015). Genetic bee colony (GBC) algorithm: A new gene selection method for microarray cancer classification. Computational Biology and Chemistry.
Alzaqebah, M., et al. (2021). Memory based cuckoo search algorithm for feature selection of gene expression dataset. Informatics in Medicine Unlocked.
Aziz, R., et al. (2017). A novel approach for dimension reduction of microarray. Computational Biology and Chemistry.
Bolón-Canedo, V., et al. (2012). An ensemble of filters and classifiers for microarray data classification. Pattern Recognition.
Burczynski, M.E., et al. (2006). Molecular classification of Crohn’s disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics.
Chen, K.-H., et al. (2014). Applying particle swarm optimization-based decision tree classifier for cancer classification on gene expression data. Applied Soft Computing.
Chiaretti, S., et al. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood.
Chuang, L.-Y., et al. (2011). A hybrid feature selection method for DNA microarray data. Computers in Biology and Medicine.
Dabba, A., et al. (2021). Gene selection and classification of microarray data method based on mutual information and moth flame algorithm. Expert Systems with Applications.
Das, A.K., et al. (2017). Ensemble feature selection using bi-objective genetic algorithm. Knowledge-Based Systems.
Dash, R. (2021). An adaptive harmony search approach for gene selection and classification of high dimensional medical data. Journal of King Saud University-Computer and Information Sciences.
Dash, R., et al. (2016). Pipelining the ranking techniques for microarray data classification: A case study. Applied Soft Computing.
Dashtban, M., et al. (2017). Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics.
Dashtban, M., et al. (2018). Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics.
Ebrahimpour, M.K., et al. (2017). Ensemble of feature selection methods: A hesitant fuzzy sets approach. Applied Soft Computing.
Elyasigomari, V., et al. (2015). Cancer classification using a novel gene selection approach by means of shuffling based on data clustering with optimization. Applied Soft Computing.
Engreitz, J.M., et al. (2010). Independent component analysis: mining microarray data for fundamental human gene expression modules. Journal of Biomedical Informatics.
Gangavarapu, T., et al. (2019). A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Applied Soft Computing.
Gao, X., et al. (2018). A novel effective diagnosis model based on optimized least squares support machine for gene microarray. Applied Soft Computing.
Ghosh, M., et al. (2019). Recursive memetic algorithm for gene selection in microarray data. Expert Systems with Applications.
Grisci, B.I., et al. (2019). Neuroevolution as a tool for microarray gene expression pattern identification in cancer research. Journal of Biomedical Informatics.
Guo, S., et al. (2017). A L1-regularized feature selection method for local dimension reduction on microarray data. Computational Biology and Chemistry.
Han, X., et al. (2014). Feature subset selection by gravitational search algorithm optimization. Information Sciences.
Han, X.H., et al. (2019). A hybrid cancer classification model based recursive binary gravitational search algorithm in microarray data. Procedia Computer Science.
Jain, I., et al. (2018). Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Applied Soft Computing.
Kanehisa, M. (1997). A database for post-genome analysis. Trends in Genetics: TIG.
Kang, C., et al. (2019). Feature selection and tumor classification for microarray data using relaxed lasso and generalized multi-class support vector machine. Journal of Theoretical Biology.
Abinash, M., et al. A study on wrapper-based feature selection algorithm for leukemia dataset.
Al-Baity, H.H., et al. (2021). A new optimized wrapper gene selection method for breast cancer prediction. CMC-Computers Materials & Continua.
Al-Obeidat, F., et al. (2020). Gene encoder: A feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Computing and Applications.
Algamal, Z.Y., et al. (2019). A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Advances in Data Analysis and Classification.
Alkuhlani, A., et al. (2017). Multistage feature selection approach for high-dimensional cancer data. Soft Computing.
Allam, M., et al. (2018). Optimal feature selection using binary teaching learning based optimization algorithm. Journal of King Saud University-Computer and Information Sciences.
Almugren, N., et al. FF-SVM: new firefly-based gene selection algorithm for microarray cancer classification.
Almugren, N., et al. (2019). New bio-marker gene discovery algorithms for cancer gene expression profile. IEEE Access.
Almugren, N., et al. (2019). A survey on hybrid feature selection methods in microarray gene expression data for cancer classification. IEEE Access.
Alomari, O.A., et al. A hybrid filter-wrapper gene selection method for cancer classification.
Alomari, O.A., et al. (2018). A novel gene selection method using modified MRMR and hybrid bat-inspired algorithm with $\beta$-hill climbing. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies.
Alon, U., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences.
Alrefai, N., et al. (2022). Optimized feature selection method using particle swarm intelligence with ensemble learning for cancer classification based on microarray datasets. Neural Computing and Applications.
Alshamlan, H., et al. (2015). mRMR-ABC: A hybrid gene selection algorithm for cancer classification using microarray gene expression profiling. Biomed Research International.
Alweshah, M., et al. (2020). The monarch butterfly optimization algorithm for solving feature selection problems. Neural Computing and Applications.
Alzubi, R., et al. (2017). A hybrid feature selection method for complex diseases SNPs. IEEE Access.
Annavarapu, C.S.R., et al. (2021). Clustering-based hybrid feature selection approach for high dimensional microarray data. Chemometrics and Intelligent Laboratory Systems.
Aziz, R., et al. (2017). Artificial neural network classification of microarray data using new hybrid gene selection method. International Journal of Data Mining and Bioinformatics.
Aziz, R., et al. (2018). Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction. Annals of Data Science.
Badaoui, F., et al. (2017). Dimensionality reduction and class prediction algorithm with application to microarray big data. Journal of Big Data.

Cited by (18)

Pattern recognition frequency-based feature selection with multi-objective discrete evolution strategy for high-dimensional medical datasets
2024, Expert Systems with Applications
Feature selection has a prominent role in high-dimensional datasets to increase classification accuracy, decrease the learning algorithm computational time, and present the most informative features to decision-makers. This paper proposes a two-stage hybrid feature selection for high-dimensional medical datasets: Maximum Pattern Recognition - Multi-objective Discrete Evolution Strategy (MPR-MDES). MPR is a rapid filter ranker that significantly outperforms existing frequency-based rankers in recognizing non-linear patterns, effectively eliminating a majority of non-informative features. Then, the wrapper Multi-objective Discrete Evolution Strategy (MDES) uses the remaining features and obtains sets of solutions which are automatically presented to decision-makers. The experiments conducted on large medical datasets demonstrate that MPR-MDES achieves considerable improvements compared to state-of-the-art methods, in terms of both classification accuracy and dimensionality reduction. In this sense, the proposal successfully performs when presenting informative feature sets to decision-makers. The implementation is available on https://github.com/KhaosResearch/MPR-MDES.
A dynamic multiple classifier system using graph neural network for high dimensional overlapped data
2024, Information Fusion
Dynamic selection techniques select a subset of the classifiers from a pool according to their perceived competence in labeling each given query instance in particular. To do so, most techniques rely on the locality assumption for the selection task, meaning that similar instances should share a set of adequate classifiers, so their competencies are usually estimated over a local region surrounding the query. However, as the local distribution is crucial to these techniques, a poor region definition due to the presence of high dimensionality and class overlap can have a negative impact on their performance, thus limiting their application. Thus, we propose in this work a dynamic selection technique to better deal with sparse and overlapped data in which the instance–instance and the classifier–classifier relationships are leveraged to learn the dynamic classifier combination rule. The proposed technique uses a multi-label graph neural network as a meta-learner, so both the data modeled as a graph, without directly defining the local region, and the classifiers’ inter-dependencies modeled in the meta-labels are used to learn an embedded space where the dynamic selection task is more straightforward. Experimental results over 35 high dimensional datasets show that the proposed method significantly outperforms the static selection baseline and most evaluated dynamic selection techniques when using a diverse ensemble. Moreover, the proposed technique surpassed the contending state-of-the-art techniques over the problems with the highest excess of incompetent classifiers in overlap regions, further suggesting its suitability to deal with challenging local distributions. Code available at: github.com/marianaasouza/gnn_des.
Advancing gene feature selection: Comprehensive learning modified hunger games search for high-dimensional data
2024, Biomedical Signal Processing and Control
Gene selection eliminates redundant or duplicate information to optimize computational resources and improve classification accuracy. In this research, a novel gene selection technique called CLMHGS is introduced, which utilizes the hunger games search (HGS) framework along with an comprehensive learning strategy and a modified approach food strategy to gene selection. The primary objective of CLMHGS is to achieve global optimization. Additionally, its binary counterpart, bCLMHGS, has been employed to address the challenge of feature selection. To prove the efficacy of CLMHGS, it is compared to HGS, single strategy embedded HGS, eight recent algorithms, and seven advanced algorithms on the IEEE CEC 2017 suite. Besides, CLMHGS outperforms many CEC winners and highly effective differential evolution (DE)-based approaches on benchmark functions. Moreover, based on classification accuracy, the quantity of selected features, fitness values, and execution time, the experimental findings indicate that bCLMHGS outperforms both bHGS and the majority of other feature selection algorithms across 14 public datasets. Thus, bCLMHGS can be an excellent optimizer and a powerful wrapper-mode feature selection method.
Gene selection and tumor identification based on a hybrid of the multi-filter embedded recursive mountain gazelle algorithm
2023, Computers in Biology and Medicine
Microarray gene expression data are useful for identifying gene expression patterns associated with cancer outcomes; however, their high dimensionality make it difficult to extract meaningful information and accurately classify tumors. Hence, developing effective methods for reducing dimensionality while preserving relevant information is a crucial task. Hybrid-based gene selection methods are widely proposed in the gene expression analysis domain and can still be enhanced in terms of efficiency and reliability. This study proposes a new hybrid-based gene selection method, called multi-filter embedded mountain gazelle optimizer (MUL-MGO), which utilizes two filters and an embedded method to remove irrelevant genes, followed by selecting the most relevant genes using recently developed MGO algorithm. To the best of our knowledge, this is the first work to exploit MGO as a gene or feature selection method. A new version of MGO, called recursive mountain gazelle optimizer (RMGO), which implements MGO algorithm recursively to avoid local optima, minimize search space, and obtain minimum gene count without decreasing the classifier’s performance, is developed. The proposed RMGO is used to develop a new hybrid gene selection method employing similar filters and embedded methods as MUL-MGO, but with a recursive MGO algorithm version. The resulting method is called multi-filter embedded recursive mountain gazelle optimizer (MUL-RMGO). Several classifiers are used for cancer classification. Accordingly, several experimental studies are performed on eight microarray gene expression datasets to demonstrate the proficiencies of MUL-MGO and MUL-RMGO methods. The experimental findings indicate the efficiency and productivity of the suggested MUL-MGO and MUL-RMGO methods for gene selection. The methods outperform cutting-edge methods in the literature, with MUL-RMGO exceeding MUL-MGO in terms of accuracy and selected gene count.
Metal/covalent-organic framework-based biosensors for nucleic acid detection
2023, Coordination Chemistry Reviews
The detection of nucleic acid marker is essential for accurate and timely diagnosis and prognosis of different types of diseases. However, the sensitivity and specificity of the assay still need to be improved. Due to the highly ordered porous structure, adjustable pore size, and large specific surface area, metal/covalent-organic frameworks (MOFs/COFs) can construct multifunctional sensors, which show significant advantages of adsorption, fluorescence quenching/emission, and electrocatalysis in nucleic acid detection. This review provides an overview of the design and construction of biosensors using MOFs/COFs for nucleic acid detection. The interaction between MOFs/COFs and nucleic acids and the detection of ctDNAs, microRNAs, mRNAs, genes, and other nucleic acid molecules based on these biosensors have been comprehensively discussed. In addition, the successful cases proposed by our and other groups have been analyzed in detail. The focus is on the rational design of biosensors by taking advantage of the unique structures and optical properties of MOFs/COFs to improve the sensitivity and selectivity of nucleic acid detection. We hope that the perspectives and insights provided in this review will promote the exploration of further potential applications of MOFs and/or COFs in biomarker sensing and imaging.
Dispersed differential hunger games search for high dimensional gene data feature selection
2023, Computers in Biology and Medicine
The realms of modern medicine and biology have provided substantial data sets of genetic roots that exhibit a high dimensionality. Clinical practice and associated processes are primarily dependent on data-driven decision-making. However, the high dimensionality of the data in these domains increases the complexity and size of processing. It can be challenging to determine representative genes while reducing the data's dimensionality. A successful gene selection will serve to mitigate the computing costs and refine the accuracy of the classification by eliminating superfluous or duplicative features. To address this concern, this research suggests a wrapper gene selection approach based on the HGS, combined with a dispersed foraging strategy and a differential evolution strategy, to form a new algorithm named DDHGS. Introducing the DDHGS algorithm to the global optimization field and its binary derivative bDDHGS to the feature selection problem is anticipated to refine the existing search balance between explorative and exploitative cores. We assess and confirm the efficacy of our proposed method, DDHGS, by comparing it with DE and HGS combined with a single strategy, seven classic algorithms, and ten advanced algorithms on the IEEE CEC 2017 test suite. Furthermore, to further evaluate DDHGS' performance, we compare it with several CEC winners and DE-based techniques of great efficiency on 23 popular optimization functions and the IEEE CEC 2014 benchmark test suite. The experimentation asserted that the bDDHGS approach was able to surpass bHGS and a variety of existing methods when applied to fourteen feature selection datasets from the UCI repository. The metrics measured--classification accuracy, the number of selected features, fitness scores, and execution time--all showed marked improvements with the use of bDDHGS. Considering all results, it can be concluded that bDDHGS is an optimal optimizer and an effective feature selection tool in the wrapper mode.



