A differential evolution based feature combination selection algorithm for high-dimensional data
Introduction
High-dimensional data are becoming increasingly common in many fields, and one of the most active areas of high-dimensional data analysis is genome-wide association studies (GWAS) [5], [19], [46]. GWAS aim to detect associations between single nucleotide polymorphisms (SNPs) and traits such as major human diseases. Although GWAS have succeeded in identifying single-locus SNPs associated with Mendelian diseases, detecting susceptibility SNPs related to complex diseases faces significant obstacles. A prime reason is that the onset of many complex diseases is affected by interactions between SNPs. Such interactive effects are called epistasis or epistatic interactions [36], [6], [43]. Understanding disease-associated epistatic interactions is valuable for studying the pathogenesis of complex diseases, for disease prevention, and for biomedical research [21], [18], [31].
A GWAS dataset contains a vast number of candidate SNP combinations, but only a small fraction are associated with the disease. In this work, the problem of identifying disease-associated combinations is treated as a feature combination selection problem. However, selecting feature combinations from high-dimensional data poses serious computational challenges. For instance, it takes approximately 1.25 × 10^9 statistical tests to examine all pairwise combinations of 50,000 SNPs genotyped in thousands of samples.
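The size of this search space follows from the binomial coefficient. The short check below is a sketch that uses only the 50,000-SNP figure from the text; the function name `n_combinations` is a hypothetical helper, not part of the authors' code.

```python
from math import comb


def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP combinations of the given order (n choose k)."""
    return comb(n_snps, order)


# Pairwise (order-2) combinations among 50,000 SNPs.
pairwise = n_combinations(50_000, 2)
print(f"pairwise tests: {pairwise:,}")

# Third-order combinations grow far faster, which is why exhaustive
# enumeration quickly becomes impractical.
third_order = n_combinations(50_000, 3)
print(f"third-order tests: {third_order:,}")
```

Each combination would need at least one statistical test, so the count is a lower bound on the work of an exhaustive search.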
Several search algorithms can currently select feature combinations [16], [24]. According to their optimization strategies, these methods can be divided into four categories: exhaustive search algorithms, machine learning-based algorithms, stochastic search algorithms, and evolutionary algorithms. Stochastic search algorithms and evolutionary algorithms can handle high-dimensional data, but usually perform poorly in terms of accuracy and stability. A more efficient and stable search algorithm is therefore highly desirable.
Differential evolution (DE) [25], [4] is an efficient population-based heuristic search algorithm. DE solves an optimization problem through cooperation and competition among the individuals of a population, and is essentially an evolutionary algorithm that preserves the best solutions found. The algorithm is easy to implement and has shown excellent performance on various problems, such as feature selection [44], numerical optimization [30], and ordering problems [1]. However, it is difficult to select disease-associated combinations using the traditional DE algorithm.
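For orientation, the classical continuous DE scheme (often written DE/rand/1/bin) can be sketched as follows. This is the textbook variant, not the algorithm proposed in this work; the parameter names `F` and `CR`, their default values, and the `sphere` test function are assumptions chosen for illustration.

```python
import random


def differential_evolution(f, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=200, seed=0):
    """Minimise f over box bounds with classical DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [f(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Cooperation: build a mutant from three other individuals.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated coordinate
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][j]
                trial.append(v)
            # Competition: the trial replaces its parent only if not worse.
            f_trial = f(trial)
            if f_trial <= fitness[i]:
                pop[i], fitness[i] = trial, f_trial
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


def sphere(x):
    """Simple convex test function with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_f = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
print(best_x, best_f)
```

The one-to-one parent-versus-trial selection means the best fitness in the population never gets worse from one generation to the next.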
To make DE select feature combinations from high-dimensional data more effectively, a novel differential evolution algorithm is proposed in this work. The proposed algorithm uses a binary space partitioning (BSP) tree to memorize the search history [3], [42]. The search history records the positions of evaluated solutions, so the proposed algorithm can avoid re-evaluating an already visited solution and instead guide it to an unvisited position. We call the proposed algorithm search-history-guided differential evolution (HGDE). HGDE offers three main advantages (the source code of the proposed algorithm is available online at https://sourceforge.net/projects/hgde/).
- (1) It is a fast method that does not need to evaluate all feature combinations.
- (2) It can remove duplicates to maintain the diversity of the population, thereby reducing the probability of falling into local optima.
- (3) It suggests the search direction by using the search history. If an individual is duplicated in the population, the search history can guide the duplicate to find an unvisited position.
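The idea of using a search history to reject duplicates and redirect them can be illustrated with a much simpler structure than the BSP tree used by HGDE. The sketch below substitutes a plain hash set of visited SNP combinations and a random one-SNP substitution for the tree-guided move; the class `SearchHistory`, the method `claim`, and the parameter `max_tries` are hypothetical names introduced only for this illustration.

```python
import random


class SearchHistory:
    """Set-based stand-in for a search history (not a BSP tree)."""

    def __init__(self, n_snps, seed=0):
        self.n_snps = n_snps
        self.visited = set()
        self.rng = random.Random(seed)

    @staticmethod
    def key(combo):
        # Order does not matter for a SNP combination, so sort it.
        return tuple(sorted(combo))

    def claim(self, combo, max_tries=1000):
        """Record and return combo if unvisited, else a nearby unvisited one.

        Returns None if no unvisited combination is found within max_tries.
        """
        candidate = self.key(combo)
        tries = 0
        while candidate in self.visited and tries < max_tries:
            mutated = list(candidate)
            pos = self.rng.randrange(len(mutated))
            new_snp = self.rng.randrange(self.n_snps)
            if new_snp not in mutated:  # keep SNPs within a combination distinct
                mutated[pos] = new_snp
                candidate = self.key(mutated)
            tries += 1
        if candidate in self.visited:
            return None
        self.visited.add(candidate)
        return candidate


history = SearchHistory(n_snps=10)
first = history.claim((3, 7))   # unvisited, so it is accepted as-is
second = history.claim((7, 3))  # same combination, so it is redirected
print(first, second)
```

A hash set answers only "has this been evaluated?"; a spatial structure such as a BSP tree can additionally indicate where unvisited regions lie, which is the role the search history plays in HGDE.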
Related work
Existing feature combination selection algorithms can be roughly divided into four categories: exhaustive search algorithms, machine learning-based algorithms, stochastic search algorithms, and evolutionary algorithms.
Exhaustive search algorithms enumerate all possible SNP combinations to identify disease-associated combinations. Multifactor dimensionality reduction (MDR) [22] and Boolean operation-based screening and testing (BOOST) [35] are two classical exhaustive algorithms. These methods
Problem definition
In the GWAS field, SNPs are bi-allelic markers. As a general rule, a capital letter (A, B, …) is used to indicate a major allele and a lower-case letter (a, b, …) is used to indicate a minor allele. There are three kinds of genotype combinations: the homozygous major genotype (AA), the heterozygous genotype (Aa), and the homozygous minor genotype (aa). Common practice in coding the genotype data is to express {AA, Aa, aa} as {0, 1, 2}.
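Under this coding, the numeric value equals the number of minor alleles in the genotype. A minimal sketch of the encoding, assuming genotypes are given as two-letter strings in the capital/lower-case convention described above (the function name `encode_genotype` is hypothetical):

```python
def encode_genotype(genotype: str) -> int:
    """Map a bi-allelic genotype such as 'AA', 'Aa' or 'aa' to 0, 1 or 2.

    The code is the count of minor (lower-case) alleles.
    """
    if (len(genotype) != 2 or not genotype.isalpha()
            or genotype[0].lower() != genotype[1].lower()):
        raise ValueError(f"not a bi-allelic genotype: {genotype!r}")
    return sum(ch.islower() for ch in genotype)


# One row per sample, one column per SNP.
samples = [
    ["AA", "Bb", "cc"],
    ["Aa", "BB", "Cc"],
]
encoded = [[encode_genotype(g) for g in row] for row in samples]
print(encoded)  # [[0, 1, 2], [1, 0, 1]]
```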
It is first assumed herein that there are N samples (including cases
HGDE framework
HGDE is a population-based algorithm that selects disease-associated feature combinations in a discrete domain. The algorithm works through a cyclic process, as presented in Algorithm 1.
Algorithm 1 HGDE
Input: dataset; number of individuals; user-specified statistical significance threshold
Output: all recorded solutions
1: Initialize a population.
2: Obtain the current best individual.
3: while the stopping criterion is not satisfied do
4: for each individual in the population do
5:
Experiments
The performance of HGDE was evaluated using both synthetic and real datasets. On synthetic datasets, HGDE was compared with five state-of-the-art algorithms in terms of detection power, running time, and stability. For the real-world biological dataset, the six compared algorithms were run on the age-related macular degeneration (AMD) dataset [14]. All algorithms were implemented in Java, and the experiments were run on a Windows 10 system with an Intel(R) Core(TM) i7-8550U CPU and 16 GB
Conclusions
In this work, a simple but effective algorithm called HGDE is proposed to select feature combinations from high-dimensional data. HGDE can remove duplicates to maintain population diversity and suggest the search direction through guidance from the search-history strategy, thereby reducing the probability of falling into local optima. HGDE is evaluated by comparison with CINOEDV, IEACO, epiACO, FHSA-SED, and DESeeker on synthetic datasets. The results show that the performance of the proposed
CRediT authorship contribution statement
Boxin Guan: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Yuhai Zhao: Data curation, Writing - original draft, Supervision. Ying Yin: Visualization, Investigation. Yuan Li: Supervision, Software, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Natural Science Foundation Program of China under Grant 61772124 and the start up of North China University of Technology.
References (46)
- et al. Comparative oncogenomics identifies NEDD9 as a melanoma metastasis gene. Cell (2006)
- et al. The immunohistochemical overexpression of ribonucleotide reductase regulatory subunit M1 (RRM1) protein is a predictor of shorter survival to gemcitabine-based chemotherapy in advanced non-small cell lung cancer (NSCLC). Lung Cancer (2010)
- et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics (2001)
- et al. epiACO – a method for identifying epistasis based on ant colony optimization algorithm. BioData Mining (2017)
- et al. BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics (2010)
- et al. Histone demethylases KDM4A and KDM4C regulate differentiation of embryonic stem cells to endothelial cells. Stem Cell Reports (2015)
- et al. Variable neighborhood algebraic differential evolution: an application to the linear ordering problem with cumulative costs. Information Sciences (2019)
- A tutorial on statistical methods for population association studies. Nature Reviews Genetics (2006)
- et al. An evolutionary algorithm that makes decision based on the entire previous search history. IEEE Transactions on Evolutionary Computation (2011)
- et al. Differential evolution: a survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation (2011)
- Progress and challenges in genome-wide association studies in humans. Nature
- Are interactions between cis-regulatory variants evidence for biological epistasis or statistical artifacts? The American Journal of Human Genetics
- A new heuristic optimization algorithm: harmony search. Simulation
- Self-adjusting ant colony optimization based on information entropy for detecting epistatic interactions. Genes
- DESeeker: detecting epistatic interactions using a two-stage differential evolution algorithm. IEEE Access
- Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinformatics
- MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics
- Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks
- Complement factor H polymorphism in age-related macular degeneration. Science
- An overview of SNP interactions in genome-wide association studies. Briefings in Functional Genomics
- TRM: a powerful two-stage machine learning approach for identifying SNP-SNP interactions. Annals of Human Genetics
- Detecting high-order SNP interactions based on pairwise SNP combinations. Genes
- The new NHGRI-EBI catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research