Elsevier

Information Sciences

Volume 547, 8 February 2021, Pages 870-886
A differential evolution based feature combination selection algorithm for high-dimensional data

https://doi.org/10.1016/j.ins.2020.08.081

Abstract

Feature combination selection is used in object classification to select complementary features that can produce a powerful combination. One active area of selecting feature combinations is genome-wide association studies (GWAS). However, selecting feature combinations from high-dimensional GWAS data faces a serious issue of high computational complexity. In this paper, a fast evolutionary optimization method named search-history-guided differential evolution (HGDE) is proposed to deal with the problem. This method applies the search history memorized in a binary space partitioning tree to enhance its power for selecting feature combinations. We perform a comparative study on the proposed HGDE algorithm and other state-of-the-art algorithms using synthetic datasets, and later employ the HGDE algorithm in experiments on a real age-related macular degeneration dataset. The experimental results show that this proposed algorithm has superior performance in the selection of feature combinations. Moreover, the results provide a reference for studying the functional mechanisms of age-related macular degeneration.

Introduction

High-dimensional data are becoming increasingly common in various fields, and one of the most active areas of high-dimensional data analysis is genome-wide association studies (GWAS) [5], [19], [46]. GWAS aim to detect associations between single nucleotide polymorphisms (SNPs) and traits such as major human diseases. Although GWAS have had some success in identifying single-locus SNPs associated with Mendelian diseases, detecting susceptibility SNPs related to complex diseases faces significant obstacles. One of the main reasons is that the onset of many complex diseases is affected by interactions between SNPs. Such interactive effects of SNPs are called epistasis or epistatic interactions [36], [6], [43]. Understanding disease-associated epistatic interactions is of significant value for studying the pathogenesis of complex diseases, for disease prevention, and for biomedical research [21], [18], [31].

A GWAS dataset contains thousands of SNP combinations, but only a small fraction are associated with the disease. In this work, the problem of identifying disease-associated combinations is treated as a feature combination selection problem. However, selecting feature combinations from high-dimensional data faces serious computational challenges. For instance, testing all pairwise combinations of 50,000 SNPs genotyped in thousands of samples takes approximately 1.25×10^9 statistical tests.
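The figure of roughly 1.25×10^9 tests follows from counting unordered SNP pairs. A minimal sketch of that arithmetic in Python (the function name `pairwise_tests` is ours, introduced only for illustration):

```python
from math import comb


def pairwise_tests(num_snps: int, order: int = 2) -> int:
    """Number of distinct SNP combinations of the given order.

    With order=2 this is the number of unordered SNP pairs,
    i.e. one statistical test per pair.
    """
    return comb(num_snps, order)


n_pairs = pairwise_tests(50_000)
print(f"{n_pairs:,}")  # 1,249,975,000, i.e. about 1.25e9
```

Raising `order` to 3 shows how quickly the count grows for higher-order combinations, which is why exhaustive evaluation becomes impractical.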

Several search algorithms are currently available for selecting feature combinations [16], [24]. According to their optimization strategies, these methods can be divided into four categories: exhaustive search algorithms, machine learning-based algorithms, stochastic search algorithms, and evolutionary algorithms. Stochastic search algorithms and evolutionary algorithms can handle high-dimensional data, but usually perform poorly in terms of accuracy and stability. Therefore, a more efficient and stable search algorithm is highly desirable.

Differential evolution (DE) [25], [4] is an efficient population-based heuristic search algorithm. The DE algorithm solves an optimization problem through cooperation and competition among individuals in a population, and is essentially an evolutionary algorithm combined with the idea of optimal preservation. The algorithm is easy to implement and has shown excellent performance in solving various problems, such as feature selection [44], numerical optimization [30], and ordering problems [1]. However, it is difficult to select disease-associated combinations using the traditional DE algorithm.
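For readers unfamiliar with DE, the following is a minimal sketch of the classic continuous DE scheme (commonly called DE/rand/1/bin), not the HGDE algorithm proposed here. The parameter names `f` and `cr`, their default values, and the sphere test function are illustrative assumptions. "Cooperation" appears in the mutation step, which combines three other individuals, and "optimal preservation" appears in the greedy selection step, which never replaces an individual with a worse trial.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=200, seed=0):
    """Minimize `objective` over box `bounds` with classic DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated coordinate
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    # Binomial crossover takes the mutant coordinate.
                    value = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    value = min(max(value, lo), hi)  # clip to the box
                else:
                    value = pop[i][j]
                trial.append(value)
            # Greedy selection: keep the trial only if it is not worse.
            trial_fit = objective(trial)
            if trial_fit <= fitness[i]:
                pop[i], fitness[i] = trial, trial_fit
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]


def sphere(x):
    """Simple convex test function with minimum 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_f = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
```

This sketch updates the population in place within a generation; a synchronous variant that builds the next generation separately is equally common.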

To make the DE algorithm select feature combinations from high-dimensional data more effectively, a novel differential evolution algorithm is proposed in this work. The proposed algorithm uses a binary space partitioning (BSP) tree, which can memorize the search history [3], [42]. The search history records the positions of the solutions that have already been evaluated, so the proposed algorithm can avoid re-evaluating a visited solution and can guide a duplicate solution to an unvisited position. We call the proposed algorithm search-history-guided differential evolution (HGDE). HGDE has advantages in three aspects (the source code of the proposed algorithm is available online at https://sourceforge.net/projects/hgde/).

  • (1) It is a fast method that does not need to evaluate all feature combinations.

  • (2) It can remove duplicates to maintain the diversity of the population, thereby reducing the probability of falling into local optima.

  • (3) It suggests the search direction by using the search history. If an individual is duplicated in the population, the search history can guide the duplicate to find an unvisited position.
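The role of the search history in points (2) and (3) can be sketched as follows. This is a deliberately simplified illustration, not the authors' method: the paper stores the history in a BSP tree, whereas this sketch substitutes a plain Python `set`, and the way a duplicate is moved (replacing one randomly chosen SNP index) is our assumption. The names `make_unvisited`, `num_snps`, and `max_tries` are hypothetical.

```python
import random


def make_unvisited(candidate, visited, num_snps, rng, max_tries=1000):
    """Return an SNP combination that has not been evaluated before.

    `candidate` is a tuple of SNP indices. If it is already in `visited`,
    one randomly chosen index is replaced until an unvisited combination
    is found (assumed neighbourhood rule, not the paper's BSP-tree rule).
    Returns None if no unvisited combination is found within `max_tries`.
    """
    combo = tuple(sorted(candidate))  # order of SNPs does not matter
    tries = 0
    while combo in visited and tries < max_tries:
        current = list(combo)
        pos = rng.randrange(len(current))
        replacement = rng.randrange(num_snps)
        if replacement not in current:  # keep SNP indices distinct
            current[pos] = replacement
            combo = tuple(sorted(current))
        tries += 1
    if combo in visited:
        return None
    visited.add(combo)  # record the position so it is never re-evaluated
    return combo


rng = random.Random(1)
visited = set()
first = make_unvisited((3, 7), visited, num_snps=10, rng=rng)
second = make_unvisited((7, 3), visited, num_snps=10, rng=rng)  # duplicate
```

Here `first` is `(3, 7)`, while `second` is redirected to a different, previously unvisited pair, so each recorded combination is evaluated only once.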

Section snippets

Related work

Existing feature combination selection algorithms can be roughly divided into four categories: exhaustive search algorithms, machine learning-based algorithms, stochastic search algorithms, and evolutionary algorithms.

Exhaustive search algorithms enumerate all possible SNP combinations to identify disease-associated combinations. Multifactor dimensionality reduction (MDR) [22] and Boolean operation-based screening and testing (BOOST) [35] are two classical exhaustive algorithms. These methods

Problem definition

In the GWAS field, SNPs are bi-allelic markers. As a general rule, a capital letter (A,B,…) is used to indicate a major allele and a lower-case letter (a,b,…) is used to indicate a minor allele. There are three kinds of genotype combinations: the homozygous major genotype (AA), heterozygous genotype (Aa), and homozygous minor genotype (aa). Common practice in coding the genotype data is to express {AA,Aa,aa} as {0, 1, 2}.
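As a small illustration of this coding convention, the sketch below maps a two-letter genotype string to the number of minor (lower-case) alleles it contains. The function name `encode_genotype` and the decision to accept either allele order (for example `aA`) are our assumptions.

```python
def encode_genotype(genotype: str) -> int:
    """Encode a bi-allelic genotype as its minor-allele count.

    AA -> 0 (homozygous major), Aa -> 1 (heterozygous),
    aa -> 2 (homozygous minor).
    """
    if len(genotype) != 2 or not genotype.isalpha():
        raise ValueError(f"expected two allele letters, got {genotype!r}")
    if genotype[0].lower() != genotype[1].lower():
        raise ValueError(f"alleles belong to different SNPs: {genotype!r}")
    # Lower-case letters denote the minor allele.
    return sum(1 for allele in genotype if allele.islower())


codes = [encode_genotype(g) for g in ["AA", "Aa", "aa"]]
print(codes)  # [0, 1, 2]
```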

It is first assumed herein that there are N samples (including Nd cases

HGDE framework

HGDE is a population-based algorithm that can select disease-associated feature combinations in a discrete domain. The algorithm works through a cyclic process, as presented in Algorithm 1.

Algorithm 1 HGDE
Input: dataset D; number of individuals M; user-specified statistical significance threshold α
Output: all recorded solutions
 1: Initialize a population P_G = {X_i,G | i = 1, 2, ..., M}.
 2: Obtain current best individual X_best,G.
 3: while the stopping criterion is not satisfied do
 4:  for i = 1 to M do
 5:   V_i,G =

Experiments

The performance of HGDE was evaluated using both synthetic and real datasets. On synthetic datasets, HGDE was compared with five state-of-the-art algorithms in terms of detection power, running time, and stability. For the real-world biological dataset, the six compared algorithms were run on the age-related macular degeneration (AMD) dataset [14]. All algorithms were implemented in Java, and the experiments were run on a Windows 10 system with an Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz and 16 GB

Conclusions

In this work, a simple but effective algorithm called HGDE is proposed to select feature combinations from high-dimensional data. HGDE can remove duplicates to maintain population diversity and suggest the search direction through guidance from the search-history strategy, thereby reducing the probability of falling into local optima. HGDE is evaluated by comparison with CINOEDV, IEACO, epiACO, FHSA-SED, and DESeeker on synthetic datasets. The results show that the performance of the proposed

CRediT authorship contribution statement

Boxin Guan: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Yuhai Zhao: Data curation, Writing - original draft, Supervision. Ying Yin: Visualization, Investigation. Yuan Li: Supervision, Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation Program of China under Grant 61772124 and the start up of North China University of Technology.

References (46)

  • P. Donnelly. Progress and challenges in genome-wide association studies in humans. Nature (2009)
  • A. Fish et al. Are interactions between cis-regulatory variants evidence for biological epistasis or statistical artifacts? The American Journal of Human Genetics (2016)
  • Z.W. Geem et al. A new heuristic optimization algorithm: Harmony search. Simulation (2001)
  • B. Guan et al. Self-adjusting ant colony optimization based on information entropy for detecting epistatic interactions. Genes (2019)
  • B. Guan et al. DESeeker: detecting epistatic interactions using a two-stage differential evolution algorithm. IEEE Access (2019)
  • X. Jiang et al. Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinformatics (2011)
  • P.J. Jing et al. MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics (2014)
  • J. Kennedy et al. Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks (1995)
  • R. Klein et al. Complement factor H polymorphism in age-related macular degeneration. Science (2005)
  • P. Li et al. An overview of SNP interactions in genome-wide association studies. Briefings in Functional Genomics (2014)
  • H.Y. Lin et al. TRM: a powerful two-stage machine learning approach for identifying SNP-SNP interactions. Annals of Human Genetics (2012)
  • J. Liu et al. Detecting high-order SNP interactions based on pairwise SNP combinations. Genes (2017)
  • J. MacArthur et al. The new NHGRI-EBI catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research (2016)