A deep hybrid model to detect multi-locus interacting SNPs in the presence of noise

https://doi.org/10.1016/j.ijmedinf.2018.09.003

Highlights

  • A deep hybrid method is proposed to detect multi-locus SNP interactions in high-dimensional genomic data.

  • The method combines deep neural networks and random forest.

  • The proposed method is evaluated on various simulated scenarios and real data applications in the absence and presence of noise.

  • The experimental findings show that the proposed method performs better than traditional machine learning approaches.

Abstract

Identifying genetic variants associated with complex diseases is a central focus of genome-wide association studies. These studies largely rely on univariate analysis and ignore interaction effects. It is widely accepted that the etiology of most complex diseases depends on interactions between genetic variants and/or environmental factors. Several machine learning and data mining methods have been consistently successful in exposing these interaction effects. However, despite many efforts, there has been no major breakthrough because of the biological complexities and the statistical and computational challenges facing the field of genetic epidemiology. Deep learning is an emerging machine learning approach that promises to reveal the hidden patterns of big data for accurate predictions. In this study, a deep neural network is unified with a random forest in a hybrid architecture to achieve reliable detection of multi-locus interactions between single nucleotide polymorphisms. The proposed hybrid method is evaluated on various simulated scenarios in the absence of main effects for six epistasis models. The best model with optimal hyper-parameters (selected by grid and random grid search) is chosen to enhance the power of the method by maximising the model's prediction accuracy. The performance metrics of each model are analysed for both training and validation. Further, the performance of the method is evaluated in the presence of noise due to missing data, genotyping errors, genetic heterogeneity, and phenocopy, and their combined effects. The power of the method in detecting two-locus interactions is compared with that of previous methods in the presence and absence of noise. On average, the power of the proposed method is much higher than that of the previous methods for all simulated scenarios. Finally, the findings are confirmed on chronic dialysis patient data obtained from a published study performed at the Kaohsiung Chang Gung Memorial Hospital. It is observed that the interaction between SNP 21 (2) and SNP 28 (2) in the mitochondrial D-loop carries the highest risk of disease manifestation.

Section snippets

Background

In the current era of big genome data, there has been growing interest in identifying and characterising genotype-phenotype relations to reveal susceptibility to complex diseases. A single nucleotide polymorphism (SNP) is a genetic variation caused by a change in a single nucleotide (A, T, C or G) of a DNA sequence [1]. About one SNP occurs in every 300 nucleotides, such that there are around 10 million SNPs in the human genome [2]. A number of genome-wide association studies

Methods

The main goal of the proposed method is to improve the identification of SNP interactions in high-dimensional data by enriching deep representation learning with the capability of random forest [42,43]. Fig. 1 illustrates the block diagram of the proposed hybrid method. In step one, case-control based input data are represented as n factors, whose subjects are observed by determining their exposure to a phenotype. In the multifactor combination stage, factors are combined in n-dimensional

Deep neural networks

A deep multilayered neural network is trained to detect higher-order SNP interactions as in our previous studies [39,40]. However, in this study, each layer is trained using an autoencoder for unsupervised feature learning, instead of supervised learning using multilayered perceptrons (MLPs) [36,43]. The basic computational unit of a neural network is the neuron, biologically inspired by the human brain. A multilayered neural network trained in the proposed hybrid method is illustrated in Fig. 2. It

Optimising hyper-parameters

In our previous work, a number of models were trained under various simulated scenarios with various combinations of hyper-parameters [41]. The performance of each model was compared to obtain the best model. This manual search becomes tedious as the number of network parameters increases, and reproducing the results becomes more complex. Further, choosing the configurations when dealing with high-dimensional genome data was a critical step in optimising hyper-parameters manually. Hence, an automatic
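
The difference between an exhaustive grid search and a random grid search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the search space, the parameter names, and `toy_score` (a stand-in for "train a model and return its validation accuracy") are hypothetical and are not taken from the study.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "hidden_layers": [(50, 50), (100, 100), (200, 200, 200)],
    "activation": ["tanh", "rectifier", "maxout"],
    "learning_rate": [0.001, 0.005, 0.01],
}


def grid_search(space):
    """Return every combination of hyper-parameter values (Cartesian product)."""
    names = sorted(space)
    return [dict(zip(names, values))
            for values in itertools.product(*(space[n] for n in names))]


def random_grid_search(space, max_models, seed=0):
    """Return a random subset of the grid, capped at max_models configurations."""
    grid = grid_search(space)
    rng = random.Random(seed)
    return rng.sample(grid, min(max_models, len(grid)))


def select_best(configs, score_fn):
    """Pick the configuration with the highest validation score."""
    return max(configs, key=score_fn)


def toy_score(config):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -abs(config["learning_rate"] - 0.005) + 0.001 * len(config["hidden_layers"])


full_grid = grid_search(SEARCH_SPACE)
subset = random_grid_search(SEARCH_SPACE, max_models=5)
best = select_best(full_grid, toy_score)
```

The full grid grows multiplicatively with every added hyper-parameter, which is why a random subset of the grid is often preferred when each configuration requires training a network.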

Back propagation

The weights and biases between the neurons of the DNN layers are learned from input and output samples. This learning process minimises the loss function $L_\phi$, which measures the discrepancy between the predicted output and the true output of a sample [43]:

$$W_1, b_1, \ldots, W_L, b_L = \arg\min_{W_1, b_1, \ldots, W_L, b_L} L_\phi$$

This loss function is minimised using the backpropagation algorithm [36,42], which computes the gradient of the loss function using the chain rule for derivatives. Stochastic gradient descent (SGD) computes the derivative of each parameter with
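
As a minimal sketch of the idea, the following Python code applies SGD to a single sigmoid neuron with a cross-entropy loss, which is the smallest case in which the chain rule yields the parameter gradients. The toy data, learning rate, and number of epochs are assumptions chosen for illustration; a multilayer network propagates the same kind of error term backwards through every layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, samples):
    """Mean cross-entropy loss of a single sigmoid neuron over the samples."""
    total = 0.0
    for x, y in samples:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)


def sgd(samples, n_features, learning_rate=0.1, epochs=50, seed=0):
    """Update weights and bias one sample at a time, in shuffled order."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i]
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Chain rule: for sigmoid + cross-entropy, dL/dz = p - y,
            # so dL/dw_j = (p - y) * x_j and dL/db = (p - y).
            error = p - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical, linearly separable toy data: label 1 when the first feature dominates.
DATA = [((1.0, 0.0), 1), ((0.9, 0.2), 1), ((0.1, 0.8), 0), ((0.0, 1.0), 0)]
loss_before = log_loss([0.0, 0.0], 0.0, DATA)
w, b = sgd(DATA, n_features=2)
loss_after = log_loss(w, b, DATA)
```

With all parameters at zero the neuron predicts 0.5 for every sample, so the starting loss equals ln 2; the SGD updates then move the loss below that value.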

Random forest

Random forest [44,45] is an ensemble of multiple classification and regression trees (CARTs). Each tree is grown from a bootstrap sample of the original data, using a random subset of the total number of predictor variables at each node rather than considering all possible predictor variables. This results in a forest of unpruned trees. The final prediction is obtained by aggregating the majority votes of the ensemble of trees grown using bagging (bootstrap aggregating). The observations
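
The three ingredients named above (bootstrap sampling, a random subset of predictors per node, and majority voting) can be sketched as follows. This is an illustrative outline, not a full random forest: tree growing is omitted, and the function names, sample size, and `mtry` value are assumptions. Observations that are not drawn into a tree's bootstrap sample are commonly called out-of-bag samples.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag(n_samples, in_bag):
    """Indices that were never drawn into the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


def candidate_predictors(n_predictors, mtry, rng):
    """Random subset of predictors considered at one node split."""
    return sorted(rng.sample(range(n_predictors), mtry))


def majority_vote(tree_predictions):
    """Aggregate per-tree class labels into the forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
in_bag = bootstrap_indices(100, rng)
oob = out_of_bag(100, in_bag)
subset = candidate_predictors(n_predictors=20, mtry=4, rng=rng)
forest_label = majority_vote([1, 0, 1, 1, 0])
```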

Variable importance

An important feature of RF is its ability to measure the importance of each predictor variable. Gini importance and mean decrease in accuracy (MDA) are among the popular approaches used in the literature. Although Gini importance is easy to compute, it is biased when selecting among variables with different numbers of categories [46]. Hence, in this study, the MDA measure is used to compute the importance of two-locus SNP interactions identified by the models, as reported in [13]. The unscaled MDA estimation for each tree T
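
The permutation idea behind MDA can be illustrated with a short Python sketch: shuffle one predictor, re-score the model, and record the drop in accuracy. The assumptions here are substantial and deliberate: `toy_predict` is a hypothetical rule standing in for a fitted tree, the data are randomly generated, and accuracy is measured on the whole dataset, whereas a random forest typically computes the decrease per tree on out-of-bag samples and averages over trees.

```python
import random


def accuracy(predict, rows, labels):
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def mean_decrease_accuracy(predict, rows, labels, column, rng):
    """Accuracy drop after shuffling the values of one predictor column."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + (value,) + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)


def toy_predict(row):
    """Hypothetical stand-in for a fitted tree: XOR-like rule on SNP 0 and SNP 1."""
    return int((row[0] == 2) != (row[1] == 2))


rng = random.Random(7)
# Genotypes coded 1, 2, 3 (AA, Aa, aa); SNP 2 is unrelated to the label.
rows = [tuple(rng.choice((1, 2, 3)) for _ in range(3)) for _ in range(600)]
labels = [toy_predict(row) for row in rows]
importance = [mean_decrease_accuracy(toy_predict, rows, labels, c, rng)
              for c in range(3)]
```

In this toy setting, shuffling either interacting SNP breaks the rule and lowers accuracy, while shuffling the unrelated SNP leaves the predictions unchanged, so its importance is zero.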

Interactions

Variable importance is used to rank the higher-order interacting SNPs. Highly ranked interactions are considered to be highly associated with the disease. The performance of the RF is calculated with respect to heritability to identify main and interaction effects, as reported in [45]. Heritability is a common measure of the variance in a phenotype that is attributable to genetic variation [47]. It can also measure the effect of a locus on a disease. Winham [45] reported C as a complex disease which

Multifactor combination

A SNP is a variation in a single nucleotide of a DNA sequence. Due to duplication of genes, SNPs are biallelic (A and a), with genotypes homozygous dominant (AA), heterozygous (aA\Aa), and homozygous recessive (aa). Statistically, AA, aA\Aa, and aa are represented by the values 1, 2, and 3, respectively. Hence, there are $3^k$ genotype combinations for k loci in k-dimensional space. In the case of two-locus SNPs from the pool of m factors, each factor with three genotypes creates contingency table
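
The genotype coding and the count of combinations described above can be made concrete with a small Python sketch. It enumerates the 3^k multi-locus genotype combinations and tallies cases and controls into the nine cells of a two-locus table. The six-subject dataset is hypothetical, and the class coding (1 = case, 0 = control) is assumed.

```python
from itertools import product

GENOTYPE_CODES = {"AA": 1, "Aa": 2, "aa": 3}  # coding stated in the text


def genotype_combinations(k):
    """All 3**k multi-locus genotype combinations for k biallelic loci."""
    return list(product(GENOTYPE_CODES.values(), repeat=k))


def contingency_table(snp_a, snp_b, labels):
    """Count cases and controls in each of the 9 two-locus genotype cells."""
    table = {cell: {"case": 0, "control": 0} for cell in genotype_combinations(2)}
    for a, b, y in zip(snp_a, snp_b, labels):
        table[(a, b)]["case" if y == 1 else "control"] += 1
    return table


# Tiny hypothetical dataset: six subjects, two SNPs, label 1 = case, 0 = control.
snp_a = [1, 2, 2, 3, 1, 2]
snp_b = [1, 2, 2, 1, 3, 2]
labels = [0, 1, 1, 0, 0, 0]
table = contingency_table(snp_a, snp_b, labels)
```

For two loci the table has 9 cells and for three loci 27, which shows how quickly the number of cells grows with the interaction order.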

Data input

As defined in definitions 1 and 5, case-control datasets comprise s samples with m factors and a class label, which takes either 0 (control) or 1 (case). Each factor is a SNP at a locus. Consider SNP A (AA, Aa\aA, and aa) and SNP B (BB, Bb\bB, and bb) at loci A and B, respectively, which are used to generate simulated datasets for six different two-locus epistasis models. A number of simulated datasets (in the presence and absence of noise) are generated with various scenarios for the six epistasis models. The

Evaluation

The performance of the models during training, validation, and testing is evaluated by determining each model's metrics. The performance of the models in the presence of noise due to genotyping error (GE), phenocopy (PC), genetic heterogeneity (GH), and missing data (MS) is also evaluated. Training speed and the time to execute the models are evaluated by varying the width and depth of the network, along with various activation functions. The overall best model with highest prediction accuracy and lowest logloss along with the highest cross validation consistency (CVC) is
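
Two of the metrics named above, prediction accuracy and logloss, can be computed as in the following Python sketch. The 0.5 decision threshold, the clipping constant, and the example labels and probabilities are assumptions made for illustration only.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the true label."""
    predictions = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(predictions, y_true)) / len(y_true)


def logloss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical labels (1 = case, 0 = control) and predicted case probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]
acc = accuracy(y_true, y_prob)
ll = logloss(y_true, y_prob)
```

Accuracy only counts whether the thresholded prediction is right, whereas logloss also penalises confident wrong probabilities, which is one reason the two are usually reported together.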

Results

A number of simulated datasets, in the presence and absence of noise due to GE, GH, PC, and MS and their combined effects, are used to evaluate the DNN-RF method. GE at 5% is generated with overrepresentation of one allele. GH is simulated for 50% of the data with two different two-locus combinations (SNP5-SNP10 and SNP3-SNP4) to increase the risk of a disease. PC is generated for 50% of the cases, which are considered to have low risk genotypes according to the epistasis models. The cases are

Discussion and conclusions

An important confounding factor in case-control datasets is population stratification, which can lead to spurious associations between SNPs. The studies presented in this paper are restricted to homogeneous populations. Improving the ability of the proposed method to handle population stratification could be a major aspect to be considered. In GWAS, a number of approaches have been successfully implemented to detect main effects by controlling for population stratification [59]. However, only few

Authors' contributions

Suneetha Uppu made substantial contributions to the conception, design, and acquisition of datasets, and implemented and evaluated the proposed method by analysing and interpreting the results.

All the authors contributed to preparing the manuscript, interpreting the results, and revising it critically for important intellectual content.

Availability of data and materials

All data generated or analyzed during this study are included within this article. The simulated datasets are available upon request.

Acknowledgments

We would like to acknowledge the contribution of the Australian Government's Commonwealth Research Funding in supporting this research. We thank John Wallace from Ritchie Lab, Pennsylvania State University, for his expert assistance in simulating the datasets in the presence of common sources of noise for all six epistasis models described by Dr. Jason Moore. We also thank Dr. Ryan Urbanowicz, Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania for

References (68)

  • National Library of Medicine (US). Genetics Home Reference [Internet], Vol. 2013 Sep 16 (2016)
  • L. Padyukov. Between the Lines of Genetic Code: Genetic Interactions in Understanding Disease and Complex Phenotypes (2013)
  • H.J. Cordell. Detecting gene–gene interactions that underlie human diseases. Nat. Rev. Genet. (2009)
  • C.L. Koo et al. A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology. Biomed. Res. Int. (2013)
  • S. Uppu et al. A review on methods for detecting SNP interactions in high-dimensional genomic data. IEEE/ACM Trans. Comput. Biol. Bioinform. (2018)
  • Y. Chung et al. Odds ratio based multifactor-dimensionality reduction method for detecting gene–gene interactions. Bioinformatics (2007)
  • M.L. Calle et al. MB-MDR: Model-Based Multifactor Dimensionality Reduction for Detecting Interactions in High-Dimensional Genomic Data (2008)
  • J. Gui et al. A robust multifactor dimensionality reduction method for detecting gene–gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann. Hum. Genet. (2011)
  • D.F. Schwarz et al. On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics (2010)
  • R. Jiang et al. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinform. (2009)
  • L. De Lobel et al. A screening methodology based on random forests to improve the detection of gene–gene interactions. Eur. J. Hum. Genet. (2010)
  • M. Yoshida et al. SNPInterForest: a new method for detecting epistatic interactions. BMC Bioinform. (2011)
  • C. Yang et al. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics (2009)
  • X. Zhang et al. TEAM: efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics (2010)
  • H.Y. Lin et al. TRM: a powerful two-stage machine learning approach for identifying SNP-SNP interactions. Ann. Hum. Genet. (2012)
  • A.A. Motsinger et al. Comparison of neural network optimization approaches for studies of human genetics. Applications of Evolutionary Computing (2006)
  • N.E. Hardison et al. The power of quantitative grammatical evolution neural networks to detect gene-gene interactions. Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, ACM (2011)
  • S.H. Chen et al. A support vector machine approach for detecting gene-gene interaction. Genet. Epidemiol. (2008)
  • A. Özgür et al. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics (2008)
  • Y.H. Fang et al. SVM-based generalized multifactor dimensionality reduction approaches for detecting gene-gene interactions in family studies. Genet. Epidemiol. (2012)
  • H. Zhang et al. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinform. (2012)
  • H. Schwender et al. Identification of SNP interactions using logic regression. Biostatistics (2008)
  • C. Kooperberg et al. Identifying interacting SNPs using Monte Carlo logic regression. Genet. Epidemiol. (2005)
  • C.C. Chen et al. Methods for identifying SNP interactions: a review on variations of logic regression, random forest and Bayesian logistic regression. IEEE/ACM Trans. Comput. Biol. Bioinform. (2011)

Cited by (10)

    • Machine learning approaches to genome-wide association studies

      2022, Journal of King Saud University - Science
      Citation Excerpt :

      Detection of SNP interactions remains a significant problem because of the high-dimensionality of genomic data, including the GWAS datasets. This is due to such characteristics as biomolecular complexity, lack of marginal effects, missing heritability, and the limits of computational capacities (Gusareva et al., 2014; Padyukov, 2013; Uppu and Krishna, 2018). There is relative low number of genetic studies that involve people of African ancestry (Benafif et al., 2018; Gurdasani et al., 2015; Mulder et al., 2018; Radouani et al., 2020).

    • EpiHNet: Detecting epistasis by heterogeneous molecule network

      2022, Methods
      Citation Excerpt :

      SNPInterForest [22] constructs random forest by partitioning samples into different groups iteratively using SNP or SNP combinations, and each path in the tree is related to a possible interaction. Uppu et al. [23] unified deep neural network with random forest to achieve reliable detection of multi-locus interactions. However, this kind of methods usually face the challenge of biological interpretation of the analytical results.

    • Improving prediction for medical institution with limited patient data: Leveraging hospital-specific data based on multicenter collaborative research network

      2021, Artificial Intelligence in Medicine
      Citation Excerpt :

      The NDF model comprises the representation learning functionality known from neural network layers and the state-of-the-art learning performance of random forest (RF), which is an ideal ensemble learning approach for working with classification problems and easily distributable on parallel hardware. The architecture of fusing the deep learning networks and the tree-structured classifiers has become increasingly popular in biomedical data analysis due to their advantageous composition [38,39]. Both deep and shallow architectures (d/s-NDF) have been proposed and have achieved ideal performance on complex classification tasks.

    • EpiMC: Detecting Epistatic Interactions Using Multiple Clusterings

      2022, IEEE/ACM Transactions on Computational Biology and Bioinformatics