A deep hybrid model to detect multi-locus interacting SNPs in the presence of noise

doi:10.1016/j.ijmedinf.2018.09.003

International Journal of Medical Informatics

Volume 119, November 2018, Pages 134-151

Highlights

• A deep hybrid method is proposed to detect multi-locus SNP interactions in high-dimensional genomic data.
• The method combines deep neural networks and random forest.
• The proposed method is evaluated on various simulated scenarios and real data applications in the absence and presence of noise.
• The experimental findings show that the proposed method performs better than traditional machine learning approaches.

Abstract

Identifying genetic variants associated with complex diseases is a central focus of genome-wide association studies. These studies largely rely on univariate analysis and ignore interaction effects. It is widely accepted that the etiology of most complex diseases depends on interactions between genetic variants and/or environmental factors. Several machine learning and data mining methods have been successful in exposing these interaction effects. However, despite many efforts, there has been no major breakthrough, owing to the biological complexities and the statistical and computational challenges facing the field of genetic epidemiology. Deep learning is an emerging machine learning approach that promises to reveal hidden patterns in big data for accurate predictions. In this study, a deep neural network is combined with a random forest in a hybrid architecture to achieve reliable detection of multi-locus interactions between single nucleotide polymorphisms (SNPs). The proposed hybrid method is evaluated on various simulated scenarios in the absence of main effects for six epistasis models. The best model, with hyper-parameters optimised by grid and random grid search, is chosen to enhance the power of the method by maximising the model’s prediction accuracy. The performance metrics of each model are analysed for both training and validation. Further, the performance of the method is evaluated in the presence of noise due to missing data, genotyping errors, genetic heterogeneity, and phenocopy, as well as their combined effects. The power of the method in detecting two-locus interactions is compared with that of previous methods in the presence and absence of noise. On average, the power of the proposed method is much higher than that of the previous methods across all simulated scenarios. Finally, the findings are confirmed on chronic dialysis patient data obtained from a published study performed at the Kaohsiung Chang Gung Memorial Hospital. The interaction between SNP 21 (2) and SNP 28 (2) in the mitochondrial D-loop is observed to carry the highest risk of disease manifestation.

Section snippets

Background

In the current era of big genome data, there has been growing interest in identifying and characterising genotype-phenotype relations to reveal susceptibility to complex diseases. A single nucleotide polymorphism (SNP) is a genetic variation caused by a change in a single nucleotide (A, T, C, or G) of a DNA sequence [1]. About one SNP occurs in every 300 nucleotides, so that there are around 10 million SNPs in the human genome [2]. A number of genome-wide association studies

Methods

The main goal of the proposed method is to improve the identification of SNP interactions in high-dimensional data by combining deep representation learning with the capability of random forest [42,43]. Fig. 1 illustrates the block diagram of the proposed hybrid method. In step one, case-control input data are represented as n factors, whose subjects are observed by determining their exposure to a phenotype. In the multifactor combination stage, factors are combined in n-dimensional

Deep neural networks

A deep multilayered neural network is trained to detect higher-order SNP interactions, as in our previous studies [39,40]. In this study, however, each layer is trained using an autoencoder for unsupervised feature learning instead of supervised learning with multilayered perceptrons (MLPs) [36,43]. The basic computational unit of a neural network is the neuron, which is biologically inspired by the human brain. A multilayered neural network trained in the proposed hybrid method is illustrated in Fig. 2. It

Optimising hyper-parameters

In our previous work, a number of models were trained under various simulated scenarios with various combinations of hyper-parameters [41]. The performance of each model was compared to obtain the best model. Manual search becomes tedious as the number of network parameters increases, and reproducing the results becomes more complex. Further, choosing the configurations was a critical step in optimising hyper-parameters manually when dealing with high-dimensional genome data. Hence, an automatic
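
The difference between the two automatic strategies named in the abstract (grid search and random grid search) can be illustrated with a minimal sketch. The search space, its parameter names, and the `toy_score` function below are assumptions made for illustration only; in a real run, the score would be the validation accuracy of a trained model.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative, not the paper's.
SPACE = {
    "hidden_layers": [1, 2, 3],
    "units": [10, 50, 100],
    "activation": ["tanh", "rectifier"],
}

def grid_candidates(space):
    """Yield every configuration in the Cartesian product of the space."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def random_candidates(space, n, seed=0):
    """Yield n configurations sampled uniformly (with replacement) from the space."""
    rng = random.Random(seed)
    keys = sorted(space)
    for _ in range(n):
        yield {k: rng.choice(space[k]) for k in keys}

def toy_score(cfg):
    """Stand-in for validation accuracy; a real run would train a model here."""
    penalty = 0.05 if cfg["activation"] == "tanh" else 0.0
    return 0.5 + 0.1 * cfg["hidden_layers"] + 0.001 * cfg["units"] - penalty

def best_config(candidates, score):
    """Return the candidate configuration with the highest score."""
    return max(candidates, key=score)

grid_best = best_config(grid_candidates(SPACE), toy_score)
random_best = best_config(random_candidates(SPACE, n=5), toy_score)
```

Grid search evaluates all 18 configurations of this toy space, whereas random search evaluates only the 5 it samples, so it trades exhaustiveness for a fixed evaluation budget.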

Back propagation

The weights and biases between the neurons of the DNN layers are learned from input and output samples. This learning process minimises a loss function $L(φ)$, which measures the discrepancy between the predicted output and the true output of a sample [43]: $\{W^{1}, b^{1}, \dots, W^{L}, b^{L}\} = \underset{\{W^{1}, b^{1}, \dots, W^{L}, b^{L}\}}{\arg\min}\, L(φ)$

This loss function is minimised using the backpropagation algorithm [36,42], which computes the gradient of the loss function by applying the chain rule for derivatives. Stochastic gradient descent (SGD) computes the derivative of each parameter with
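
As a minimal sketch of the chain rule and an SGD update, consider a single sigmoid neuron trained with cross-entropy loss. This is a deliberate simplification and not the multilayer network of the paper; the toy data set (a logical OR), the learning rate, and the number of epochs are assumptions chosen only to make the example self-contained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(w, b, data):
    """Average cross-entropy loss over the data set."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def sgd_step(w, b, x, y, lr):
    """One SGD update for a single sigmoid neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    # Chain rule: dL/dz = (dL/dp) * (dp/dz), which simplifies to p - y
    # for a sigmoid output combined with cross-entropy loss.
    dz = p - y
    # dL/dw_i = dz * x_i and dL/db = dz.
    w = [wi - lr * dz * xi for wi, xi in zip(w, x)]
    b = b - lr * dz
    return w, b

def train(data, epochs=200, lr=0.5, seed=0):
    """Visit the samples in a new random order each epoch, one update per sample."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)
        for x, y in order:
            w, b = sgd_step(w, b, x, y, lr)
    return w, b

# Toy data (logical OR), used only to exercise the update rule.
DATA = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
initial_loss = mean_log_loss([0.0, 0.0], 0.0, DATA)
w, b = train(DATA)
final_loss = mean_log_loss(w, b, DATA)
```

In a multilayer network the same chain-rule step is applied layer by layer, propagating the error term from the output back towards the input.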

Random forest

Random forest [44,45] is an ensemble of multiple classification and regression trees (CARTs). Each tree is grown from a bootstrap sample of the original data, using a random subset of the predictor variables at each node rather than considering all possible predictor variables. This results in a forest of unpruned trees. The final prediction is obtained by aggregating the majority votes of the ensemble of trees grown using bagging (bootstrap aggregating). The observations
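
The three ingredients described above (bootstrap sampling, a random subset of predictors at each node, and majority voting) can be sketched as follows. To keep the example short, each "tree" is a one-level split (a stump) rather than an unpruned CART, so this is an assumption-laden simplification and not a full random forest; the toy data, in which only the first of three SNPs determines the class, are also hypothetical.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Sample len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, features):
    """Among the candidate features, keep the one-level split with fewest training errors."""
    best = None
    for f in features:
        votes = {}
        for x, y in rows:
            votes.setdefault(x[f], Counter())[y] += 1
        # Predict the majority class observed for each genotype value.
        rule = {g: c.most_common(1)[0][0] for g, c in votes.items()}
        errors = sum(rule[x[f]] != y for x, y in rows)
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    default = Counter(y for _, y in rows).most_common(1)[0][0]
    return best[1], best[2], default

def predict_stump(stump, x):
    f, rule, default = stump
    return rule.get(x[f], default)  # fall back to the majority class

def fit_forest(rows, n_trees, mtry, seed=0):
    """Grow n_trees stumps, each from a bootstrap sample and mtry random features."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap(rows, rng)
        features = rng.sample(range(n_features), mtry)
        forest.append(fit_stump(sample, features))
    return forest

def predict_forest(forest, x):
    """Aggregate the individual predictions by majority vote."""
    return Counter(predict_stump(s, x) for s in forest).most_common(1)[0][0]

# Toy data: three SNPs coded 1, 2, 3; only SNP 0 determines the class.
ROWS = [((g0, g1, g2), 1 if g0 == 3 else 0)
        for g0 in (1, 2, 3) for g1 in (1, 2, 3) for g2 in (1, 2, 3)]
forest = fit_forest(ROWS, n_trees=201, mtry=2)
```

An odd number of trees is used so that a binary majority vote cannot tie.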

Variable importance

An important feature of RF is its ability to measure the importance of each predictor variable. Gini importance and mean decrease in accuracy (MDA) are two popular measures used in the literature. Although Gini importance is easy to compute, it is biased when predictor variables differ in their number of categories [46]. Hence, in this study, the MDA measure is used to compute the importance of the two-locus SNP interactions identified by the models, as reported in [13]. The unscaled MDA estimation for each tree $T$
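
The idea behind MDA can be sketched as a permutation test: the accuracy of a fitted predictor is compared before and after the values of one variable are shuffled across samples. The sketch below is a simplified, assumption-based version; in a random forest the drop is computed per tree on its out-of-bag samples and then averaged, whereas here a single hypothetical `predict` function and a toy data set are used.

```python
import random

def accuracy(predict, rows):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def permutation_importance(predict, rows, feature, seed=0):
    """Accuracy drop after shuffling one feature column across samples."""
    rng = random.Random(seed)
    column = [x[feature] for x, _ in rows]
    rng.shuffle(column)
    permuted = [(x[:feature] + (v,) + x[feature + 1:], y)
                for (x, y), v in zip(rows, column)]
    return accuracy(predict, rows) - accuracy(predict, permuted)

# Toy data: two SNPs coded 1, 2, 3; only SNP 0 determines the class.
ROWS = [((g0, g1), 1 if g0 == 3 else 0)
        for g0 in (1, 2, 3) for g1 in (1, 2, 3)] * 3

def predict(x):
    """Hypothetical fitted model that has learned the true rule."""
    return 1 if x[0] == 3 else 0

importance_snp0 = permutation_importance(predict, ROWS, feature=0)
importance_snp1 = permutation_importance(predict, ROWS, feature=1)
```

A variable the model does not use, such as SNP 1 here, has an importance of exactly zero, while shuffling an informative variable reduces accuracy.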

Interactions

Variable importance is used to rank the higher-order interacting SNPs. Highly ranked interactions are considered to be highly associated with the disease. The performance of the RF in identifying main and interaction effects is calculated with respect to heritability, as reported by [45]. Heritability is a common measure of the proportion of variance in a phenotype that is attributable to genetic variation [47]. It can also measure the effect of a locus on a disease. Winham [45] reported $C$ as a complex disease which

Multifactor combination

A SNP is a variation in a single nucleotide of a DNA sequence. Due to duplication of genes, SNPs are biallelic (A and a), with genotypes that are homozygous dominant (AA), heterozygous (Aa/aA), or homozygous recessive (aa). Statistically, AA, Aa/aA, and aa are represented by the values 1, 2, and 3, respectively. Hence, there are $3^{k}$ genotype combinations for $k$ loci in $k$-dimensional space. In the case of two-locus SNPs from the pool of $m$ factors, each factor with three genotypes creates a contingency table
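
As a small worked example of the $3^{k}$ genotype combinations, the following sketch enumerates the cells for $k$ loci and counts cases and controls in each cell of a two-locus contingency table. The four samples and their labels are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

GENOTYPES = (1, 2, 3)  # AA, Aa/aA, aa, as coded in the text

def genotype_cells(k):
    """All 3**k multi-locus genotype combinations for k loci."""
    return list(product(GENOTYPES, repeat=k))

def contingency_table(samples, labels, loci):
    """Map each genotype cell of the chosen loci to (number of cases, number of controls)."""
    counts = Counter()
    for x, y in zip(samples, labels):
        counts[(tuple(x[i] for i in loci), y)] += 1
    return {cell: (counts[(cell, 1)], counts[(cell, 0)])
            for cell in genotype_cells(len(loci))}

# Illustrative data: 4 subjects, m = 3 SNPs, label 1 = case and 0 = control.
SAMPLES = [(1, 2, 3), (1, 2, 1), (3, 3, 2), (1, 2, 2)]
LABELS = [1, 0, 1, 1]

table = contingency_table(SAMPLES, LABELS, loci=(0, 1))
# Every two-locus combination that can be drawn from the m = 3 factors.
pairs = list(combinations(range(3), 2))
```

For two loci the table has 9 cells, and for a pool of $m$ factors there are $m(m-1)/2$ such tables, one per pair of loci.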

Data input

As defined in Definitions 1 and 5, case-control datasets comprise $s$ samples with $m$ factors and a class label, which takes the value 0 (control) or 1 (case). Each factor is a SNP at a locus. Consider SNP A (AA, Aa/aA, and aa) at locus A and SNP B (BB, Bb/bB, and bb) at locus B, which are used to generate simulated datasets for six different two-locus epistasis models. A number of simulated datasets (in the presence and absence of noise) are generated under various scenarios for the six epistasis models. The
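
How a case-control dataset can be simulated from a two-locus model is sketched below. The penetrance table is a hypothetical XOR-like example chosen because it has no main effect at a minor allele frequency of 0.5; it is not claimed to be one of the six epistasis models used in the study, and the sampling scheme (Hardy-Weinberg genotypes, rejection sampling until the case and control quotas are filled) is likewise an assumption.

```python
import random

# Hypothetical penetrance table P(case | genotype at A, genotype at B).
PENETRANCE = {
    (1, 1): 0.0, (1, 2): 0.1, (1, 3): 0.0,
    (2, 1): 0.1, (2, 2): 0.0, (2, 3): 0.1,
    (3, 1): 0.0, (3, 2): 0.1, (3, 3): 0.0,
}

def marginal_penetrance(maf=0.5):
    """Penetrance of each genotype at locus A, averaged over locus B."""
    freq = {1: (1 - maf) ** 2, 2: 2 * maf * (1 - maf), 3: maf ** 2}
    return {a: sum(freq[b] * PENETRANCE[(a, b)] for b in freq) for a in freq}

def sample_genotype(maf, rng):
    """Draw a genotype under Hardy-Weinberg equilibrium, coded 1, 2, or 3."""
    minor_alleles = (rng.random() < maf) + (rng.random() < maf)
    return 1 + minor_alleles

def simulate(n_cases, n_controls, m, maf=0.5, seed=0):
    """Generate samples with m SNPs; SNPs 0 and 1 are the interacting pair."""
    rng = random.Random(seed)
    samples, labels = [], []
    need = {1: n_cases, 0: n_controls}
    while need[0] > 0 or need[1] > 0:
        x = tuple(sample_genotype(maf, rng) for _ in range(m))
        y = 1 if rng.random() < PENETRANCE[(x[0], x[1])] else 0
        if need[y] > 0:  # keep the subject only if its class quota is open
            samples.append(x)
            labels.append(y)
            need[y] -= 1
    return samples, labels

samples, labels = simulate(n_cases=20, n_controls=20, m=5)
marginals = marginal_penetrance()
```

The marginal penetrances are equal for all three genotypes, which is what "absence of main effect" means for this toy model: neither SNP is informative on its own.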

Evaluation

The performance of the models during training, validation, and testing is evaluated by determining each model’s metrics. The performance of the models in the presence of noise due to genotyping error (GE), phenocopy (PC), genetic heterogeneity (GH), and missing data (MS) is also evaluated. Training speed and model execution time are evaluated by varying the width and depth of the network, along with various activation functions. The overall best model with the highest prediction accuracy and lowest logloss, along with the highest cross validation consistency (CVC), is
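
The three selection criteria named above can be written down compactly. The sketch below uses the usual definitions of accuracy and logloss, and treats CVC as the number of cross-validation folds in which the same model is selected as best; the fold results shown are hypothetical and serve only to exercise the functions.

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_case, eps=1e-15):
    """Mean negative log-likelihood of the true labels given P(case)."""
    total = 0.0
    for t, p in zip(y_true, p_case):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

def cross_validation_consistency(best_per_fold):
    """Return the most frequently selected model and the number of folds selecting it."""
    model, count = Counter(best_per_fold).most_common(1)[0]
    return model, count

# Hypothetical outcome of a 10-fold cross validation.
folds = [("SNP5", "SNP10")] * 8 + [("SNP3", "SNP4")] * 2
best_model, cvc = cross_validation_consistency(folds)
```

A lower logloss indicates better calibrated class probabilities, while a CVC close to the number of folds indicates that the selected interaction is stable across data splits.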

Results

A number of simulated datasets in the presence and absence of noise due to GE, GH, PC, and MS, and their combined effects, are evaluated on the DNN-RF method. GE is generated for 5% of the data, with overrepresentation of one allele. GH is simulated for 50% of the data with two different two-locus combinations (SNP5-SNP10 and SNP3-SNP4) to increase the risk of a disease. PC is generated for 50% of the cases, which are considered to have low-risk genotypes according to the epistasis models. The cases are
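
Two of these noise sources can be sketched on a genotype matrix as follows. The exact simulation procedure is not given in this excerpt, so the mechanisms below are assumptions: missing data are marked with `None`, and a genotyping error shifts a call one step towards genotype 1 so that one allele becomes overrepresented. The 5% rate follows the text; the function names and the all-heterozygous toy matrix are hypothetical.

```python
import random

def pick_cells(data, rate, rng):
    """Choose round(rate * N) distinct (row, column) positions out of N cells."""
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    return rng.sample(cells, round(rate * len(cells)))

def inject_missing(data, rate, seed=0):
    """Return a copy of the data with a fraction of genotype calls set to None."""
    rng = random.Random(seed)
    noisy = [list(row) for row in data]
    for i, j in pick_cells(data, rate, rng):
        noisy[i][j] = None
    return noisy

def inject_genotyping_error(data, rate, seed=0):
    """Return a copy with a fraction of calls shifted one step towards genotype 1."""
    rng = random.Random(seed)
    noisy = [list(row) for row in data]
    for i, j in pick_cells(data, rate, rng):
        # Assumed error mechanism: 3 -> 2 and 2 -> 1; calls already at 1 are unchanged.
        noisy[i][j] = max(1, noisy[i][j] - 1)
    return noisy

# Toy genotype matrix: 20 subjects x 10 SNPs, all heterozygous (coded 2).
CLEAN = [[2] * 10 for _ in range(20)]
with_missing = inject_missing(CLEAN, rate=0.05)
with_errors = inject_genotyping_error(CLEAN, rate=0.05)
```

Both functions return a copy, so the clean data remain available for comparing power with and without noise.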

Discussion and conclusions

An important confounding factor in case-control datasets is population stratification, which can lead to spurious associations between SNPs. The studies presented in this paper are restricted to homogeneous populations. Improving the ability of the proposed method to handle population stratification could be a major aspect to consider. In GWAS, a number of approaches have been successfully implemented to detect main effects while controlling for population stratification [59]. However, only few

Authors’ contributions

Suneetha Uppu made substantial contributions to the conception, design, and acquisition of datasets, and implemented and evaluated the proposed method by analysing and interpreting the results.

All the authors contributed to preparing the manuscript, interpreting the results, and revising it critically for important intellectual content.

Availability of data and materials

All data generated or analyzed during this study are included within this article. The simulated datasets are available upon request.

Acknowledgments

We would like to acknowledge the contribution of the Australian Government’s Commonwealth Research Funding in supporting this research. We thank John Wallace from Ritchie Lab, Pennsylvania State University, for his expert assistance in simulating the datasets in the presence of common sources of noise for all six epistasis models described by Dr. Jason Moore. We also thank Dr. Ryan Urbanowicz, Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania for

References (68)

E.S. Gusareva et al.
Genome-wide association interaction analysis for Alzheimer’s disease
Neurobiol. Aging
(2014)
M.D. Ritchie et al.
Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer
Am. J. Hum. Genet.
(2001)
X.-Y. Lou et al.
A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence
Am. J. Hum. Genet.
(2007)
M.D. Ritchie et al.
Genetic programming neural networks: a powerful bioinformatics tool for human genetics
Appl. Soft Comput.
(2007)
Y. Shen et al.
Support vector machines with L1 penalty for detecting gene–gene interactions
Int. J. Data Min. Bioinform.
(2012)
S. Purcell et al.
PLINK: a tool set for whole-genome association and population-based linkage analyses
Am. J. Hum. Genet.
(2007)
R. Culverhouse et al.
A perspective on epistasis: limits of models displaying no main effect
Am. J. Hum. Genet.
(2002)
S. Bhattacharjee et al.
Using principal components of genetic variation for robust and powerful detection of gene-gene interactions in case-control and case-only studies
Am. J. Hum. Genet.
(2010)
X. Wan et al.
BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies
Am. J. Hum. Genet.
(2010)
W.S. Bush et al.
Genome-wide association studies
PLoS Comput. Biol.
(2012)
National Library of Medicine (US)
Genetics Home Reference [Internet], Vol. 2013 Sep 16
(2016)
L. Padyukov
Between the Lines of Genetic Code: Genetic Interactions in Understanding Disease and Complex Phenotypes
(2013)
H.J. Cordell
Detecting gene–gene interactions that underlie human diseases
Nat. Rev. Genet.
(2009)
C.L. Koo et al.
A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology
Biomed. Res. Int.
(2013)
S. Uppu et al.
A review on methods for detecting SNP interactions in high-dimensional genomic data
IEEE/ACM Trans. Comput. Biol. Bioinf.
(2018)
Y. Chung et al.
Odds ratio based multifactor-dimensionality reduction method for detecting gene–gene interactions
Bioinformatics
(2007)
M.L. Calle et al.
MB-MDR: Model-Based Multifactor Dimensionality Reduction for Detecting Interactions in High-Dimensional Genomic Data
(2008)
J. Gui et al.
A robust multifactor dimensionality reduction method for detecting gene–gene interactions with application to the genetic analysis of bladder cancer susceptibility
Ann. Hum. Genet.
(2011)
D.F. Schwarz et al.
On safari to random jungle: a fast implementation of random forests for high-dimensional data
Bioinformatics
(2010)
R. Jiang et al.
A random forest approach to the detection of epistatic interactions in case-control studies
BMC Bioinf.
(2009)
L. De Lobel et al.
A screening methodology based on random forests to improve the detection of gene–gene interactions
Eur. J. Hum. Genet.
(2010)
M. Yoshida et al.
SNPInterForest: a new method for detecting epistatic interactions
BMC Bioinf.
(2011)
C. Yang et al.
SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies
Bioinformatics
(2009)
X. Zhang et al.
TEAM: efficient two-locus epistasis tests in human genome-wide association study
Bioinformatics
(2010)
H.Y. Lin et al.
TRM: a powerful two‐stage machine learning approach for identifying SNP‐SNP interactions
Ann. Hum. Genet.
(2012)
A.A. Motsinger et al.
Comparison of neural network optimization approaches for studies of human genetics
Applications of Evolutionary Computing
(2006)
N.E. Hardison et al.
The power of quantitative grammatical evolution neural networks to detect gene-gene interactions
Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, ACM
(2011)
S.H. Chen et al.
A support vector machine approach for detecting gene‐gene interaction
Genet. Epidemiol.
(2008)
A. Özgür et al.
Identifying gene-disease associations using centrality on a literature mined gene-interaction network
Bioinformatics
(2008)
Y.H. Fang et al.
SVM‐based generalized multifactor dimensionality reduction approaches for detecting gene‐gene interactions in family studies
Genet. Epidemiol.
(2012)
H. Zhang et al.
Improving accuracy for cancer classification with a new algorithm for genes selection
BMC Bioinform.
(2012)
H. Schwender et al.
Identification of SNP interactions using logic regression
Biostatistics
(2008)
C. Kooperberg et al.
Identifying interacting SNPs using Monte Carlo logic regression
Genet. Epidemiol.
(2005)
C.C. Chen et al.
Methods for identifying SNP interactions: a review on variations of logic regression, random forest and Bayesian logistic regression
IEEE/ACM Trans. Comput. Biol. Bioinform.
(2011)
