Personalized prediction of genes with tumor-causing somatic mutations based on multi-modal deep Boltzmann machine

doi:10.1016/j.neucom.2018.02.096

Neurocomputing

Volume 324, 9 January 2019, Pages 51-62

https://doi.org/10.1016/j.neucom.2018.02.096

Abstract

When diagnosed at an advanced stage, most cancer patients suffer from treatment failure, recurrence and low survival. Taking advantage of high-throughput sequencing and deep learning techniques, we developed an early cancer monitoring method based on a multi-modal deep Boltzmann machine to (1) learn the association between matched germline and somatic mutations captured by whole exome sequencing from available samples of cancer patients, and (2) predict patient-specific high-risk genes whose somatic mutations are required to drive normal tissues to a tumor state. Our experiments on a set of breast cancer samples show that our method significantly outperforms the currently used frequency-based method in the personalized prediction of genes carrying critical mutations.

Introduction

A majority of cancers are diagnosed at a middle or late stage, by which time most tumors have spread and become incurable. Consequently, with some notable exceptions, improvements in overall survival and morbidity over the past few decades have been modest. To meet the challenge of the expected surge in cancer cases, it is envisioned that, besides promoting lifestyle changes, improving early diagnosis is the best strategy for reducing the impact of carcinogenesis.

Both germline mutations inherited from an individual's parents and post-zygotic somatic mutations gradually accumulated during the development and life of that individual contribute to the onset of cancer [1]. The mutational portrait of a cancer patient reflects the history and present state of the tumor genome. Cancers are caused by critical disruption of a range of pathways [2], [3]. Some mutational signatures are shared across many cancer types, while a specific cancer may also have unique signatures [4], [5]. Even within one cancer type, patients with different subtypes show distinct mutational makeups.

Recent studies reveal that tumor cells shed circulating tumor DNA (ctDNA) into the blood. The development of deep sequencing technologies is enabling the detection of such cell-free DNA (cfDNA) [6], [7]. The convergence of mutational signatures and cfDNA detection implies a promising noninvasive method for personalized diagnosis, treatment and prognosis. However, to date, no computational method has been developed to construct clinically useful predictive models from individualized genome sequencing data, mostly because of excessive variability in the identity of mutated genes even within tumors of the same type.

Deep learning [8] approaches have achieved unprecedented success in modelling and understanding complex data such as text, images, video, time series, natural language, and ultra-high-throughput genome sequencing data. As disruptive technologies, deep learning algorithms have changed the way people view machine intelligence, and have demonstrated significant improvements over traditional computational methods as well as unique potential for solving previously intractable problems. Of particular interest for this paper, the most attractive strengths of deep learning models include their capability to model the joint distribution or association among observations from multiple perspectives, and to infer missing information from partially observed data [9]. This inspired us to develop a machine learning approach for the personalized prediction of cancer-causing genes in early cancer diagnosis based on cfDNA monitoring techniques.

The objective of our long-term research is to develop and validate a genome-based diagnostic test that is able to precisely monitor healthy individuals for the appearance of cancer-driving mutations (see Fig. 1). To pursue part of this goal, in this paper we propose a multi-modal deep Boltzmann machine (MDBM) approach for the personalized prediction of key genes whose somatic mutations are required for the first steps of malignant transformation. An MDBM model first learns the associations between a sample’s germline and founding mutational profiles using available training data. Then, in the test phase, it uses an individual’s germline genetic makeup to predict genes with cancer-causing founding mutations specific to this individual.

Related work

Before presenting our method, it is necessary to briefly review the restricted Boltzmann machine (RBM) and the deep Boltzmann machine (DBM), because they serve as building blocks of the multi-modal deep Boltzmann machine. Here, we limit our discussion to models for binary data. For discussions from the exponential family perspective, please refer to [10].
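For orientation, the textbook formulation of a binary RBM is sketched below. It is the standard form found in the literature, not reproduced from the full text of this article, and the symbols (visible units v, hidden units h, weights W, biases a and b, logistic function σ) are notation assumed here.

```latex
% Energy and joint distribution of a binary RBM (Z is the partition function)
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top}\mathbf{W}\mathbf{h},
\qquad
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)

% Conditionals factorize because there are no within-layer connections
p(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i v_i W_{ij}\Bigr),
\qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(a_i + \sum_j W_{ij} h_j\Bigr),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

A DBM stacks further hidden layers on top, with undirected connections only between adjacent layers, so each layer's conditional depends on the layers directly above and below it.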

Method

In this section, we present the formulation, the learning and inference algorithms, and the application of the multi-modal deep Boltzmann machine (MDBM) to the modelling of cancer mutation data.
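The idea of learning a joint model over two mutation modalities and then inferring one from the other can be illustrated with a much simpler stand-in. The sketch below is not the MDBM described in this article: it is a single joint RBM over concatenated germline and somatic binary vectors, trained with one-step contrastive divergence and queried by clamping the germline part. The class name `BimodalRBM`, its parameters and the random toy data are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BimodalRBM:
    """Toy joint RBM over [germline | somatic] binary vectors.

    A simplification for illustration only, not the article's MDBM.
    """

    def __init__(self, n_germline, n_somatic, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_g = n_germline
        n_visible = n_germline + n_somatic
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)  # visible biases
        self.b = np.zeros(n_hidden)   # hidden biases

    def hidden_prob(self, v):
        return sigmoid(v @ self.W + self.b)

    def visible_prob(self, h):
        return sigmoid(h @ self.W.T + self.a)

    def fit(self, germline, somatic, epochs=50, lr=0.05):
        """Full-batch CD-1 updates on the concatenated modalities."""
        v0 = np.hstack([germline, somatic]).astype(float)
        n = v0.shape[0]
        for _ in range(epochs):
            ph0 = self.hidden_prob(v0)
            h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
            v1 = self.visible_prob(h0)   # mean-field reconstruction
            ph1 = self.hidden_prob(v1)
            self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
            self.a += lr * (v0 - v1).mean(axis=0)
            self.b += lr * (ph0 - ph1).mean(axis=0)
        return self

    def predict_somatic(self, germline, n_iter=20):
        """Clamp the germline units and iterate mean-field updates
        on the unobserved somatic units."""
        g = np.asarray(germline, dtype=float)
        n_s = self.W.shape[0] - self.n_g
        s = np.full((g.shape[0], n_s), 0.5)  # uninformative start
        for _ in range(n_iter):
            h = self.hidden_prob(np.hstack([g, s]))
            s = self.visible_prob(h)[:, self.n_g:]
        return s  # per-gene scores in [0, 1]

# Random toy data: 100 samples, 8 germline genes, 5 somatic genes.
rng = np.random.default_rng(1)
germline = (rng.random((100, 8)) < 0.3).astype(int)
somatic = (rng.random((100, 5)) < 0.2).astype(int)
model = BimodalRBM(8, 5, n_hidden=6).fit(germline, somatic)
scores = model.predict_somatic(germline[:3])
print(scores.shape)
```

Ranking the columns of `scores` for one row gives a patient-specific gene ordering, which is the kind of output a personalized predictor would be evaluated on. A deeper multi-modal model would replace the single hidden layer with modality-specific pathways joined by a shared layer.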

Computational experiments

We investigated the performance of the three MDBM models (Fig. 4) on the TCGA breast cancer WES data represented at the gene level. The data, computational experiments and analysis are described in turn below.

Discussions and conclusions

Due to the heterogeneous nature of cancer, personalized early cancer detection is critical for successful treatment. In this study, we propose a procedure for personalized monitoring of cancer-causing genes powered by deep generative models. On our processed germline-somatic WES data for breast cancer, our preliminary computational experiments indicate that our personalized prediction method is more accurate than the traditionally applied frequency-based method.
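To make the comparison concrete, a frequency-based baseline can be sketched as follows. This excerpt does not define the baseline precisely, so the version here is an assumption: genes are ranked by how often they are somatically mutated in the training cohort, and the same top-k list is returned for every patient. The function names and the precision-at-k score are hypothetical choices for illustration, not the article's evaluation protocol.

```python
import numpy as np

def frequency_baseline(train_somatic, top_k):
    """Rank genes by somatic mutation frequency in the training cohort.

    Assumed form of a frequency-based predictor: the result does not
    depend on the individual patient.
    """
    freq = np.asarray(train_somatic, dtype=float).mean(axis=0)
    # Stable sort on negated frequency: ties keep the lower gene index first.
    return np.argsort(-freq, kind="stable")[:top_k]

def precision_at_k(predicted_genes, true_somatic_row):
    """Fraction of predicted genes that are actually mutated in one patient."""
    truth = np.asarray(true_somatic_row).astype(bool)
    return float(truth[predicted_genes].mean())

# Toy cohort: 3 patients x 4 genes (1 = gene carries a somatic mutation).
train = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0]])
top2 = frequency_baseline(train, top_k=2)
print(top2)                            # [0 1]
print(precision_at_k(top2, train[0]))  # 0.5
```

Because this ranking is identical for every patient, it misses gene 2 for the first patient; a personalized model can only improve on it by using patient-specific input such as the germline profile.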

The success of MDBM in

Acknowledgement

This study was supported by the National Research Council Canada Ideation Program. We thank the anonymous reviewers for their constructive comments that helped improve this article.

Yifeng Li is a Research Officer in the Scientific Data Mining Team of the Digital Technologies Research Centre, National Research Council of Canada (NRC). Prior to joining NRC, Dr. Li was a postdoctoral fellow at the Wasserman Laboratory of the Centre for Molecular Medicine and Therapeutics, University of British Columbia, Canada. He obtained his Ph.D. from the School of Computer Science, University of Windsor, Canada, in 2013. His doctoral dissertation was recognized by the Governor General’s

References (104)

D. Barnett et al.
BamTools: A C++ API and toolkit for analyzing and managing BAM files
Bioinformatics
(2011)
S. Park et al.
Clinical relevance and molecular phenotypes in gastric cancer, of TP53 mutations and gene expressions, in combination with other gene mutations
Sci. Rep.
(2016)
E. Yousef et al.
MCM2: An alternative to Ki-67 for measuring breast cancer cell proliferation
Modern Pathol.
(2017)
J. Adnane et al.
BEK and FLG, two receptors to members of the FGF family, are amplified in subsets of human breast cancers
Oncogene
(1991)
F. D’Avila et al.
Exome sequencing identifies variants in two genes encoding the LIM-proteins NRAP and FHL1 in an Italian patient with BAG3 myofibrillar myopathy
J. Muscle Res. Cell Motil.
(2016)
H. Wu et al.
Identifying overlapping mutated driver pathways by constructing gene networks in cancer
BMC Bioinformatics
(2015)
R. Artuso et al.
Investigation of modifier genes within copy number variations in Rett syndrome
J. Human Genetics
(2011)
L. Bartoloni et al.
Mutations in the DNAH11 (axonemal heavy chain dynein type 11) gene cause one form of situs inversus totalis and most likely primary ciliary dyskinesia
PNAS
(2002)
M. Wang et al.
AHNAK2 is a novel prognostic marker and oncogenic protein for clear cell renal cell carcinoma
Theranostics
(2017)
M. Wirtenberger et al.
Association of genetic variants in the Rho guanine nucleotide exchange factor AKAP13 with familial breast cancer
Carcinogenesis
(2006)

I. Lee et al.
Unconventional role of the inwardly rectifying potassium channel Kir2.2 as a constitutive activator of RelA in cancer
Cancer Res.
(2012)
D. Freed et al.
Somatic mosaicism in the human genome
Genes
(2014)
B. Vogelstein et al.
Cancer genome landscapes
Science
(2013)
T. Helleday et al.
Mechanisms underlying mutational signatures in human cancers
Nat. Rev. Genet.
(2014)
L. Alexandrov et al.
Signatures of mutational processes in human cancer
Nature
(2013)
L. Alexandrov et al.
Mutational signatures associated with tobacco smoking in human cancer
Science
(2016)
A. Newman et al.
Integrated digital error suppression for improved detection of circulating tumor DNA
Nat. Biotechnol.
(2016)
J. Wan et al.
Liquid biopsies come of age: Towards implementation of circulating tumour DNA
Nat. Rev. Cancer
(2017)
Y. LeCun et al.
Deep learning
Nature
(2015)
Y. Li et al.
A review on machine learning principles for multi-view biological data integration
Brief. Bioinform.
(2018)
Y. Li et al.
Exponential family restricted Boltzmann machines and annealed importance sampling
Proceedings of the International Joint Conference on Neural Networks
(2018)
P. Smolensky
Information processing in dynamical systems: Foundations of harmony theory
G. Hinton
Training products of experts by minimizing contrastive divergence
Neural Comput.
(2002)
T. Tieleman
Training restricted Boltzmann machines using approximations to the likelihood gradient
Proceedings of the International Conference on Machine Learning
(2008)
R. Salakhutdinov et al.
On the quantitative analysis of deep belief networks
Proceedings of the International Conference on Machine Learning
(2008)
G. Hinton et al.
A fast learning algorithm for deep belief nets
Neural Comput.
(2006)
R. Salakhutdinov et al.
Deep Boltzmann machine
Proceedings of the International Conference on Artificial Intelligence and Statistics
(2009)
Y. Bengio et al.
Greedy layer-wise training of deep networks
Advances in Neural Information Processing Systems
(2006)
N. Srivastava et al.
Multimodal learning with deep Boltzmann machines
J. Mach. Learn. Res.
(2014)
J. Fuster
Cortex and Mind
(2003)
C.M. Bishop
Pattern Recognition and Machine Learning
(2009)
The Cancer Genome Atlas Network
Comprehensive molecular portraits of human breast tumours
Nature
(2012)
N. Zaman et al.
Signaling network assessment of mutations and copy number variations predict breast cancer subtype-specific drug targets
Cell Rep.
(2013)
M. DePristo et al.
A framework for variation discovery and genotyping using next-generation DNA sequencing data
Nat. Genet.
(2011)
D. Koboldt et al.
VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing
Genome Res.
(2012)
L. Bao et al.
AbsCN-seq: A statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data
Bioinformatics
(2014)
W. McLaren et al.
The Ensembl Variant Effect Predictor
Genome Biol.
(2016)
J. Leja et al.
Novel markers for enterochromaffin cells and gastrointestinal neuroendocrine carcinomas
Modern Pathol.
(2009)
G. Aust et al.
Adhesion GPCRs in tumorigenesis
Handbook of Experimental Pharmacology
(2016)
M. McLemore et al.
Introducing the MUC16 gene: Implications for prevention and early detection in epithelial ovarian cancer
Biol. Res. Nursing
(2005)
L. Norum et al.
Elevated CA 125 in breast cancer - a sign of advanced disease
Tumor Biol.
(2001)
I. Lakshmanan et al.
MUC16 induced rapid G2/M transition via interactions with JAK2 for increased proliferation and anti-apoptosis in breast cancer cells
Oncogene
(2012)
M. Felder et al.
MUC16 (CA125): Tumor biomarker to cancer therapy, a work in progress
Molecular Cancer
(2014)
J. Ringel et al.
The MUC gene family: Their role in diagnosis and early detection of pancreatic cancer
Molecular Cancer
(2003)
C. Machado et al.
D-Titin: A giant protein with dual roles in chromosomes and muscles
J. Cell Biol.
(2000)
M. Hofree et al.
Network-based stratification of tumor mutations
Nat. Methods
(2013)
T. Langenhan et al.
Adhesion G protein-coupled receptors in nervous system development and disease
Nat. Rev. Neurosci.
(2016)
Y. Wang et al.
Deficiency of very large G-protein-coupled receptor-1 is a risk factor of tumor-related epilepsy: A whole transcriptome sequencing analysis
Journal of Neuro-Oncology
(2015)
V. Melotte et al.
Spectrin repeat containing nuclear envelope 1 and forkhead box protein E1 are promising markers for the detection of colorectal cancer in blood
Cancer Prevent. Res.
(2015)
M. Colozza et al.
Proliferative markers as prognostic and predictive tools in early breast cancer: Where are we now?
Annals Oncol.
(2005)

François Fauteux has been a Research Officer at National Research Council Canada since 2009. He is currently part of the Scientific Data Mining team of the Digital Technologies Research Centre and a Principal Investigator in the Biologics and Biomanufacturing Program, Target Selection and Prioritization. He obtained his Ph.D. in Plant Science – Bioinformatics from McGill University in 2009. His current research interests include genomics, transcriptomics and proteomics data mining for cancer drug discovery and crop improvement.

Jinfeng Zou received the B.S. degree in Computer Science from Harbin Normal University, China, in 2000, the M.S. degree in Bioinformatics from Harbin Medical University, China, in 2009, and the Ph.D. degree in Biomedical Engineering from the University of Electronic Science and Technology of China in 2012. She is currently a Research Associate Officer at National Research Council Canada. Her research interests include biomarker discovery for cancer recurrence, treatment response and early detection, and personalized medicine based on high-throughput omics data.

André Nantel has been at the National Research Council of Canada since 1995 and is currently the Head of the Cellular and Molecular Pharmacology section within the Human Health Therapeutics Research Centre. Most of his research is in the field of functional genomics applied to cancer, fungal pathogenesis and protein production in CHO cells.

Youlian Pan holds a Ph.D. in Biology and a Master of Computer Science, both from Dalhousie University, Halifax, Canada. He is currently a Senior Research Scientist in Data Mining at the National Research Council of Canada. His research interests are in data mining and machine learning with biological applications, specifically high-throughput sequencing data, gene expression profiling and systems biology, with the objective of discovering and developing biomarkers related to human diseases and to crop stress tolerance in adverse environments.
