Neurocomputing

Volume 324, 9 January 2019, Pages 51-62
Personalized prediction of genes with tumor-causing somatic mutations based on multi-modal deep Boltzmann machine

https://doi.org/10.1016/j.neucom.2018.02.096

Abstract

When diagnosed at an advanced stage, most cancer patients suffer from treatment failure, recurrences and low survival. Taking advantage of high-throughput sequencing and deep learning techniques, we developed an early cancer monitoring method based on a multi-modal deep Boltzmann machine to (1) learn the association between matched germline and somatic mutations captured by whole exome sequencing from available samples of cancer patients, and (2) predict patient-specific high-risk genes whose somatic mutations are required to drive normal tissues to a tumor state. Our experiments on a set of breast cancer samples show that our method significantly outperforms the currently used frequency-based method in the personalized prediction of genes carrying critical mutations.

Introduction

A majority of cancers are diagnosed at a middle or late stage, by which time most tumors have spread and become incurable. Consequently, with some notable exceptions, improvements in overall survival and morbidity over the past few decades have been modest. To meet the challenges of the expected surge in cancer cases, it is envisioned that, besides the promotion of lifestyle changes, improving early diagnosis is the best strategy for reducing the impact of carcinogenesis.

Both germline mutations inherited from a patient's parents and post-zygotic somatic mutations gradually accumulated during the development and life of an individual contribute to the onset of cancer [1]. The mutational portrait of a cancer patient reflects the history and present state of the tumor genome. Cancers are caused by critical disruption of a range of pathways [2], [3]. Across all types of cancers, some mutational signatures are shared by many, while a specific cancer may have unique signatures as well [4], [5]. Even within a single cancer type, patients with different subtypes show distinct mutational makeups.

Recent studies reveal that tumor cells shed circulating tumor DNA (ctDNA) into the blood. The development of deep sequencing technologies is enabling the detection of such cell-free DNA (cfDNA) [6], [7]. The convergence of mutational signatures and cfDNA detection implies a promising noninvasive method for personalized diagnosis, treatment and prognosis. However, to date, no computational method has been developed to construct clinically useful predictive models from individualized genome sequencing data, mostly because of excessive variability in the identity of mutated genes even within tumors of the same type.

Deep learning [8] approaches have achieved unprecedented successes in modelling and understanding complex data such as text, images, video, time-series, natural languages, and ultra-throughput sequencing data of genomes. As disruptive technologies, deep learning algorithms have revolutionized the way people view machine intelligence, and have demonstrated significant improvements over traditional computational methods and unique potential in solving previously unsolvable complex problems. Of particular interest for this paper, the most attractive strengths of deep learning models include their capability of modelling the joint distribution or association among observations from multiple perspectives, and of inferring missing information given partially observed data [9]. These strengths inspired us to develop a machine learning approach for the personalized prediction of cancer-causing genes in early cancer diagnosis based on cfDNA monitoring techniques.

The objective of our long-term research is to develop and validate a genome-based diagnostic test that is able to precisely monitor healthy individuals for the appearance of cancer-driving mutations (see Fig. 1). To pursue part of this goal, specifically in this paper, we propose a multi-modal deep Boltzmann machine (MDBM) approach for the personalized prediction of key genes whose somatic mutations are required for the first steps of malignant transformation. An MDBM model first learns the associations between a sample’s germline and founding mutational profiles using available training data. Then, in the test phase, it uses an individual’s germline genetic makeup to predict genes with cancer-causing founding mutations specific to this individual.
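The test-phase idea of predicting one modality from the other can be illustrated with a much simpler stand-in than an MDBM. The sketch below assumes a single joint RBM whose visible layer is split into a germline block and a somatic block, clamps the germline block, and runs mean-field updates to estimate somatic mutation probabilities. The function name, parameter names and weight shapes are hypothetical, and this is not the authors' model or inference procedure.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def infer_somatic(germline, W_g, W_s, a_s, b, n_iter=20):
    """Mean-field estimate of per-gene somatic mutation probabilities.

    Assumed (hypothetical) parameter shapes for a joint binary RBM:
      W_g: (n_germline, n_hidden)  germline-to-hidden weights
      W_s: (n_somatic, n_hidden)   somatic-to-hidden weights
      a_s: (n_somatic,)            somatic visible biases
      b:   (n_hidden,)             hidden biases
    The germline vector stays clamped to its observed binary values.
    """
    germline = np.asarray(germline, dtype=float)
    s = np.full(W_s.shape[0], 0.5)  # uninformative starting guess
    for _ in range(n_iter):
        # Hidden activation given clamped germline and current somatic guess
        h = sigmoid(b + germline @ W_g + s @ W_s)
        # Updated somatic probabilities given the hidden activation
        s = sigmoid(a_s + h @ W_s.T)
    return s
```

Genes could then be ranked per individual by the returned probabilities, which is what makes the prediction personalized: two individuals with different germline vectors generally receive different rankings.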

Related work

Before presenting our method, it is necessary to briefly review the restricted Boltzmann machine (RBM) and the deep Boltzmann machine (DBM), because they serve as building blocks of the multi-modal deep Boltzmann machine. Here, we shall limit our discussion to models for binary data. For discussions from the exponential family perspective, please refer to [10].
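For readers unfamiliar with the binary RBM, the following is a minimal NumPy sketch of the standard textbook formulation: the energy E(v, h) = -a·v - b·h - v·W·h, the two factorized conditionals, and one contrastive divergence (CD-1) update. The class name, the symbols W, a and b, and the learning rate are assumptions chosen for illustration; the snippet above does not state the authors' notation or training settings.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class BinaryRBM:
    """Toy RBM with binary visible and hidden units (illustrative only)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)  # visible biases
        self.b = np.zeros(n_hidden)   # hidden biases

    def energy(self, v, h):
        # E(v, h) = -a.v - b.h - v.W.h for a single (v, h) pair
        return -(v @ self.a) - (h @ self.b) - (v @ self.W @ h)

    def p_h_given_v(self, v):
        # Hidden units are conditionally independent given v
        return sigmoid(self.b + v @ self.W)

    def p_v_given_h(self, h):
        # Visible units are conditionally independent given h
        return sigmoid(self.a + h @ self.W.T)

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 update on a batch v0 of shape (n_samples, n_visible)."""
        ph0 = self.p_h_given_v(v0)                              # positive phase
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)   # sample hidden
        pv1 = self.p_v_given_h(h0)                              # reconstruct
        v1 = (self.rng.random(pv1.shape) < pv1).astype(float)   # sample visible
        ph1 = self.p_h_given_v(v1)                              # negative phase
        n = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.a += lr * (v0 - v1).mean(axis=0)
        self.b += lr * (ph0 - ph1).mean(axis=0)
```

A DBM stacks several hidden layers with undirected connections between adjacent layers, so the same bipartite conditionals reappear layer by layer, which is why the RBM is usually introduced first.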

Method

In this section, we present the formulation, the learning and inference algorithms, and the application of the multi-modal deep Boltzmann machine (MDBM) to the modelling of cancer mutation data.

Computational experiments

We investigated the performance of the three MDBM models (Fig. 4) on the TCGA breast cancer WES data presented at gene level. The data, computational experiments and analysis are sequentially described below.

Discussions and conclusions

Due to the heterogeneous nature of cancer, personalized early cancer detection is critical for successful treatment. In this study, we propose a procedure for the personalized monitoring of cancer-causing genes powered by deep generative models. On our processed germline-somatic WES data for breast cancer, our preliminary computational experiments indicate that our personalized prediction method achieves higher accuracy than the traditionally applied frequency-based method.
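The frequency-based method is not specified in this snippet. A plausible reading, sketched here purely as an assumption, is a population-level baseline that ranks genes by how often they are somatically mutated in the training samples and predicts the same top-k genes for every patient. The function names and the precision-at-k score are hypothetical choices for illustration, not the evaluation protocol of the paper.

```python
import numpy as np


def frequency_baseline(train_somatic, top_k):
    """Rank genes by somatic mutation frequency across training samples.

    train_somatic: (n_samples, n_genes) binary matrix, 1 = gene mutated.
    Returns the indices of the top_k most frequently mutated genes and
    the per-gene frequencies. The same gene list is used for all patients.
    """
    freq = np.asarray(train_somatic, dtype=float).mean(axis=0)
    # Stable sort on negated frequency so tied genes keep their original order
    order = np.argsort(-freq, kind="stable")
    return order[:top_k], freq


def precision_at_k(predicted_genes, true_somatic_row):
    """Fraction of predicted genes that are truly mutated in one patient."""
    true_somatic_row = np.asarray(true_somatic_row, dtype=float)
    return float(true_somatic_row[predicted_genes].mean())


# Toy data: three training samples, four genes
train = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [1, 0, 1, 0]]
genes, freq = frequency_baseline(train, top_k=2)
score = precision_at_k(genes, [0, 1, 1, 0])
```

Because such a baseline ignores the individual's germline profile, it cannot adapt to patients whose mutated genes differ from the population's most frequent ones, which is the gap a personalized model aims to close.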

The success of MDBM in

Acknowledgement

This study was supported by the National Research Council Canada Ideation Program. We thank the anonymous reviewers for their constructive comments that helped improve this article.

References (104)

  • I. Lee et al.

    Unconventional role of the inwardly rectifying potassium channel Kir2.2 as a constitutive activator of RelA in cancer

    Cancer Res.

    (2012)
  • D. Freed et al.

    Somatic mosaicism in the human genome

    Genes

    (2014)
  • B. Vogelstein et al.

    Cancer genome landscapes

    Science

    (2013)
  • T. Helleday et al.

    Mechanisms underlying mutational signatures in human cancers

    Nat. Rev. Genet.

    (2014)
  • L. Alexandrov et al.

    Signatures of mutational processes in human cancer

    Nature

    (2013)
  • L. Alexandrov et al.

    Mutational signatures associated with tobacco smoking in human cancer

    Science

    (2016)
  • A. Newman et al.

    Integrated digital error suppression for improved detection of circulating tumor DNA

    Nat. Biotechnol.

    (2016)
  • J. Wan et al.

    Liquid biopsies come of age: Towards implementation of circulating tumour DNA

    Nat. Rev. Cancer

    (2017)
  • Y. LeCun et al.

    Deep learning

    Nature

    (2015)
  • Y. Li et al.

    A review on machine learning principles for multi-view biological data integration

    Brief. Bioinform.

    (2018)
  • Y. Li et al.

    Exponential family restricted Boltzmann machines and annealed importance sampling

    Proceedings of the International Joint Conference on Neural Networks

    (2018)
  • P. Smolensky

    Information processing in dynamical systems: Foundations of harmony theory

  • G. Hinton

    Training products of experts by minimizing contrastive divergence

    Neural Comput.

    (2002)
  • T. Tieleman

    Training restricted Boltzmann machines using approximations to the likelihood gradient

    Proceedings of the International Conference on Machine Learning

    (2008)
  • R. Salakhutdinov et al.

    On the quantitative analysis of deep belief networks

    Proceedings of the International Conference on Machine Learning

    (2008)
  • G. Hinton et al.

    A fast learning algorithm for deep belief nets

    Neural Comput.

    (2006)
  • R. Salakhutdinov et al.

    Deep Boltzmann machines

    Proceedings of the International Conference on Artificial Intelligence and Statistics

    (2009)
  • Y. Bengio et al.

    Greedy layer-wise training of deep networks

    Advances in Neural Information Processing Systems

    (2006)
  • N. Srivastava et al.

    Multimodal learning with deep Boltzmann machines

    J. Mach. Learn. Res.

    (2014)
  • J. Fuster

    Cortex and Mind

    (2003)
  • C.M. Bishop

    Pattern Recognition and Machine Learning

    (2009)
  • The Cancer Genome Atlas Network

    Comprehensive molecular portraits of human breast tumours

    Nature

    (2012)
  • N. Zaman et al.

    Signaling network assessment of mutations and copy number variations predict breast cancer subtype-specific drug targets

    Cell Rep.

    (2013)
  • M. DePristo et al.

    A framework for variation discovery and genotyping using next-generation DNA sequencing data

    Nat. Genet.

    (2011)
  • D. Koboldt et al.

    VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing

    Genome Res.

    (2012)
  • L. Bao et al.

    AbsCN-seq: A statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data

    Bioinformatics

    (2014)
  • W. McLaren et al.

    The Ensembl Variant Effect Predictor

    Genome Biol.

    (2016)
  • J. Leja et al.

    Novel markers for enterochromaffin cells and gastrointestinal neuroendocrine carcinomas

    Modern Pathol.

    (2009)
  • G. Aust et al.

    Adhesion GPCRs in tumorigenesis

    Handbook of Experimental Pharmacology

    (2016)
  • M. McLemore et al.

    Introducing the MUC16 gene: Implications for prevention and early detection in epithelial ovarian cancer

    Biol. Res. Nursing

    (2005)
  • L. Norum et al.

    Elevated CA 125 in breast cancer - a sign of advanced disease

    Tumor Biol.

    (2001)
  • I. Lakshmanan et al.

    MUC16 induced rapid G2/M transition via interactions with JAK2 for increased proliferation and anti-apoptosis in breast cancer cells

    Oncogene

    (2012)
  • M. Felder et al.

    MUC16 (CA125): Tumor biomarker to cancer therapy, a work in progress

    Molecular Cancer

    (2014)
  • J. Ringel et al.

    The MUC gene family: Their role in diagnosis and early detection of pancreatic cancer

    Molecular Cancer

    (2003)
  • C. Machado et al.

    D-Titin: A giant protein with dual roles in chromosomes and muscles

    J. Cell Biol.

    (2000)
  • M. Hofree et al.

    Network-based stratification of tumor mutations

    Nat. Methods

    (2013)
  • T. Langenhan et al.

    Adhesion G protein-coupled receptors in nervous system development and disease

    Nat. Rev. Neurosci.

    (2016)
  • Y. Wang et al.

    Deficiency of very large G-protein-coupled receptor-1 is a risk factor of tumor-related epilepsy: A whole transcriptome sequencing analysis

    Journal of Neuro-Oncology

    (2015)
  • V. Melotte et al.

    Spectrin repeat containing nuclear envelope 1 and forkhead box protein E1 are promising markers for the detection of colorectal cancer in blood

    Cancer Prevent. Res.

    (2015)
  • M. Colozza et al.

    Proliferative markers as prognostic and predictive tools in early breast cancer: Where are we now?

    Annals Oncol.

    (2005)
    Yifeng Li is a Research Officer in the Scientific Data Mining Team of the Digital Technologies Research Centre, National Research Council of Canada (NRC). Prior to joining NRC, Dr. Li was a postdoctoral fellow at the Wasserman Laboratory of the Centre for Molecular Medicine and Therapeutics, University of British Columbia, Canada. He obtained his Ph.D. from the School of Computer Science, University of Windsor, Canada, in 2013. His doctoral dissertation was recognized by the Governor General’s Gold Medal. His research interests include deep neural networks, machine learning, big data integration, large-scale optimization, and bioinformatics.

    François Fauteux has been a Research Officer at National Research Council Canada since 2009. He is currently part of the Scientific Data Mining team of the Digital Technologies Research Centre, and Principal Investigator in the Biologics and Biomanufacturing Program, Target Selection and Prioritization. He obtained his Ph.D. in Plant Science – Bioinformatics from McGill University in 2009. His current research interests include genomics, transcriptomics and proteomics data mining for cancer drug discovery and crop improvement.

    Jinfeng Zou received the B.S. degree in Computer Science from Harbin Normal University, China, in 2000, the M.S. degree in Bioinformatics from Harbin Medical University, China, in 2009, and the Ph.D. degree in Biomedical Engineering from University of Electronic Science and Technology of China in 2012. She is currently a Research Associate Officer at National Research Council Canada. Her research interests include biomarker discovery for cancer recurrence, treatment response and early detection, and personalized medicine based on high-throughput omics data.

    André Nantel has been at the National Research Council of Canada since 1995 and is currently the Head of the Cellular and Molecular Pharmacology section within the Human Health Therapeutics Research Centre. Most of his research is in the field of functional genomics applied to cancer, fungal pathogenesis and protein production in CHO cells.

    Youlian Pan holds a Ph.D. in Biology and a Master of Computer Science, both from Dalhousie University, Halifax, Canada. He is currently a Senior Research Scientist in Data Mining at the National Research Council of Canada. His research interests are in data mining and machine learning with biological applications, specifically in high-throughput sequencing data, gene expression profiling and systems biology, with the objective of discovering and developing biomarkers related to human diseases and to crops’ stress tolerance in adverse environments.