Review article
Reprint of “Abstraction for data integration: Fusing mammalian molecular, cellular and phenotype big datasets for better knowledge extraction”

https://doi.org/10.1016/j.compbiolchem.2015.08.005

Highlights

  • A small fraction of biomedical Big Data is converted to useful knowledge or reused.

  • Overview of a collection of structured, mostly molecular, mammalian biomedical Big Data resources.

  • Biases within data from these resources are suspected.

  • Data abstraction to attribute tables, networks and gene-sets enables reuse of biomedical datasets for integrative analyses.

  • Once data is abstracted, it can be integrated and analyzed using supervised, unsupervised and integrative methods.

Abstract

With advances in genomics, transcriptomics, metabolomics and proteomics, more expansive electronic clinical record monitoring, and advances in computation, we have entered the Big Data era in biomedical research. Data gathering is growing rapidly, yet only a small fraction of this data is converted to useful knowledge or reused in future studies. An important concept for improving this, and one that is often overlooked, is data abstraction: fusing and reusing biomedical datasets from diverse resources frequently requires it. Here we summarize some of the major Big Data biomedical research resources for genomics, proteomics and phenotype data, collected from mammalian cells, tissues and organisms. We then suggest simple data abstraction methods for fusing this diverse but related data. Finally, we demonstrate examples of the potential utility of such data integration efforts, while warning about the inherent biases that exist within such data.

Introduction

Big Data need not be defined by sheer size, i.e., gigabytes, terabytes, or petabytes of data, but rather by the fact that almost all the variables of a complex system can be measured over time and under different conditions (Mayer-Schönberger and Cukier, 2013). Computational biology tools and databases are rapidly emerging in an attempt to organize and integrate molecular and phenotype data, with the ultimate goal of making predictions by performing virtual experiments. Data integration enables imputing missing values from existing data and identifying unexpected relationships between variables. This is done mostly through correlation analyses such as unsupervised clustering, learn-to-rank methods such as enrichment analyses, network reconstruction methods, and supervised machine learning algorithms, which are used to make predictions for unseen instances. Integrating x-omics data, a.k.a. the integrome, is not as difficult as it may seem, because most diverse datasets and resources represent their data in a relatively structured format with common fields such as cells, genes, proteins, drugs, diseases, and assays. Such diverse but structured data can be converted into attribute tables, bi-partite graphs, single-node-type networks, hierarchies and set libraries. These data structures provide different views of the same data and are useful for different data integration purposes. Combining two or more datasets that share common entities, such as genes/proteins, cells, small-molecules/drugs, tissues/tumors/patients, or diseases/phenotypes/side-effects, can lead to new insights. Here we summarize some of the most relevant resources for x-omics data integration for better extracting knowledge from Big Data. We then define the data structures that can be used to combine such resources, and briefly review the primary methods that can be used to operate on the combined data for knowledge discovery, while providing a few examples applied to real data.
While we recognize that system-level data, and the methods to integrate and analyze such data, were typically first developed for model organisms such as yeast, worm, fly and zebrafish, the focus of this review is on data collected from mammalian systems, as well as databases and computational tools applied to data from mammalian cells, tissues and organisms. Finally, we discuss the concept and implications of the different biases that may exist across the diverse datasets we describe. In the next section we list major relevant emerging Big Data resources in computational systems biology.

Mouse Genome Informatics Mammalian Phenotype Ontology (MGI-MPO)

The Mammalian Phenotype Ontology (Smith et al., 2004), initially developed by the Mouse Genome Informatics group at the Jackson Laboratory (Blake et al., 2014) and expanded to an international initiative called KOMP (Austin et al., 2004), is a useful resource for connecting gene knockouts in mice to phenotypes. The MGI-MPO ontology is a controlled vocabulary of mouse phenotype terms that are related to each other in a hierarchical network, where at each branch-point a term is linked to a set of more

Attribute tables and bi-partite graphs

An attribute table is the most common and raw form for organizing high-content experimental data. Computationally, attribute tables can be represented as a matrix that defines the relationships between entities of two different classes (Fig. 1A) (Balakrishnan and Ranganathan, 2012). The row labels correspond to entities of one class and the column labels correspond to entities of the other class. Typically, the rows are the variables of a system, and measuring their level captures some aspect
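
The relationship between an attribute table, a bi-partite graph and a set library can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the gene and cell-line names, the values, the threshold of 1.0, and the choice of a simple cutoff to decide which table cells become edges. It is not the procedure used in the article.

```python
# Hypothetical toy attribute table: rows are genes, columns are cell lines.
# All names and values are made up for illustration.
genes = ["GENE_A", "GENE_B", "GENE_C"]
cell_lines = ["CELL_1", "CELL_2"]
table = [
    [2.1, 0.0],  # GENE_A
    [0.3, 1.8],  # GENE_B
    [1.5, 1.6],  # GENE_C
]


def to_bipartite_edges(rows, cols, matrix, threshold):
    """Keep a (row, column) edge wherever the value exceeds the threshold."""
    return [
        (rows[i], cols[j])
        for i in range(len(rows))
        for j in range(len(cols))
        if matrix[i][j] > threshold
    ]


def to_set_library(edges):
    """Group row entities by column entity: one gene set per cell line."""
    library = {}
    for row_entity, col_entity in edges:
        library.setdefault(col_entity, set()).add(row_entity)
    return library


# The cutoff is an assumed, arbitrary choice for this sketch.
edges = to_bipartite_edges(genes, cell_lines, table, threshold=1.0)
library = to_set_library(edges)
print(edges)
print(library)
```

Thresholding discards the quantitative values, so the cutoff would need to be chosen with the assay and its normalization in mind.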

Unsupervised clustering

In the previous section we discussed how the information content of many open online resources can be converted into the simple data structures of attribute tables, bi-partite graphs, networks and set libraries. By organizing all this data into these formats, the task of data integration becomes straightforward. Entities in attribute tables, bi-partite graphs, networks and set libraries can be clustered based on entity similarity. Clustering is an unsupervised machine learning task for which
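
Clustering entities by similarity can be sketched in a few lines of Python. The example below is one simple possibility and not a method taken from the article: it measures similarity between gene sets with the Jaccard index and groups terms by threshold-based single linkage. The term names, gene names and the 0.5 cutoff are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: shared members divided by total distinct members."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_similarity(library, min_similarity):
    """Single-linkage grouping: terms whose sets are similar enough,
    directly or through a chain of similar terms, share a cluster."""
    names = sorted(library)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the cluster.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(library[first], library[second]) >= min_similarity:
                parent[find(second)] = find(first)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical set library: each term maps to a set of gene symbols.
toy_library = {
    "TERM_1": {"GENE_A", "GENE_B", "GENE_C"},
    "TERM_2": {"GENE_A", "GENE_B", "GENE_D"},
    "TERM_3": {"GENE_X", "GENE_Y"},
}
clusters = cluster_by_similarity(toy_library, min_similarity=0.5)
print(clusters)
```

For real datasets, an established library implementation of hierarchical or k-means clustering would normally be preferable to this hand-rolled grouping.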

Conclusions

One of the most important aspects of integrating data for converting Big Data to knowledge is dealing with identifiers. Different resources use IDs to represent entities whose meanings are the same or, in some cases, only partially overlapping. One example is gene IDs and protein IDs. ID mapping is a critical aspect of the data integration process which we ignored in our discussions. Another important aspect for data integration is data normalization. Because data is collected by different
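
To make the ID mapping point concrete, here is a minimal Python sketch of harmonizing identifiers before two datasets are combined. The protein IDs, gene symbols and mapping table are hypothetical placeholders, not identifiers from any real resource, and the helper `harmonize` is introduced only for this illustration.

```python
# Hypothetical mapping from one resource's protein IDs to gene symbols.
protein_to_gene = {
    "PROT_001": "GENE_A",
    "PROT_002": "GENE_B",
    "PROT_003": "GENE_B",  # two protein IDs may map to the same gene
}


def harmonize(ids, mapping):
    """Translate IDs to the shared namespace; collect IDs with no mapping."""
    mapped, unmapped = set(), []
    for identifier in ids:
        if identifier in mapping:
            mapped.add(mapping[identifier])
        else:
            unmapped.append(identifier)
    return mapped, unmapped


# Made-up hit lists from two datasets that use different ID systems.
proteomics_hits = ["PROT_001", "PROT_003", "PROT_999"]
expression_hits = {"GENE_B", "GENE_C"}

mapped, unmapped = harmonize(proteomics_hits, protein_to_gene)
shared = mapped & expression_hits
print(shared, unmapped)
```

Reporting the unmapped IDs, rather than silently dropping them, makes it visible how much data is lost at the mapping step.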

Acknowledgements

Funding: This work was supported in part by grants from the NIH: U54HL127624, U54CA189201, R01GM098316 and T32HL007824.

References (140)

  • N. Atias et al. An algorithmic framework for predicting side effects of drugs. J. Comput. Biol. (2011)
  • C.P. Austin. The knockout mouse project. Nat. Genet. (2004)
  • G.D. Bader et al. Pathguide: a pathway resource list. Nucleic Acids Res. (2006)
  • R. Balakrishnan et al.
  • S. Bandyopadhyay et al. Unsupervised Classification (2013)
  • J. Barretina. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature (2012)
  • T. Barrett. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. (2013)
  • A. Bate et al. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidemiol. Drug Saf. (2009)
  • K.G. Becker. The genetic association database. Nat. Genet. (2004)
  • S. Berger et al. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinform. (2007)
  • B.E. Bernstein. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. (2010)
  • C.M. Bishop (2006)
  • J.A. Blake. The Mouse Genome Database: integration of and access to knowledge about the laboratory mouse. Nucleic Acids Res. (2014)
  • J.S. Boehm et al. Towards systematic functional characterization of cancer genomes. Nat. Rev. Genet. (2011)
  • L. Breiman. Random forests. Mach. Learn. (2001)
  • L.O. Bryzgalov. Detection of regulatory SNPs in human genome using ChIP-seq ENCODE data. PLoS One (2013)
  • M. Campillos. Drug target identification using side-effect similarity. Science (2008)
  • Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. (2013)
  • Cancer Genome Atlas Research Network. Integrated genomic characterization of endometrial carcinoma. Nature (2013)
  • Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature (2013)
  • Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. (2013)
  • Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature (2008)
  • Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature (2011)
  • Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature (2012)
  • Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature (2012)
  • Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature (2012)
  • L.H. Chadwick. The NIH roadmap epigenomics program data resource. Epigenomics (2012)
  • A. Chatr-Aryamontri. The BioGRID interaction database: 2013 update. Nucleic Acids Res. (2013)
  • E.Y. Chen. Expression2Kinases: mRNA profiling linked to multiple upstream regulatory layers. Bioinform. (2012)
  • E.Y. Chen. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. (2013)
  • H.W. Cheung. Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer. Proc. Natl. Acad. Sci. U. S. A. (2011)
  • H. Choi. Analysis of protein complexes through model-based biclustering of label-free quantitative AP-MS data. Mol. Syst. Biol. (2010)
  • N.R. Clark et al. Introduction to statistical methods to analyze large data sets: principal components analysis. Sci. Signal. (2011)
  • N.R. Clark. Sets2Networks: network inference from repeated observations of sets. BMC Syst. Biol. (2012)
  • N.R. Clark. The characteristic direction: a geometrical approach to identify differentially expressed genes. BMC Bioinform. (2014)
  • ENCODE Project Consortium. The ENCODE (ENCyclopedia of DNA elements) project. Science (2004)
  • ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. (2011)
  • GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat. Genet. (2013)
  • C. Cortes et al. Support-vector networks. Mach. Learn. (1995)
  • D. Croft. The Reactome pathway knowledgebase. Nucleic Acids Res. (2014)
    A publishers’ error resulted in this article appearing in the wrong issue. The article is reprinted here for the reader's convenience and for the continuity of the special issue. For citation purposes, please use the original publication details “Computational Biology and Chemistry” 58 (2015) 104–119.