Reprint of “Abstraction for data integration: Fusing mammalian molecular, cellular and phenotype big datasets for better knowledge extraction”

doi:10.1016/j.compbiolchem.2015.08.005

Computational Biology and Chemistry

Volume 59, Part B, December 2015, Pages 123-138

https://doi.org/10.1016/j.compbiolchem.2015.08.005

Highlights

• A small fraction of biomedical Big Data is converted to useful knowledge or reused.
• Overview of a collection of structured mostly molecular mammalian biomedical Big Data resources.
• Biases within data from these resources are suspected.
• Data abstraction to attribute tables, networks and gene-sets enables reuse of biomedical datasets for integrative analyses.
• Once data is abstracted it can be integrated and analyzed using supervised, unsupervised and integrative methods.

Abstract

With advances in genomics, transcriptomics, metabolomics and proteomics, and more expansive electronic clinical record monitoring, as well as advances in computation, we have entered the Big Data era in biomedical research. Data gathering is growing rapidly, while only a small fraction of this data is converted to useful knowledge or reused in future studies. To improve this, an important concept that is often overlooked is data abstraction. To fuse and reuse biomedical datasets from diverse resources, data abstraction is frequently required. Here we summarize some of the major Big Data biomedical research resources for genomics, proteomics and phenotype data, collected from mammalian cells, tissues and organisms. We then suggest simple data abstraction methods for fusing this diverse but related data. Finally, we demonstrate examples of the potential utility of such data integration efforts, while warning about the inherent biases that exist within such data.

Introduction

Big Data need not be defined by sheer size, i.e., gigabytes, terabytes, or petabytes of data, but rather by the fact that almost all the variables of a complex system can be measured over time and under different conditions (Mayer-Schönberger and Cukier, 2013). Computational biology tools and databases are rapidly emerging that attempt to organize and integrate molecular and phenotype data, with the ultimate goal of making predictions by performing virtual experiments. Data integration enables imputing missing values from existing data and identifying unexpected relationships between variables. This is done mostly through correlation analyses such as unsupervised clustering, learn-to-rank methods such as enrichment analyses, network reconstruction methods, and supervised machine learning algorithms, which are used to make predictions for unseen instances. Integrating x-omics data, a.k.a. the integrome, is not as difficult as it may seem because most diverse datasets and resources represent their data in a relatively structured format with common fields such as cells, genes, proteins, drugs, diseases, and assays. Such diverse but structured data can be converted into attribute tables, bi-partite graphs, single-node-type networks, hierarchies and set libraries. These data structures provide different views of the same data and are useful for different data integration purposes. Combining two or more datasets that share common entities, such as genes/proteins, cells, small-molecules/drugs, tissues/tumors/patients, or diseases/phenotypes/side-effects, can lead to new insights. Here we summarize some of the most relevant resources for x-omics data integration to better extract knowledge from Big Data. We then define the data structures that can be used to combine such resources, and briefly review the primary methods that can be used to operate on the combined data for knowledge discovery, while providing a few examples applied to real data.
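As a minimal sketch of how two abstracted datasets might be combined through shared entities, the Python example below joins two small gene-by-attribute tables on their common gene symbols. The cell line names, phenotype terms, values and the helper `join_on_shared_entities` are hypothetical and chosen only for illustration; they are not taken from any resource discussed here.

```python
# Two toy attribute tables keyed by gene symbol (all values are hypothetical).
expression = {           # gene -> {cell line -> expression level}
    "TP53": {"A549": 2.1, "MCF7": 0.4},
    "EGFR": {"A549": 5.3, "MCF7": 1.2},
    "MYC":  {"A549": 3.0, "MCF7": 4.8},
}
phenotypes = {           # gene -> {phenotype term -> 1 if associated}
    "TP53": {"abnormal apoptosis": 1},
    "EGFR": {"abnormal skin morphology": 1},
    "KRAS": {"increased tumor incidence": 1},
}


def join_on_shared_entities(table_a, table_b):
    """Merge the attribute dictionaries of entities present in both tables."""
    shared = sorted(set(table_a) & set(table_b))
    return {entity: {**table_a[entity], **table_b[entity]} for entity in shared}


merged = join_on_shared_entities(expression, phenotypes)
for gene, attributes in merged.items():
    print(gene, attributes)
```

Only genes present in both tables survive the join, which is one simple design choice; keeping unmatched entities with missing values would be an equally reasonable alternative when imputation is planned.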
While we recognize that system-level data, and the methods to integrate and analyze such data, were typically first developed for model organisms such as yeast, worm, fly and zebrafish, the focus of this review is on data collected from mammalian systems, as well as on the databases and computational tools applied to data from mammalian cells, tissues and organisms. Finally, we discuss the concept and implications of the different biases that may exist across the diverse datasets we describe. In the next section we list major relevant emergent Big Data resources in computational systems biology.

Mouse Genome Informatics Mammalian Phenotype Ontology (MGI-MPO)

The Mammalian Phenotype Ontology (Smith et al., 2004), initially developed by the Mouse Genome Informatics group at the Jackson Labs (Blake et al., 2014) and expanded to an international initiative called KOMP (Austin et al., 2004), is a useful resource for connecting gene knockouts in mice to phenotypes. The MGI-MPO ontology is a controlled vocabulary of mouse phenotype terms that are related to each other in a hierarchical network, where at each branch-point a term is linked to a set of more

Attribute tables and bi-partite graphs

An attribute table is the most common and raw form for organizing high-content experimental data. Computationally, attribute tables can be represented as a matrix that defines the relationships between entities of two different classes (Fig. 1A) (Balakrishnan and Ranganathan, 2012). The row labels correspond to entities of one class and the column labels correspond to entities of the other class. Typically, the rows are the variables of a system, and measuring their level captures some aspect
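The matrix view of an attribute table can be sketched in a few lines of Python. The example below stores a toy table as a matrix with row and column labels, then derives a bi-partite edge list and a set library from it. The labels, values, the cutoff `THRESHOLD` and both helper functions are assumptions made for illustration; thresholding is only one possible way to binarize a table.

```python
# Toy attribute table: rows are entities of one class, columns of the other.
row_labels = ["GENE_A", "GENE_B", "GENE_C"]   # hypothetical gene names
col_labels = ["tissue_1", "tissue_2"]         # hypothetical tissue names
matrix = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.6],
]
THRESHOLD = 0.5  # assumed cutoff for calling a row/column pair "connected"


def to_bipartite_edges(matrix, row_labels, col_labels, threshold):
    """Return (row entity, column entity) pairs whose value meets the cutoff."""
    edges = []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value >= threshold:
                edges.append((row_labels[i], col_labels[j]))
    return edges


def to_set_library(edges):
    """Group row entities into one set per column entity."""
    library = {}
    for row_entity, col_entity in edges:
        library.setdefault(col_entity, set()).add(row_entity)
    return library


edges = to_bipartite_edges(matrix, row_labels, col_labels, THRESHOLD)
library = to_set_library(edges)
print(edges)
print(library)
```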

Unsupervised clustering

In the previous section we discussed how the information content of many open online resources can be converted into the simple data structures of attribute tables, bi-partite graphs, networks and set libraries. By organizing all this data into these formats, the task of data integration becomes straightforward. Entities in attribute tables, bi-partite graphs, networks and set libraries can be clustered based on entity similarity. Clustering is an unsupervised machine learning task for which
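To make clustering by entity similarity concrete, the sketch below groups the terms of a toy set library by the Jaccard similarity of their gene sets, using single-linkage clustering cut at a fixed similarity. The set names, gene names and the `min_similarity` value are hypothetical, and this particular similarity measure and linkage rule are illustrative choices rather than the method used in the article.

```python
from itertools import combinations

# Toy set library: term -> set of member genes (all names are hypothetical).
gene_sets = {
    "term_1": {"G1", "G2", "G3"},
    "term_2": {"G2", "G3", "G4"},
    "term_3": {"G7", "G8"},
    "term_4": {"G8", "G9"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_linkage_clusters(sets, min_similarity):
    """Merge terms linked by any pair with similarity >= min_similarity."""
    parent = {name: name for name in sets}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(sets, 2):
        if jaccard(sets[a], sets[b]) >= min_similarity:
            parent[find(a)] = find(b)

    clusters = {}
    for name in sets:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


clusters = single_linkage_clusters(gene_sets, min_similarity=0.3)
print(clusters)
```

The same pattern applies to rows of an attribute table or nodes of a network once a similarity measure between entities has been chosen.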

Conclusions

One of the most important aspects of integrating data for converting Big Data to knowledge is dealing with identifiers. Different resources use IDs to represent entities whose meanings are the same or, in some cases, only partially overlapping. One example is gene IDs and protein IDs. ID mapping is a critical aspect of the data integration process, which we ignored in our discussions. Another important aspect of data integration is data normalization. Because data is collected by different
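The two concerns raised here, ID mapping and normalization, can be illustrated with a short Python sketch. It re-keys protein-level measurements by gene symbol using a lookup table and then z-scores the result so that values from different sources are on a comparable scale. The identifiers, the lookup table `id_map`, the averaging rule for many-to-one mappings and the use of z-scores are all assumptions made for illustration; real resources need curated mapping files and a normalization suited to the assay.

```python
from statistics import mean, pstdev

# Hypothetical protein-to-gene lookup; two proteins map to the same gene.
id_map = {"P_0001": "GENE_A", "P_0002": "GENE_B", "P_0003": "GENE_A"}
# Hypothetical measurements keyed by protein ID; P_9999 has no mapping.
protein_levels = {"P_0001": 10.0, "P_0002": 40.0, "P_0003": 30.0, "P_9999": 5.0}


def map_ids(values, id_map):
    """Re-key values by target ID, averaging many-to-one hits, dropping unmapped IDs."""
    grouped = {}
    for source_id, value in values.items():
        target = id_map.get(source_id)
        if target is not None:
            grouped.setdefault(target, []).append(value)
    return {target: mean(vals) for target, vals in grouped.items()}


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return {key: 0.0 for key in values}
    return {key: (value - mu) / sigma for key, value in values.items()}


gene_levels = map_ids(protein_levels, id_map)
normalized = z_score(gene_levels)
print(gene_levels)
print(normalized)
```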

Acknowledgements

Funding: This work was supported in part by grants from the NIH: U54HL127624, U54CA189201, R01GM098316 and T32HL007824.

References (140)

A. Basu. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell (2013)
M. Ciofani. A validated regulatory network for Th17 cell specification. Cell (2012)
Y. Gilad et al. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet. (2008)
X. He. Sherlock: detecting gene-disease associations by matching patterns of expression QTL and GWAS. Am. J. Hum. Genet. (2013)
A.K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. (2010)
R. Karnik et al. Browsing (Epi) genomes: a guide to data resources and epigenome browsers for stem cell researchers. Cell Stem Cell (2013)
H.S. Kim. Systematic identification of molecular subtype-selective vulnerabilities in non-small-cell lung cancer. Cell (2013)
A. Malovannaya. Analysis of the human endogenous coregulator complexome. Cell (2011)
J. Amberger. McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. (2009)
J. Amberger et al. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM(R)). Hum. Mutat. (2011)
N. Atias et al. An algorithmic framework for predicting side effects of drugs. J. Comput. Biol. (2011)
C.P. Austin. The knockout mouse project. Nat. Genet. (2004)
G.D. Bader et al. Pathguide: a pathway resource list. Nucleic Acids Res. (2006)
R. Balakrishnan et al.
S. Bandyopadhyay et al. Unsupervised Classification (2013)
J. Barretina. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature (2012)
T. Barrett. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. (2013)
A. Bate et al. Quantitative signal detection using spontaneous ADR reporting. Pharmacoepidemiol. Drug Saf. (2009)
K.G. Becker. The genetic association database. Nat. Genet. (2004)
S. Berger et al. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinform. (2007)
B.E. Bernstein. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. (2010)
C.M. Bishop (2006)
J.A. Blake. The Mouse Genome Database: integration of and access to knowledge about the laboratory mouse. Nucleic Acids Res. (2014)
J.S. Boehm et al. Towards systematic functional characterization of cancer genomes. Nat. Rev. Genet. (2011)
L. Breiman. Random forests. Mach. Learn. (2001)
L.O. Bryzgalov. Detection of regulatory SNPs in human genome using ChIP-seq ENCODE data. PLoS One (2013)
M. Campillos