Traveling on discrete embeddings of gene expression

https://doi.org/10.1016/j.artmed.2016.05.002

Highlights

  • We comprehensively assess the counting grid (CG) model for gene expression analysis.

  • Similar expression patterns are visualized and arranged close on a 2D discrete grid.

  • The model makes it possible to identify the genes that distinguish between classes.

  • Effective signatures can be extracted for clustering and classification of samples.

Abstract

Objective

High-throughput technologies have generated an unprecedented amount of high-dimensional gene expression data. Algorithmic approaches could be extremely useful to distill information and derive compact interpretable representations of the statistical patterns present in the data. This paper proposes a mining approach to extract an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG).

Method

Using the CG model, gene expression values are arranged on a discrete grid, learned in such a way that “similar” co-expression patterns are placed in close proximity, thus resulting in an intuitive visualization of the dataset. Moreover, the model makes it possible to identify the genes that distinguish between classes (e.g. different types of cancer). Finally, each sample can be characterized with a discriminative signature, extracted from the model, that can be effectively employed for classification.

Results

A thorough evaluation on several gene expression datasets demonstrates the suitability of the proposed approach from a twofold perspective: numerically, we reached state-of-the-art classification accuracies on 5 datasets out of 7, and similar results when the approach is tested in a gene selection setting (with a stability always above 0.87); clinically, by confirming that many of the genes highlighted by the model as significant also play a key role in cancer biology.

Conclusion

The proposed framework can be successfully exploited to meaningfully visualize the samples, detect medically relevant genes, and properly classify samples.

Introduction

Technologies such as gene expression microarrays and RNA-seq provide scientists with a way to measure the expression levels of thousands of genes simultaneously. Computational approaches are increasingly needed to manage this amount of data, and are effectively helping researchers to unravel the complexity of biological systems. Examples of computational problems related to the analysis of a gene expression matrix (a matrix containing the expression level of different genes under different experimental conditions) are classification of samples [1], [2], [3], [4], clustering of genes or pathological subtypes [5], [6], and selection of differentially expressed or discriminative genes [7].

Beyond sophisticated methods of quantitative analysis, high-throughput experiments have also brought the need for visualization, thoughtful validation, and, more generally, a deeper understanding of the phenomenon under investigation. For these reasons, interpretable models are required. In this context, generative models (in particular, topic models and latent process models [8]) have been shown to provide highly interpretable solutions, in addition to achieving high accuracy in classification tasks [9], [10]. Within this literature, topic models have been either designed ad hoc for gene expression analysis [11], [12], or imported from Natural Language Processing by postulating an analogy between textual documents and microarray samples [10], [13]. In the latter case, the starting point is the “bag of words” vector [14]: a numerical vector in which each entry counts how many times a “word” of a pre-defined dictionary occurs in the considered document. A gene expression profile (i.e. a sample) can be seen in the same way, with genes playing the role of words, since each entry measures the intensity of expression of a gene (which indirectly reflects the amount of mRNA transcripts). This analogy makes it possible to apply topic models in this context [10], [13]; by introducing the concept of “topic”, they model co-occurrence (or co-expression) patterns within the data. Topics are latent distributions that assign high probability to co-occurring “words”, and act as intermediate descriptors of samples (in the gene expression case, they can be associated with biological processes, as shown in [10], [13]).
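The analogy between a document and an expression profile can be made concrete with a minimal sketch. The function names, the gene identifiers, and the choice to clip negative values and round to integers are illustrative assumptions, not the preprocessing used in the paper.

```python
from typing import Dict, List


def expression_to_bag(profile: Dict[str, float], scale: float = 1.0) -> Dict[str, int]:
    """Map one sample's expression values to non-negative integer "word counts".

    Assumption: negative values are clipped to zero and intensities are
    rounded after scaling; the paper may use a different preprocessing.
    """
    return {gene: int(round(max(value, 0.0) * scale)) for gene, value in profile.items()}


def bag_to_vector(bag: Dict[str, int], dictionary: List[str]) -> List[int]:
    """Order the counts according to a fixed gene dictionary (missing genes count 0)."""
    return [bag.get(gene, 0) for gene in dictionary]


# Hypothetical sample: genes are the "words", intensities become counts.
sample = {"GENE_A": 12.4, "GENE_B": 0.3, "GENE_C": -1.0}
dictionary = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
vector = bag_to_vector(expression_to_bag(sample), dictionary)
print(vector)  # [12, 0, 0, 0]
```

Each sample then has the same form as a document's word-count vector, so a model designed for text can, in principle, be applied to it unchanged.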

However, most topic models assume that the topics act independently of each other. While this assumption is often needed to simplify computations and inference, it may be too simplistic in the gene expression scenario, where biological processes are known to be tightly co-regulated and interdependent in complex ways. In this paper we take a step forward along this research line, following the topic model philosophy while addressing the limitation just described: we present a novel strategy to extract an informative representation for a set of experimental samples through a recent generative model called the Counting Grid (CG) [15]. The Counting Grid is a probabilistic model for objects represented as “bags of words” that was recently introduced for text mining [15] and image processing [16]. The idea behind the model is that the topics are arranged on a discrete grid, learned in such a way that “similar” topics are placed close to each other. Similar biological samples, i.e. samples sharing topics and active genes, are mapped close together on the grid, allowing for an intuitive visualization of the data set. More specifically, the CG is well suited to the gene expression scenario for the following reasons:

  • The CG provides a powerful representation, which can capture the evolution of patterns across experiments and can be clearly visualized.

  • The CG is well suited for data that exhibit smooth variation between samples. Expression values are biologically constrained to lie within certain bounds by purifying selection [17] and variation in only a few expression values can cause a pathology. This specific property of the data is captured well by the model.

  • The CG offers a principled way to extract the most relevant genes that are associated with a disease [18].

  • Last, but not least, it is possible to achieve a better classification accuracy with respect to other topic model approaches, as well as to the recent state of the art.
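The idea of topics laid out on a grid, with nearby positions being similar, can be illustrated with a small sketch. It assumes the usual counting grid formulation as we understand it from [15]: each grid cell holds a distribution over genes, and the distribution associated with a grid position is the average of the cells inside a window anchored there, with the grid wrapping around at its borders. The function name and the toy values are hypothetical, and learning the grid itself is not shown.

```python
from typing import List

Grid = List[List[List[float]]]  # grid[row][col] is a distribution over genes


def window_distribution(grid: Grid, top: int, left: int, win_h: int, win_w: int) -> List[float]:
    """Average the cell distributions inside a win_h x win_w window.

    Assumption: the grid is toroidal, so windows wrap around the borders.
    """
    rows, cols = len(grid), len(grid[0])
    n_genes = len(grid[0][0])
    total = [0.0] * n_genes
    for dr in range(win_h):
        for dc in range(win_w):
            cell = grid[(top + dr) % rows][(left + dc) % cols]
            for z in range(n_genes):
                total[z] += cell[z]
    area = win_h * win_w
    return [value / area for value in total]


# Toy 1 x 4 grid over two genes; each cell sums to 1.
grid = [[[1.0, 0.0], [0.75, 0.25], [0.25, 0.75], [0.0, 1.0]]]

print(window_distribution(grid, 0, 0, 1, 2))  # [0.875, 0.125]
print(window_distribution(grid, 0, 1, 1, 2))  # [0.5, 0.5]
print(window_distribution(grid, 0, 2, 1, 2))  # [0.125, 0.875]
```

Because adjacent windows share cells, the resulting distributions change gradually from one position to the next, which is what makes the grid suitable for data that vary smoothly between samples.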

In this paper, we comprehensively evaluate the CG model for mining and modeling gene expression data; we start from the preliminary findings which appeared in the literature [18], [19], but we thoroughly evaluate the capabilities of the model with respect to the following novel aspects:

  1. By visualizing different data sets, we show that samples belonging to different biological conditions (such as different types of cancer) cluster together on the grid, supporting this claim with a numerical validation (Section 4.1).

  2. We systematically test the accuracy of the CG model in both a gene selection and a classification setting, experimenting on 7 different benchmark datasets and obtaining results comparable with the recent state of the art.

  3. We show that the model is able to highlight genes that are involved in the pathology or in the phenomenon which motivated the experiment; moreover, the selected genes have a beneficial effect when used for classification, quantitatively comparable with other gene selection techniques.

  4. We evaluate the sensitivity of the model to parameters such as grid and window size, and the robustness of the model to overfitting.

Section snippets

The Counting Grid model

In machine learning research, a data point is often represented as a “bag of words”: the representation is obtained by counting how many times each “word” (i.e. constituent feature) occurs in the object. This paradigm can represent many types of objects in a vector space, even ones that are non-vectorial in nature. However, one drawback is that in some domains and applications it discards whatever structure the objects may have. A clear example can be found in the Natural Language Processing domain

An illustrative example: mining yeast expression

To illustrate the main features of the proposed framework we present a simple example, where we studied a dataset by DeRisi et al. [21], measuring the gene expression of 6400 genes in Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration. Expression values have been measured at 7 different time points. From our point of view, each time point is a bag $s^t = \{g_z^t\}$, $z = 1, \ldots, 6400$. As done in other applications, we performed a filtering of the genes,

Experimental evaluation

The proposed framework has been extensively tested on a wide range of tasks, from both a quantitative and a qualitative perspective. In the following, we first show that the model is able to properly embed the samples on separate parts of the grid, where different areas reflect different sample classes or conditions; this shows that the framework captures well the differences in gene expression related to different classes. Then, we extract the most relevant genes with the

Conclusions

This paper proposed an approach based on the Counting Grid model for the analysis of gene expression samples. We have shown with different experiments that the proposed framework can be successfully exploited to (i) meaningfully visualize the samples; (ii) detect medically relevant genes, and (iii) properly classify samples, thus representing a valid alternative to classical gene expression analysis strategies. The proposed approach also finds a very promising application in analyzing

References (69)

  • J. Kim et al. Association between phosphorylated AMP-activated protein kinase and MAPK3/1 expression and prognosis for patients with gastric cancer. Oncology (2013)

  • Y.-J. Zhang et al. Silencing of HINT1, a novel tumor suppressor gene, by promoter hypermethylation in hepatocellular carcinoma. Cancer Lett (2009)

  • D. Singh et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002)

  • H. Liu et al. Ensemble gene selection by grouping for microarray data classification. J Biomed Inform (2010)

  • V. Bolón-Canedo et al. An ensemble of filters and classifiers for microarray data classification. Pattern Recogn (2012)

  • A. Statnikov et al. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics (2005)

  • M.-Y. Wu et al. Biomarker identification and cancer classification based on microarray data using Laplace naive Bayes model with mean shrinkage. IEEE/ACM Trans Comput Biol Bioinform (2012)

  • M. de Souto et al. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics (2008)

  • C. Lazar et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinform (2012)

  • T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Mach Learning (2001)

  • M. Fasoli et al. The grapevine expression atlas reveals a deep transcriptome shift driving the entire plant into a maturation program. Plant Cell Online (2012)

  • M. Bicego et al. Investigating topic models’ capabilities in expression microarray data classification. IEEE/ACM Trans Comput Biol Bioinform (2012)

  • S. Rogers et al. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Trans Comput Biol Bioinform (2005)

  • A. Perina et al. Biologically-aware latent Dirichlet allocation (BALDA) for the classification of expression microarray

  • M. Bicego et al. Biclustering of expression microarray data with topic models

  • T. Joachims. Text categorization with support vector machines: learning with many relevant features

  • N. Jojic et al. Multidimensional counting grids: Inferring word order from disordered bags of words. Uncertainty in artificial intelligence (2011)

  • A. Perina et al. Image analysis by counting on a grid

  • P. Lovato et al. Feature selection using counting grids: application to microarray data

  • A. Perina et al. Expression microarray data classification using counting grids and Fisher kernel

  • B.J. Frey et al. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans Pattern Anal Mach Intell (2005)

  • J.L. DeRisi et al. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science (1997)

  • T. Rossignol et al. Genome-wide monitoring of wine yeast gene expression during alcoholic fermentation. Yeast (2003)

  • M. Ashburner et al. Gene ontology: tool for the unification of biology. Nat Genet (2000)