Traveling on discrete embeddings of gene expression
Introduction
Technologies such as gene expression microarrays and RNA-seq provide scientists with a way to measure the expression levels of thousands of genes simultaneously. Computational approaches are increasingly needed to manage this amount of data, and are effectively helping researchers to unravel the complexity of biological systems. Examples of computational problems related to the analysis of a gene expression matrix (a matrix containing the expression level of different genes under different experimental conditions) are classification of samples [1], [2], [3], [4], clustering of genes or pathological subtypes [5], [6], and selection of differentially expressed or discriminative genes [7].
Beyond sophisticated methods of quantitative analysis, high-throughput experiments have also brought the need for visualization, careful validation, and, more generally, a deeper understanding of the phenomenon under investigation. For these reasons, interpretable models are required. In this context, generative models (in particular, topic models and latent process models [8]) have been shown to provide highly interpretable solutions, besides achieving high accuracy in classification tasks [9], [10]. Within this literature, topic models have been either designed ad hoc for gene expression analysis [11], [12], or imported from Natural Language Processing by postulating an analogy between textual documents and microarray samples [10], [13]. In the latter case, the starting point is to see a gene expression profile (i.e. a sample) as a “bag of words” vector [14]. For a text document, this is a numerical vector in which each entry counts how many times a given “word” of a pre-defined dictionary occurs in the document. A gene expression profile can be read in the same way, with genes playing the role of words, since each entry measures the intensity of expression of a gene (which indirectly reflects the amount of mRNA transcripts). This analogy makes it possible to apply topic models in this context [10], [13]; by introducing the concept of “topic”, they model co-occurrence (or co-expression) patterns within the data. Topics are latent distributions that assign high probability to co-occurring “words”, and act as intermediate descriptors of samples (in the gene expression case, they can be associated with biological processes, as shown in [10], [13]).
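To make the analogy concrete, the following minimal sketch turns a small expression matrix into “bag of words” count vectors. The function name `expression_to_bags`, the shift-and-round discretization, and the `scale` parameter are illustrative assumptions, not the preprocessing used in the cited works.

```python
import numpy as np

def expression_to_bags(expr, scale=10.0):
    """Map a (samples x genes) expression matrix to non-negative integer counts.

    Assumption: intensities are shifted to be non-negative, multiplied by a
    hypothetical `scale` factor, and rounded, so that each entry can be read
    as the number of times a gene "word" occurs in a sample "document".
    """
    expr = np.asarray(expr, dtype=float)
    shifted = expr - expr.min()                    # make every value non-negative
    return np.rint(shifted * scale).astype(int)    # round to integer counts

# Toy matrix: 2 samples (rows) x 3 genes (columns); values are made up.
expr = np.array([[0.0, 1.5, 3.0],
                 [2.0, 0.5, 1.0]])
bags = expression_to_bags(expr, scale=2.0)
```

Each row of `bags` can then be passed to any model that expects word counts; the choice of `scale` controls how finely the intensities are discretized.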
However, a common assumption of most topic models is that the topics act independently of each other. While this assumption is often needed to simplify computation and inference, it may be too simplistic in the gene expression scenario, where biological processes are known to be tightly co-regulated and interdependent in complex ways. In this paper we take a step forward along this research line: we follow the topic model philosophy while addressing the limitation described above, and present a novel strategy to extract an informative representation for a set of experimental samples through a recent generative model called the Counting Grid (CG) [15]. The Counting Grid is a probabilistic model for objects represented as “bags of words” that was recently introduced for text mining [15] and image processing [16]. The idea behind the model is that the topics are arranged on a discrete grid, learned in such a way that “similar” topics are placed close to each other. Similar biological samples, i.e. those sharing topics and active genes, are mapped to nearby locations on the grid, allowing for an intuitive visualization of the data set. More specifically, the CG appears well suited to the gene expression scenario for the following reasons:
- The CG provides a powerful representation, which captures the evolution of patterns across experiments and can be clearly visualized.
- The CG is well suited for data that exhibit smooth variation between samples. Expression values are biologically constrained to lie within certain bounds by purifying selection [17], and variation in only a few expression values can cause a pathology. This specific property of the data is captured well by the model.
- The CG offers a principled and well-founded way to extract the most relevant genes associated with a disease [18].
- Last but not least, it is possible to achieve better classification accuracy than other topic model approaches, as well as the recent state of the art.
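The grid arrangement described above can be sketched in a few lines. The sketch assumes the general Counting Grid formulation: each grid location holds a distribution over genes, a sample is explained by the average of these distributions inside a window, and the grid wraps around at its borders. The function names, the toy sizes, and the random parameters are hypothetical, and no learning step is shown.

```python
import numpy as np

def window_histograms(pi, window):
    """Average the per-location distributions inside a window (with wrap-around).

    pi: array of shape (E1, E2, Z); pi[i, j] is a distribution over Z genes.
    window: (W1, W2) window size. h[i, j] averages pi over the window whose
    top-left corner is (i, j); the grid is assumed to be toroidal.
    """
    W1, W2 = window
    h = np.zeros_like(pi)
    for di in range(W1):
        for dj in range(W2):
            # A negative roll moves pi[i + di, j + dj] to position (i, j).
            h += np.roll(pi, shift=(-di, -dj), axis=(0, 1))
    return h / (W1 * W2)

def map_to_grid(counts, h, eps=1e-12):
    """Place each bag at the grid location with the highest multinomial log-likelihood."""
    log_h = np.log(h + eps)                               # (E1, E2, Z)
    loglik = np.einsum("tz,ijz->tij", counts, log_h)      # (T, E1, E2)
    flat = loglik.reshape(len(counts), -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, h.shape[:2]), axis=1)

# Toy example: a 4x4 grid over 5 genes with a 2x2 window (all values random).
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(5), size=(4, 4))               # (4, 4, 5)
h = window_histograms(pi, window=(2, 2))
counts = rng.multinomial(200, h[1, 2], size=3)            # 3 bags drawn from one window
locations = map_to_grid(counts, h)                        # (3, 2) grid coordinates
```

Because neighbouring windows overlap, neighbouring locations share most of their distributions, which is what lets similar samples land close to each other on the grid.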
In this paper, we comprehensively evaluate the CG model for mining and modeling gene expression data; we start from the preliminary findings which appeared in the literature [18], [19], but we thoroughly evaluate the capabilities of the model with respect to the following novel aspects:
1. By visualizing different data sets, we show that samples belonging to different biological conditions (such as different types of cancer) cluster together on the grid, supporting this claim with a numerical validation (Section 4.1).
2. We systematically test the accuracy of the CG model in both a gene selection and a classification setting, experimenting on 7 different benchmark datasets and obtaining results comparable with the recent state of the art.
3. We show that the model is able to highlight genes that are involved in the pathology or in the phenomenon that motivated the experiment; moreover, the selected genes have a beneficial effect when used for classification, quantitatively comparable with other gene selection techniques.
4. We evaluate the sensitivity of the model to parameters such as grid and window size, and the robustness of the model to overfitting.
Section snippets
The Counting Grid model
In machine learning research, a data point is often represented as a “bag of words”: the representation is obtained by counting how many times each “word” (i.e. constituent feature) occurs in the object. This paradigm can represent many types of objects in a vector space, even ones that are non-vectorial in nature. One drawback, however, is that in some domains and applications it destroys the structure the objects may have. A clear example can be found in the Natural Language Processing domain
An illustrative example: mining yeast expression
To illustrate the main features of the proposed framework, we present a simple example based on a dataset by DeRisi et al. [21], which measures the expression of 6400 genes in Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration. Expression values were measured at 7 different time points. From our point of view, each time point is a bag. As done in other applications, we performed a filtering of the genes,
Experimental evaluation
The merits of the proposed framework have been extensively tested on a wide range of tasks, from both a quantitative and a qualitative perspective. In the following, we first show that the model is able to properly embed the samples in separate parts of the grid, where different areas reflect different sample classes/conditions; this shows that the framework captures well the differences in gene expression related to different classes. Then, we extract the most relevant genes with the
Conclusions
This paper proposed an approach based on the Counting Grid model for the analysis of gene expression samples. We have shown with different experiments that the proposed framework can be successfully exploited to (i) meaningfully visualize the samples; (ii) detect medically relevant genes, and (iii) properly classify samples, thus representing a valid alternative to classical gene expression analysis strategies. The proposed approach also finds a very promising application in analyzing
References (69)
- et al. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal (2005)
- et al. An ensemble of SVM classifiers based on gene pairs. Comput Biol Med (2013)
- et al. Techniques for clustering gene expression data. Comput Biol Med (2008)
- et al. Evolutionary significance of gene expression divergence. Gene (2005)
- et al. The forkhead transcription factor Hcm1 promotes mitochondrial biogenesis and stress resistance in yeast. J Biol Chem (2010)
- et al. Functions and mechanisms of action of CCN matricellular proteins. Int J Biochem Cell Biol (2009)
- et al. Activating transcription factor 3, a stress-inducible gene, suppresses ras-stimulated tumorigenesis. J Biol Chem (2006)
- et al. Stress effects on FosB- and interleukin-8 (IL8)-driven ovarian cancer growth and metastasis. J Biol Chem (2010)
- et al. Matrilysin (matrix metalloproteinase-7): a new promising drug target in cancer and inflammation? Cytokine Growth Factor Rev (2004)
- et al. Glyceraldehyde-3-phosphate dehydrogenase gene expression in human breast cancer. Eur J Cancer (2000)