Software note
GeneMCL in microarray analysis

https://doi.org/10.1016/j.compbiolchem.2005.07.002

Abstract

Accurately and reliably identifying the actual number of clusters present within a dataset of gene expression profiles, when no additional information on cluster structure is available, is a problem addressed by few algorithms. GeneMCL transforms microarray analysis data into a graph consisting of nodes connected by edges, where the nodes represent genes and the edges represent the similarity in expression of those genes, as given by a proximity measurement. This measurement is taken to be the Pearson correlation coefficient combined with a local non-linear rescaling step. The resulting graph is input to the Markov Cluster (MCL) algorithm, an elegant, deterministic, non-specific and scalable method that models stochastic flow through the graph. The algorithm is inherently affected by any cluster structure present, and rapidly decomposes a graph into cohesive clusters. The potential of the GeneMCL algorithm is demonstrated with a 5730 gene subset (IGS) of the Van’t Veer breast cancer database, for which the clusterings are shown to reflect underlying biological mechanisms.

Introduction

Cluster analysis is a technique whereby large volumes of data can be condensed into smaller numbers of meaningful groups, providing a concise description of the patterns of similarity and difference present in the dataset. It has become increasingly important in the biological sciences as a means of handling ever-growing gene expression datasets. An automated clustering procedure that yields reliable clusterings provides a high-level view of a dataset that may in turn yield insights into the functioning of individual elements and the relationships within groups of elements. Central to all cluster analysis techniques is the quantification of proximity, where proximity is measured between pairs of elements in the data. Gene expression profiles can be described as real vectors whose elements are the different measurements of the expression of a single gene over a range of samples or conditions. In gene expression studies, genes that show similar expression behaviour across samples or conditions are also likely to share a functional or regulatory mechanism.
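As an illustration of proximity between two expression profiles, the following minimal sketch computes the Pearson correlation coefficient, the measure named in the abstract. The gene names and values are hypothetical, and treating a flat profile as having correlation 0 is an assumption made here for convenience, not something stated in the article.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        # Assumption: a constant profile has no defined correlation; return 0.
        return 0.0
    return cov / norm

# Hypothetical profiles: one gene per entry, one measurement per sample.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same shape as geneA, different scale
    "geneC": [4.0, 3.0, 2.0, 1.0],  # mirror image of geneA
}

r_ab = pearson(profiles["geneA"], profiles["geneB"])
r_ac = pearson(profiles["geneA"], profiles["geneC"])
```

Because the coefficient is insensitive to scale and offset, geneA and geneB count as highly similar even though their absolute expression levels differ, while geneC is anti-correlated with both.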

Large-scale gene expression studies, or microarrays, have generated a wide range of large numerical datasets, which have been used extensively to test and develop an equally diverse range of clustering methodologies (e.g. Lattimore and Crabbe, 2003). Approaches for their analysis fall into two broad categories of cluster analysis, namely supervised and unsupervised. Supervised clustering algorithms are given information about the expected cluster structure, usually by defining the number of clusters or by specifying reference vectors to be used as input to the classification method. In unsupervised clustering approaches no such information is provided.

Examples of supervised algorithms are k-means clustering and Self Organising Maps, both of which require the number of clusters to be predefined. Most often it is impossible to give a reasonable estimate of the expected number of clusters, and frequently finding that number is the actual purpose of the cluster analysis. Hierarchical clustering approaches construct a tree over the dataset in which, at the outermost level, all elements are leaves attached to separate branches. Branches are then successively merged until all of them are joined into a single large trunk. Most often each merge event is characterized by the next highest proximity score available between the current set of branches. The method can run in either direction: all n elements are successively grouped into one large cluster (agglomerative), or one large cluster is successively split into n clusters, each containing one of the original elements (divisive). Discrete clusterings are obtained by cutting the tree at a certain proximity level; each branch that is cut is identified as a cluster by taking all leaves attached to the branch.

A well known characteristic of linkage based methods is the phenomenon of chaining, caused by the fact that pairwise similarities have significance global to the dataset. Chaining often has the negative impact that clusters grow larger than one would ideally expect them to be, and different parts of the dataset will often succumb to chaining at different merge levels.
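Chaining can be seen on a toy example. The sketch below assumes single linkage (the article does not name a specific linkage rule) and uses hypothetical one-dimensional points; cutting a single-linkage tree at a given distance is equivalent to taking the connected components of all pairs no further apart than that distance, which is what the code does.

```python
def single_linkage_clusters(points, cut_distance):
    """Clusters obtained by cutting a single-linkage tree at cut_distance.

    Two points are linked if they are at most cut_distance apart; clusters
    are the connected components of those links (found with union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= cut_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(points[i])
    return sorted(clusters.values())

# Hypothetical data: five evenly spaced points plus one distant point.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
clusters = single_linkage_clusters(points, cut_distance=1.0)
```

With a cut distance of 1.0, the points 0.0 and 4.0 end up in the same cluster although they are four units apart, because each neighbouring pair along the chain is close enough to link.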

The clustering problem is elusive in nature. Different applications often bring about varying paradigms, and different domains or dataset types bring forth data-objects with differing structural characteristics. This implies that the clustering strategy for a given application must be specifically tailored to the purpose and intent of the expected cluster structure. One must have a clear idea of the question the clustering algorithm is required to answer, and use the algorithm with the most appropriate properties, whilst seeking to exploit an underlying biological mechanism. To this end, we have adapted the Markov Cluster algorithm (MCL) for the clustering of large-scale gene expression data, calling the result GeneMCL, and compared it to the existing Adaptive quality-based cluster algorithm. In GeneMCL two transformations are applied to microarray analysis data, resulting in a graph that is fed to the MCL algorithm. The transformations are: (a) application of the well-known Pearson correlation coefficient and (b) a local non-linear rescaling called pre-inflation.
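The effect of a non-linear rescaling of this kind can be sketched as follows. This excerpt does not give the exact formula or exponent used by GeneMCL, so the version below is an assumption: each weight is raised to a power and each column is then rescaled to sum to 1, with negative correlations treated as absent edges. The function name `pre_inflate` and the correlation values are hypothetical.

```python
def pre_inflate(similarity, power):
    """Raise each edge weight to `power`, then rescale each column to sum to 1.

    Assumption: negative correlations are treated as "no edge" (weight 0).
    """
    n = len(similarity)
    raised = [
        [max(similarity[i][j], 0.0) ** power for j in range(n)]
        for i in range(n)
    ]
    result = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(raised[i][j] for i in range(n))
        for i in range(n):
            result[i][j] = raised[i][j] / col_sum if col_sum > 0 else 0.0
    return result

# Hypothetical Pearson correlations between three genes.
corr = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

plain = pre_inflate(corr, power=1.0)  # column rescaling only
sharp = pre_inflate(corr, power=4.0)  # strong edges favoured over weak ones

# Ratio of the strong edge (gene 1) to the weak edge (gene 2) in column 0.
ratio_plain = plain[1][0] / plain[2][0]
ratio_sharp = sharp[1][0] / sharp[2][0]
```

Raising weights to a power greater than 1 widens the gap between strong and weak correlations within each column, so low-level correlations contribute less to the graph passed on to the clustering step.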

The MCL algorithm has previously been employed in the field of bioinformatics, for example in TribeMCL (Enright et al., 2002). This application of MCL relies first on the representation of the data as a weighted graph, before using the MCL algorithm to detect cluster structure.
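For readers unfamiliar with MCL, a minimal sketch of its core loop on a small weighted graph is given below. It alternates expansion (squaring the column-stochastic matrix) with inflation (raising entries to a power and renormalizing columns). It is not the GeneMCL or TribeMCL implementation: the added self-loops, the inflation value of 2.0, the convergence tolerance, the threshold used to read off clusters, and the toy graph are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (a column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / s if s > 0 else 0.0
    return out

def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def inflate(m, r):
    """Inflation: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric weighted graph given as an adjacency matrix."""
    n = len(weights)
    # Assumption: add a self-loop of weight 1 to every node before normalizing.
    m = [
        [weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        diff = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if diff < tol:
            break
    # Each row that still carries flow is an attractor; the columns it
    # reaches form one cluster. Identical member sets are merged.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by a weak edge 2-3.
W = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[a][b] = W[b][a] = w

clusters = mcl(W)
```

On this toy graph the flow across the weak bridge dies out under repeated inflation, so the two triangles separate into two clusters without the number of clusters being specified in advance.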

Dataset

Van’t Veer et al. (2002) utilised cDNA microarray technology to correlate gene expression profiles with the clinical outcome of breast cancer. A total of 5 μg of total RNA was isolated from each sample and used in two hybridisations for each tumour using a fluorescent dye reversal technique on microarrays containing approximately 25,000 human genes, synthesised by ink-jet technology.

The resultant dataset characterized the expression of a total of 117 patients, including 98 primary breast

Gene-wise clustering

The potential of the GeneMCL algorithm is demonstrated with this dataset, using Pearson's correlation coefficient to construct the transition probabilities that weight the connections between nodes (genes) in the graph.

This pre-inflation value was determined to be the key to successful clustering of gene expression data. This is because of the nature of gene expression studies, whereby there are large numbers of low-level correlations that add noise to the resulting graphs, and go

Discussion

There was a clear pattern within the clusters, such that redundant genes clustered together, which is reassuring given that their expression was being measured multiple times on the array, and that given the noise tolerance of the algorithm, they should always cluster together (unless there was a variation in the quality of the respective microarray spots). Despite the relatively poor functional annotation of human genes, there was an obvious tendency for genes involved in a similar cellular

1 They contributed equally to the manuscript.
