Software note
GeneMCL in microarray analysis
Introduction
Cluster analysis is a technique whereby large volumes of data can be condensed into smaller numbers of meaningful groups, providing a concise description of the patterns of similarity and difference present in the dataset. It has become increasingly important in the biological sciences as a means of handling ever-growing gene expression datasets. An automated clustering procedure that yields reliable clusterings provides a high-level view of a dataset, which may in turn yield insights into the functioning of individual elements and the relationships within groups of elements. Central to all cluster analysis techniques is the quantification of proximity, where proximity is measured between pairs of elements in the data. Gene expression profiles can be described as real vectors whose elements are the different measurements of the expression of a single gene over a range of samples or conditions. In gene expression studies, genes that share a similar expression pattern over samples or conditions are also likely to share a functional or regulatory mechanism.
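As a hedged illustration of proximity between two expression profiles, the sketch below computes the Pearson correlation coefficient, which is one common choice and the measure this note goes on to use in GeneMCL. The function name `pearson` and the toy profiles are assumptions made for illustration; they are not taken from the authors' software.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    # Assumes neither profile is constant; a constant profile gives a zero norm.
    norm = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return covariance / norm


# Toy profiles over four samples (made-up values, for illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 7.8]   # rises with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)  # close to +1
r_ac = pearson(gene_a, gene_c)  # close to -1
```

A value near +1 marks two genes as close in this sense, while a value near -1 marks them as anti-correlated; how negative values are treated is a separate design decision.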
Large-scale gene expression studies using microarrays have generated a wide range of large numerical datasets, which have been used extensively to test and develop an equally diverse range of clustering methodologies (e.g. Lattimore and Crabbe, 2003). Approaches for their analysis fall into two broad categories of cluster analysis, namely supervised and unsupervised. Supervised clustering algorithms are given information about the expected cluster structure, usually by defining the number of clusters or by specifying reference vectors to be used as input to the classification method. In unsupervised clustering approaches no such information is provided.
Examples of supervised algorithms are k-means clustering and Self Organising Maps, both of which require the number of clusters to be predefined. Most often, it is impossible to give a reasonable estimate of the expected number of clusters, and frequently finding that number is the actual purpose of the cluster analysis. Hierarchical clustering approaches construct a tree over the dataset in which, at the outermost level, all elements are leaves attached to separate branches. Branches are then successively merged until all of them are joined into a single large trunk. Most often, each merge event is determined by the highest proximity score available between the current set of branches. The tree can be built in either direction: by merging until all n elements are grouped into one large cluster (agglomerative), or by splitting in the opposite direction until there are n clusters, each containing one of the original elements (divisive). Discrete clusterings are obtained by cutting the tree at a certain proximity level; each branch that is cut defines a cluster consisting of all leaves attached to that branch.
A well-known characteristic of linkage-based methods is the phenomenon of chaining, caused by the fact that each pairwise similarity is treated as significant across the whole dataset. Chaining often has the negative effect that clusters grow larger than one would ideally expect, and different parts of the dataset will often succumb to chaining at different merge levels.
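The merge-and-cut procedure and the chaining effect described above can be sketched as follows. This is a minimal illustration assuming single linkage (cluster proximity taken as the highest pairwise similarity between members); the function name `single_linkage`, the toy similarity matrix, and the cut levels are all hypothetical.

```python
from itertools import combinations


def single_linkage(similarity, cut):
    """Agglomerative merging; stops once no pair of clusters reaches `cut`."""
    clusters = [{i} for i in range(len(similarity))]  # every element starts as a leaf

    def link(a, b):
        # Single linkage: proximity of the most similar pair across two clusters.
        return max(similarity[i][j] for i in a for j in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(a, b) < cut:
            break  # equivalent to cutting the tree at this proximity level
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return sorted(sorted(c) for c in clusters)


# Toy similarities: 0-1 and 1-2 are strongly similar, but 0-2 is not.
similarity = [
    [1.0, 0.9, 0.2, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.2, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]

loose = single_linkage(similarity, cut=0.5)   # [[0, 1, 2], [3, 4]]
strict = single_linkage(similarity, cut=0.85) # [[0, 1, 2], [3], [4]]
```

Elements 0 and 2 end up in the same cluster at both cut levels even though their own similarity is only 0.2, because element 1 links them. That is chaining in miniature.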
The clustering problem is elusive in nature. Different applications often bring about varying paradigms, and different domains or dataset types bring forth data objects with differing structural characteristics. This implies that the clustering strategy for a given application must be specifically tailored to the purpose and intent of the expected cluster structure. One must have a clear idea of the question the clustering algorithm is required to answer, and use the algorithm with the most appropriate properties, whilst seeking to exploit an underlying biological mechanism. To this end, we have adapted the Markov Cluster algorithm (MCL) for the clustering of large-scale gene expression data, as GeneMCL, and compared it to the existing Adaptive quality-based cluster algorithm. In GeneMCL two transformations are applied to microarray data, resulting in a graph that is fed to the MCL algorithm. The transformations are: (a) application of the well-known Pearson correlation coefficient and (b) a local non-linear rescaling called pre-inflation.
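This excerpt does not give the exact form of pre-inflation. The sketch below assumes that it mirrors the inflation operator of MCL applied once to the input graph: each node's edge weights are raised to a power (the non-linear part) and then rescaled to sum to one for that node (the local part). The function name `pre_inflate`, the dict-of-dicts graph layout, and the example weights are assumptions, not the authors' implementation.

```python
def pre_inflate(weights, power):
    """Raise each node's edge weights to `power`, then rescale them per node.

    `weights` maps a gene to {neighbour: non-negative correlation weight}.
    This is an assumed form of pre-inflation, not a confirmed definition.
    """
    rescaled = {}
    for gene, neighbours in weights.items():
        raised = {other: w ** power for other, w in neighbours.items()}
        total = sum(raised.values())
        rescaled[gene] = (
            {other: w / total for other, w in raised.items()} if total else {}
        )
    return rescaled


# One gene with one strong and two weak correlations (made-up values).
weights = {"g1": {"g2": 0.9, "g3": 0.3, "g4": 0.3}}

plain = pre_inflate(weights, power=1.0)  # g2 receives roughly 0.60 of the weight
sharp = pre_inflate(weights, power=3.0)  # g2 receives roughly 0.93 of the weight
```

With a power of 1 the step reduces to plain normalisation; larger powers shift weight towards the strongest correlations of each gene and suppress the weak ones.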
The MCL algorithm has previously been employed in bioinformatics, for example in TribeMCL (Enright et al., 2002). This application of MCL relies first on representing the data as a weighted graph, and then on using the MCL algorithm to detect cluster structure.
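For readers unfamiliar with MCL, the following is a minimal sketch of its usual formulation: a column-stochastic matrix built from the weighted graph is alternately squared (expansion) and raised element-wise to a power with column rescaling (inflation) until cluster structure emerges. The inflation value, iteration count, unit self-loops, and the threshold used to read off clusters are assumptions chosen for this toy example, and the code is not the MCL or GeneMCL software itself.

```python
def normalise_columns(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    n = len(adjacency)
    # Add unit self-loops (a common convention), then normalise the columns.
    m = normalise_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(iterations):
        m = matmul(m, m)  # expansion: flow spreads along longer walks
        m = normalise_columns(
            [[v ** inflation for v in row] for row in m]
        )  # inflation: strong flow is boosted, weak flow decays
    # Rows that keep non-negligible mass act as attractors; the columns they
    # cover are read off as one cluster.
    found = {frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in m}
    return sorted(sorted(c) for c in found if c)


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by one weak edge 2-3.
adjacency = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]

clusters = mcl(adjacency)  # the two triangles separate
```

No number of clusters is passed in; the granularity of the result is governed by the inflation parameter instead.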
Dataset
Van’t Veer et al. (2002) utilised cDNA microarray technology to correlate gene expression profiles with the clinical outcome of breast cancer. A total of 5 μg of total RNA was isolated from each sample and used in two hybridisations for each tumour using a fluorescent dye reversal technique on microarrays containing approximately 25,000 human genes, synthesised by ink-jet technology.
The resultant dataset characterized the expression of a total of 117 patients, including 98 primary breast
Gene-wise clustering
The potential of the GeneMCL algorithm is demonstrated on this dataset, using Pearson's correlation coefficient to construct the transition probabilities that weight the connections between nodes (genes) in the graph.
This pre-inflation value was determined to be the key to successful clustering of gene expression data. This is because of the nature of gene expression studies, whereby there are large numbers of low-level correlations that add noise to the resulting graphs, and go
Discussion
There was a clear pattern within the clusters, such that redundant genes clustered together, which is reassuring given that their expression was being measured multiple times on the array, and that given the noise tolerance of the algorithm, they should always cluster together (unless there was a variation in the quality of the respective microarray spots). Despite the relatively poor functional annotation of human genes, there was an obvious tendency for genes involved in a similar cellular
References
- et al. Expression profiles in the progression of ductal carcinoma in the breast. Comput. Biol. Chem. (2003)
1 They contributed equally to the manuscript.