A spectral clustering method for microarray data

doi:10.1016/j.csda.2004.04.010

Computational Statistics & Data Analysis

Volume 49, Issue 1, 15 April 2005, Pages 63-76

https://doi.org/10.1016/j.csda.2004.04.010

Abstract

This paper considers a clustering method motivated by a multivariate analysis of variance model and computationally based on eigenanalysis (thus the term “spectral” in the title). Our focus is on large problems, and we present the method in the context of clustering genes using microarray expression data. We provide an efficient computational algorithm and discuss its properties and interpretation in statistical and geometric terms. Leukemia and Melanoma data sets are analyzed to demonstrate the use of the method, and simulations are carried out to compare our method with two other clustering algorithms. We extend the method to enable supervision by either gene or array characteristics.

Introduction

Microarray technology is emerging as a powerful tool in molecular biology with far-reaching implications for clinical practice and research. While genes have traditionally been studied one at a time, the microarray approach, which allows the expression of thousands of genes to be investigated simultaneously, is still in its infancy. Biologists face the unprecedented and daunting task of interpreting experimental results that are inherently complex and very high-dimensional. Computational and statistical approaches are needed to reduce the dimension of the data and discern meaningful patterns. One commonly used dimension reduction technique is cluster analysis, in which the goal is to partition the data into a few groups such that observations within each cluster are as homogeneous as possible. Cluster analysis is exploratory in nature and can help generate focused hypotheses.

In this paper, we consider a clustering method motivated by a multivariate analysis of variance model and computationally based on eigenanalysis (thus the term “spectral” in the title). Our focus is on large problems, and we present the method in the context of clustering genes using microarray expression data. We provide a computational algorithm and discuss its properties and interpretation in statistical and geometric terms. Leukemia and Melanoma data sets are analyzed to demonstrate the use of the method, and simulations are carried out to compare our method with two other clustering algorithms. We extend the method to enable supervision by either gene or array characteristics.

Section snippets

Spectral clustering

We are studying N genes g₁,…,g_N whose expression relative to some baseline is measured at T times t₁,…,t_T, which correspond to replicates, cell lines or experimental conditions. The data is represented as the N×T matrix $X^{*} = (x_{ij}^{*})$, where the rows correspond to genes and the columns to the time points. We will work with a transformed matrix X formed by subtracting the row mean from each row of $X^{*}$, so that $\frac{1}{T} XX'$ is the covariance matrix for the set of genes and the transformed expression of each gene
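The transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names and the random 6×4 example matrix are assumptions made here. The last helper uses the standard fact that the leading eigenpair of the N×N matrix (1/T)XX′ can be read off the singular value decomposition of X, which avoids forming the N×N matrix when N is large; whether the paper's algorithm proceeds this way is not stated in the text above.

```python
import numpy as np

def center_rows(x_star):
    """Subtract each gene's (row's) mean expression from that row."""
    x_star = np.asarray(x_star, dtype=float)
    return x_star - x_star.mean(axis=1, keepdims=True)

def gene_covariance(x):
    """Return the N x N matrix (1/T) X X' for a row-centred N x T matrix X."""
    t = x.shape[1]
    return (x @ x.T) / t

def leading_eigenpair(x):
    """Leading eigenvalue and eigenvector of (1/T) X X' via the SVD of X.

    If X = U S V', then X X' = U S^2 U', so the eigenvalues of (1/T) X X'
    are s_k^2 / T and the eigenvectors are the columns of U.
    """
    t = x.shape[1]
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return s[0] ** 2 / t, u[:, 0]

# Hypothetical data: N = 6 genes measured at T = 4 time points.
rng = np.random.default_rng(0)
x_star = rng.normal(size=(6, 4))

x = center_rows(x_star)
cov = gene_covariance(x)
lam, v = leading_eigenpair(x)
```

For a real microarray data set with thousands of genes, only `center_rows` and `leading_eigenpair` would typically be needed, since the SVD of the N×T matrix is far cheaper than an eigendecomposition of the N×N covariance matrix.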

Predictive clusters

We may require our clusters to be associated with some outcome measure, in the sense that the mean expression of the clusters will be predictive of outcome. Such supervision by outcome can help to construct meaningful clusters, which may be useful in discrimination procedures. Suppose that an outcome y_j is associated with sample j, j=1,…,T. Let a bipartition of a set of n genes with expression matrix W predict tissue status, so that if sample j has an outcome greater than average the jth

Clustering supervised by a gene categorization

There may be a subset F of genes that we expect, on biological grounds, to vary together and thus belong to the same cluster. For example, suppose that we want genes in a specific functional category to tend to cluster together. We then use prior knowledge of gene function to supervise our clustering procedure. Define the vector f such that f_i=1 when the ith gene is in F, zero otherwise. Then we desire $v'(f - \bar{f}\mathbf{1})$ to be large, i.e. that genes in F tend to associate with positive elements of v.
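The quantity being made large can be computed directly. The sketch below is illustrative only: it reads f̄ as the mean of the elements of f, takes v to be the vector whose positive and negative elements define the two sides of a partition (as the passage suggests), and uses made-up values for both. The function name `category_alignment` is hypothetical.

```python
import numpy as np

def category_alignment(v, in_category):
    """Return v'(f - f_bar * 1), where f is the 0/1 indicator of the gene set F."""
    v = np.asarray(v, dtype=float)
    f = np.asarray(in_category, dtype=float)
    return float(v @ (f - f.mean()))

# Hypothetical partition vector for six genes and two candidate categories.
v = np.array([0.6, 0.5, 0.4, -0.3, -0.4, -0.8])
f = np.array([1, 1, 1, 0, 0, 0])   # F coincides with the positive side of v

score_aligned = category_alignment(v, f)       # large and positive
score_opposed = category_alignment(v, 1 - f)   # same magnitude, negative
```

A category concentrated on the positive elements of v gives a large positive score, while the complementary category gives the negative of that score, which is why the supervised procedure favours vectors v with a large value of this inner product.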

Examples

In this section we use two publicly available gene expression data sets to illustrate our methodology. The first data set is from Golub et al. (1999) and the second is from the study by Bittner et al. (2000).

Using oligonucleotide arrays, Golub et al. (1999) studied the gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). To demonstrate the use of our method, we analyzed a pre-processed subset of their data consisting of 3571

A simulation study

In this section we report results of simulation experiments carried out to compare our method with two commonly used clustering algorithms, hierarchical and k-means clustering. The performance of the clustering algorithms is assessed using the adjusted Rand statistic of Hubert and Arabie (1985). This statistic has a maximum value of 1, indicating perfect agreement, and an expected value of 0 when the two partitions agree only by chance.
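For readers unfamiliar with the statistic, the following is a small self-contained sketch of the Hubert and Arabie (1985) chance-corrected index computed from the contingency table of two labelings. It uses only the Python standard library; the function name and the toy labelings are assumptions for illustration and are not taken from the paper's simulations.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")

    # Pairs of items placed together in both partitions (contingency cells),
    # in the first partition (row sums) and in the second (column sums).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_rows * sum_cols / comb(n, 2)   # value expected by chance
    max_index = (sum_rows + sum_cols) / 2         # value under perfect agreement
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
same_up_to_relabeling = adjusted_rand_index(truth, [1, 1, 0, 0])
crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
```

The index depends only on which items are grouped together, so relabeling the clusters leaves it unchanged. A partition that agrees with the truth less often than chance would predict, such as the crossed example, yields a value below zero.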

Data for four clusters in eight dimensions were generated for 50 random realizations. We designed the simulation to conform to

Discussion

Other methods can be classed as spectral clustering methods for microarray data. Hastie et al. (2000) base their gene shaving method on the largest eigenvalue of a matrix closely related to the one we use. One class of spectral methods is based on the properties of eigenvectors of the Laplacian of an association matrix (Chung, 1997). Ding (2003) and Xing and Karp (2001) introduce Laplacian-related formulations for clustering microarray data. Ding et al. (2001) give a clustering algorithm and a

Acknowledgements

This work was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Network of Centres of Excellence (MITACS). We appreciate the helpful comments of Dr. Xiang Sun and Dr. Rafal Kustra. Sebastian Hirjoghe led the development of a JAVA implementation of the algorithm. We thank the reviewers for helpful suggestions that greatly improved the paper.

References (18)

Bittner, M., et al., 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature.
Chung, F.R.K., 1997. Spectral Graph Theory, CBMS Lecture Notes, AMS publication,...
Ding, C.H., 2003. Unsupervised feature selection via two-way ordering in gene expression analysis. Bioinformatics.
Ding, C., He, X., Zha, H., Gu, M., Simon, H., 2001. A min-max cut algorithm for graph partitioning and data clustering....
Fallah, S., 2004. Spectral Clustering Methods. Ph.D. Thesis, University of Toronto, in...
Gander, W., 1981. Least squares with a quadratic constraint. Numer. Math.
Gander, W., Golub, G.H., von Matt, U., 1989. A constrained eigenvalue problem. Linear Algebra Appl. 114/115,...
Golub, T.R., et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science.
Hastie, T., et al., 2000. Identifying distinct sets of genes with similar expression patterns via “Gene Shaving”. Genome Biol.

There are more references available in the full text version of this article.


