A spectral clustering method for microarray data

https://doi.org/10.1016/j.csda.2004.04.010

Abstract

This paper considers a clustering method motivated by a multivariate analysis of variance model and computationally based on eigenanalysis (thus the term “spectral” in the title). Our focus is on large problems, and we present the method in the context of clustering genes using microarray expression data. We provide an efficient computational algorithm and discuss its properties and interpretation in statistical and geometric terms. Leukemia and Melanoma data sets are analyzed to demonstrate the use of the method, and simulations are carried out to compare our method with two other clustering algorithms. We extend the method to enable supervision by either gene or array characteristics.

Introduction

Microarray technology is emerging as a powerful tool in molecular biology with far-reaching implications for clinical practice and research. While the traditional way of studying one gene at a time has been around for quite some time, the microarray approach, which allows the expression of thousands of genes to be investigated simultaneously, is still in its infancy. Biologists are faced with the unprecedented and daunting task of interpreting experimental results that are inherently complex and very high-dimensional. Computational and statistical approaches are needed to reduce the dimension of the data and discern meaningful patterns. One commonly used dimension reduction technique is cluster analysis, in which the goal is to partition the data into a few groups such that observations within clusters are as homogeneous as possible. Cluster analysis is exploratory in nature and can help generate focused hypotheses.

In this paper, we consider a clustering method motivated by a multivariate analysis of variance model and computationally based on eigenanalysis (thus the term “spectral” in the title). Our focus is on large problems, and we present the method in the context of clustering genes using microarray expression data. We provide a computational algorithm and discuss its properties and interpretation in statistical and geometric terms. Leukemia and Melanoma data sets are analyzed to demonstrate the use of the method, and simulations are carried out to compare our method with two other clustering algorithms. We extend the method to enable supervision by either gene or array characteristics.

Section snippets

Spectral clustering

We are studying N genes g_1,…,g_N whose expression relative to some baseline is measured at T times t_1,…,t_T, which correspond to replicates, cell lines or experimental conditions. The data are represented as the N×T matrix X=(x_ij), where the rows correspond to genes and the columns to the time points. We will work with a transformed matrix, again denoted X, formed by subtracting the row mean from each row of X, so that (1/T)XX′ is the covariance matrix for the set of genes and the transformed expression of each gene
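The row-centering and eigenanalysis described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the snippet is truncated before the method is stated, so the split rule used here (grouping genes by the sign of their entry in the leading eigenvector of (1/T)XX′) is an assumption, as are the function names and the toy data. Power iteration is used so that the N×N matrix is never formed.

```python
import math
import random


def center_rows(x):
    """Subtract each row's mean from that row (one row per gene)."""
    centered = []
    for row in x:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


def leading_eigenvector(x, iterations=200, seed=0):
    """Power iteration for the leading eigenvector of (1/T) X X'.

    Each step computes X (X' v) / T, so only the N x T matrix is stored.
    """
    rng = random.Random(seed)
    n, t = len(x), len(x[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # u = X' v has length T; w = X u / T has length N.
        u = [sum(x[i][j] * v[i] for i in range(n)) for j in range(t)]
        w = [sum(x[i][j] * u[j] for j in range(t)) / t for i in range(n)]
        norm = math.sqrt(sum(value * value for value in w))
        if norm == 0.0:
            break
        v = [value / norm for value in w]
    return v


def sign_bipartition(v):
    """Assumed split rule: group genes by the sign of their entry in v."""
    return [1 if value >= 0 else 0 for value in v]


# Hypothetical toy data: genes 0 and 1 rise over time, genes 2 and 3 fall.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.5],
    [4.0, 3.0, 2.0, 1.0],
    [5.0, 4.1, 3.0, 2.0],
]
labels = sign_bipartition(leading_eigenvector(center_rows(expression)))
print(labels)
```

The two rising genes receive one label and the two falling genes the other; which group is labelled 1 depends on the arbitrary sign of the eigenvector.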

Predictive clusters

We may require our clusters to be associated with some outcome measure, in the sense that the mean expression of the clusters will be predictive of outcome. Such supervision by outcome can help to construct meaningful clusters, which may be useful in discrimination procedures. Suppose that an outcome y_j is associated with sample j, j=1,…,T. Let a bipartition of a set of n genes with expression matrix W predict tissue status, so that if sample j has an outcome greater than average the jth

Clustering supervised by a gene categorization

There may be a subset F of genes that we expect, on biological grounds, to vary together and thus belong to the same cluster. For example, suppose that we want genes in a specific functional category to tend to cluster together. We then use prior knowledge of gene function to supervise our clustering procedure. Define the vector f such that f_i=1 when the ith gene is in F, zero otherwise. Then we desire v′(f − f̄1) to be large, i.e. that genes in F tend to associate with positive elements of v.
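As a small illustration of the quantity above, the following sketch reads the expression as v′(f − f̄1), with f̄ taken to be the mean of the elements of f and 1 the vector of ones. That reading is an assumption about the original typesetting, and the function name and example vector are hypothetical.

```python
def supervision_score(v, in_category):
    """Return v'(f - f_bar * 1), where f is the 0/1 indicator of membership in F.

    Assumes f_bar is the mean of f, so that f - f_bar * 1 is the centered indicator.
    """
    f = [1.0 if flag else 0.0 for flag in in_category]
    f_bar = sum(f) / len(f)
    return sum(v_i * (f_i - f_bar) for v_i, f_i in zip(v, f))


# Hypothetical partition vector: positive entries for genes 0 and 1.
v = [0.6, 0.5, -0.4, -0.7]
aligned = supervision_score(v, [True, True, False, False])
misaligned = supervision_score(v, [False, False, True, True])
print(aligned, misaligned)
```

The score is positive when the genes in F sit on the positive elements of v and negative when they sit on the negative elements, which matches the stated aim of making the quantity large.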

Examples

In this section we use two publicly available gene expression data sets to illustrate our methodology. The first data set is from Golub et al. (1999) and the second data is from the study by Bittner et al. (2000).

Using oligonucleotide arrays, Golub et al. (1999) studied the gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). To demonstrate the use of our method, we analyzed a pre-processed subset of their data consisting of 3571

A simulation study

In this section we report results of simulation experiments carried out to compare our method with two commonly used clustering algorithms, hierarchical and k-means clustering. The performance of the clustering algorithms is assessed using the adjusted Rand statistic of Hubert and Arabie (1985). This statistic can range from 0 to 1, with 1 being perfect agreement.
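For readers unfamiliar with the agreement measure, here is a minimal sketch of the adjusted Rand statistic computed from two label vectors using the standard pair-counting formula. The function name and example labels are illustrative and are not taken from the paper's simulations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand statistic for two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    # Pairs of items placed together by both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together by each partition separately (row and column totals).
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]  # same partition, cluster names swapped
noisy = [0, 0, 1, 1, 1, 1]       # one item moved to the other cluster

perfect = adjusted_rand_index(truth, relabelled)
partial = adjusted_rand_index(truth, noisy)
print(perfect, partial)
```

The statistic depends only on which items are grouped together, so swapping cluster names leaves it at 1, while moving a single item lowers it.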

Data for four clusters in eight dimensions were generated for 50 random realizations. We designed the simulation to conform to

Discussion

Other methods can be classed as spectral clustering methods for microarray data. Hastie et al. (2000) base their gene shaving method on the largest eigenvalue of a matrix closely related to the one we use. One class of spectral methods is based on the properties of eigenvectors of the Laplacian of an association matrix (Chung, 1997). Ding (2003) and Xing and Karp (2001) introduce Laplacian-related formulations for clustering microarray data. Ding et al. (2001) give a clustering algorithm and a

Acknowledgements

This work was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Network of Centres of Excellence (MITACS). We appreciate the helpful comments of Dr. Xiang Sun and Dr. Rafal Kustra. Sebastian Hirjoghe led the development of a JAVA implementation of the algorithm. We thank the reviewers for helpful suggestions that greatly improved the paper.

References (18)

  • Bittner, M., et al., 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature.
  • Chung, F.R.K., 1997. Spectral Graph Theory, CBMS Lecture Notes, AMS publication,...
  • Ding, C.H., 2003. Unsupervised feature selection via two-way ordering in gene expression analysis. Bioinformatics.
  • Ding, C., He, X., Zha, H., Gu, M., Simon, H., 2001. A min-max cut algorithm for graph partitioning and data clustering....
  • Fallah, S., 2004. Spectral Clustering Methods. Ph.D. Thesis, University of Toronto, in...
  • Gander, W., 1981. Least squares with a quadratic constraint. Numer. Math.
  • Gander, W., Golub, G.H., von Matt, U., 1989. A constrained eigenvalue problem. Linear Algebra Appl. 114/115,...
  • Golub, T.R., et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science.
  • Hastie, T., et al., 2000. Identifying distinct sets of genes with similar expression patterns via “Gene Shaving”. Genome Biol.
There are more references available in the full text version of this article.

Cited by (14)

  • Functional grouping of similar genes using eigenanalysis on minimum spanning tree based neighborhood graph

    2016, Computers in Biology and Medicine
    Citation Excerpt :

    Small changes in these parameters may lead to different clustering structure and thus choosing the suitable values of these parameters demands domain expertise [15,17]. Tritchler et al. [26] presented a spectral clustering algorithm for microarray data using recursive bi-section of the constrained co-variance matrix along the principal direction. The added constraints make the partitions of the co-variance matrix with high values in the diagonal blocks and low off-diagonal elements.

  • Model-based clustering of high-dimensional data: A review

    2014, Computational Statistics and Data Analysis
    Citation Excerpt :

    Nowadays, the measured observations in many scientific domains are frequently high-dimensional and clustering such data is a challenging problem (Tran et al., 2006; von Borries and Wang, 2009; Tritchler et al., 2005), particularly for model-based methods.

  • A fuzzy logic-based clustering algorithm for network optimisation

    2016, International Journal of Systems, Control and Communications
  • Geometry of data and biology

    2015, Notices of the American Mathematical Society