Brief communication
A Poisson-based adaptive affinity propagation clustering for SAGE data

https://doi.org/10.1016/j.compbiolchem.2009.11.001

Abstract

Serial analysis of gene expression (SAGE) is a powerful tool for obtaining gene expression profiles, and clustering analysis is a valuable technique for analyzing SAGE data. In this paper, we propose an adaptive clustering method for SAGE data analysis, namely PoissonAPS. The method incorporates a novel clustering algorithm, Affinity Propagation (AP). While the AP algorithm has demonstrated good performance on many different data sets, it also has several limitations. PoissonAPS overcomes these limitations by using a clustering validation measure as the cost function for merging and splitting clusters; as a result, it can cluster SAGE data automatically, without user-specified parameters. We evaluated PoissonAPS and compared its performance with other methods on several real-life SAGE datasets. The experimental results show that PoissonAPS can produce meaningful and interpretable clusters for SAGE data.

Introduction

Serial analysis of gene expression (SAGE) is a powerful tool for obtaining global gene expression profiles at the mRNA level. The technique was introduced by Velculescu et al. (1995). SAGE has been used to study the transcriptome of a variety of tissue and cell types from a diverse set of organisms (Zuyderduyn, 2007). Unlike microarray techniques, SAGE detects unknown transcripts without requiring prior knowledge of what is present in the sample under analysis (Wang, 2007), and it can provide a statistical description of the mRNA population present in a cell. Wang (2007) reviewed the features of SAGE data, including the specificity of SAGE tags with respect to their original transcripts, the quantitative nature of SAGE data for differentially expressed genes, reproducibility, the comparability of SAGE with microarrays, and the future potential of SAGE.

A number of methods have been developed to analyze the large amounts of data that SAGE produces. A SAGE data set can be viewed as an n × d matrix, where n is the number of tags and d is the number of SAGE libraries; each row represents a SAGE tag and each column represents a library, for example a developmental stage of a biological process or a particular biological condition (Wang et al., 2008). Clustering algorithms (Cai et al., 2004, Huang et al., 2008, Sander et al., 2005, Tzanis and Vlahava, 2007, Wang et al., 2007, Wang et al., 2008) provide a useful tool for exploring potentially novel and significant transcript or gene groups in SAGE data. The basic idea of clustering is to divide patterns into groups (clusters) such that patterns in the same cluster are more similar to each other than to patterns in other clusters. A good review of clustering methods for SAGE data can be found in Wang et al. (2008). As noted above, unlike microarray expression technology, SAGE produces profiles consisting of digital output that is quantitative in nature. Traditional distance or similarity measures, such as the Pearson correlation coefficient and Euclidean distance, may therefore not be suitable for SAGE data analysis, and a clustering method for SAGE data should employ a reliable statistical model (Cai et al., 2004, Lu et al., 2005, Wang et al., 2008, Zuyderduyn, 2007). For this purpose, Cai et al. (Cai et al., 2004, Huang et al., 2008) modeled SAGE data with Poisson statistics, developed two Poisson-based distances, and implemented k-means clustering using these two distances. Thygesen and Zwinderman (2006) later proposed a hierarchical Poisson model with a gamma prior, together with three different algorithms for estimating the parameters of the model. Wang et al. (2007) proposed two new clustering algorithms, PoissonS and PoissonHC, based on the adaptation and improvement of Self-Organizing Maps (SOM) and hierarchical clustering techniques. Lu et al. (2005) employed an overdispersed log-linear model to analyze SAGE libraries. Vencio et al. (2004) proposed a Bayesian mixture model to account for within-class variability.
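To illustrate why count data call for a different notion of similarity, the following Python sketch compares Euclidean distance with a chi-square-style dissimilarity motivated by a Poisson model. It is only an illustration in the spirit of the Poisson-based distances mentioned above: the exact definitions used by Cai et al. (2004) are not reproduced here, and the function names and the toy tag counts are assumptions made for this example.

```python
from math import sqrt
from typing import List, Sequence


def cluster_profile(tags: Sequence[Sequence[float]]) -> List[float]:
    """Relative expression profile of a cluster: per-library sums, normalized to 1."""
    d = len(tags[0])
    column_sums = [sum(tag[j] for tag in tags) for j in range(d)]
    grand_total = sum(column_sums)
    return [c / grand_total for c in column_sums]


def chi2_dissimilarity(tag: Sequence[float], profile: Sequence[float]) -> float:
    """Chi-square-style deviation of a tag's counts from a cluster profile.

    Assumption for this sketch: the expected count in each library is the
    tag's total count multiplied by the cluster's relative profile.
    """
    total = sum(tag)
    expected = [total * p for p in profile]
    return sum((o - e) ** 2 / e for o, e in zip(tag, expected) if e > 0)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Euclidean distance between two count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy data (hypothetical): rows are tags, columns are three SAGE libraries.
tag_a = [10, 20, 30]     # rising profile, low abundance
tag_b = [100, 200, 300]  # same rising profile, high abundance
tag_c = [30, 20, 10]     # falling profile, low abundance

profile_b = cluster_profile([tag_b])
chi2_a = chi2_dissimilarity(tag_a, profile_b)  # close to zero: same shape as tag_b
chi2_c = chi2_dissimilarity(tag_c, profile_b)  # large: opposite shape
```

On these toy counts, Euclidean distance places `tag_a` closer to `tag_c` than to `tag_b` because it is dominated by abundance, whereas the profile-based measure groups `tag_a` with `tag_b`, which share the same expression shape.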

Although a number of clustering methods for SAGE data have been proposed (Cai et al., 2004, Huang et al., 2008, Sander et al., 2005, Tzanis and Vlahava, 2007, Wang et al., 2007, Wang et al., 2008), most of them rely on user-defined parameters, and their results may depend heavily on those parameters. For example, k-means and k-medoids require the number of clusters to be specified in advance. In addition, many clustering algorithms for SAGE data, such as k-means and PoissonC, start from a random initial selection, so the clustering result cannot be reproduced. Except in specific situations where complete knowledge of the data set guarantees the validity of the chosen parameters, the parameters can only be chosen empirically.

Recently, Frey and Dueck (2007) proposed a powerful algorithm based on message passing, named Affinity Propagation (AP), which has attracted increasing attention. However, the question of how suitable parameters should be selected has received little attention in the original literature. Based on AP and Poisson statistics, this paper proposes an adaptive clustering method for SAGE data analysis, PoissonAPS, which uses a clustering validation measure as the cost function for merging and splitting clusters. The key characteristics of the proposed method are as follows: (1) it overcomes some limitations of AP; (2) it is a non-parametric clustering method.

The paper is organized as follows. Section 2 gives a brief review of the AP clustering algorithm and the statistical properties of SAGE data. Section 3 introduces the proposed method. Section 4 presents the experimental results. Discussion and conclusions are given in Section 5.


Description of AP

Traditional clustering methods, such as the popular k-means and k-centers methods, often start from a random initial selection of centers or data points (“exemplars”) and then refine the clustering iteratively. As a result, they are quite sensitive to the initial selection, and the clustering result cannot be reproduced. In contrast, AP simultaneously considers all data points as potential exemplars. AP views each data point as a node in a network,
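As a rough illustration of the message-passing idea, the Python sketch below implements the responsibility and availability updates published by Frey and Dueck (2007) on a small similarity matrix. It is a minimal sketch, not the implementation used in this paper: the negative squared distance similarity, the median preference on the diagonal, the damping value, and the fixed iteration count are all assumptions chosen for the example, and a Poisson-based similarity would take their place for SAGE data.

```python
from statistics import median
from typing import List, Sequence, Tuple


def similarity_matrix(points: Sequence[float]) -> List[List[float]]:
    """Negative squared distances; the diagonal holds the shared 'preference'.

    Assumption: the preference is set to the median off-diagonal similarity.
    """
    n = len(points)
    S = [[-float((points[i] - points[j]) ** 2) for j in range(n)] for i in range(n)]
    off_diagonal = [S[i][j] for i in range(n) for j in range(n) if i != j]
    preference = median(off_diagonal)
    for k in range(n):
        S[k][k] = preference
    return S


def affinity_propagation(
    S: List[List[float]], damping: float = 0.7, iterations: int = 200
) -> Tuple[List[int], List[int]]:
    """Return (exemplar indices, exemplar index assigned to each point)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k})
        # a(k,k) = sum of positive r(i',k) for i' != k
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            others = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    new = others
                else:
                    new = min(0.0, R[k][k] + others - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:  # no point chose itself; report everything as unassigned
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy data (hypothetical): two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
exemplars, labels = affinity_propagation(similarity_matrix(points))
```

The diagonal preference and the damping factor are exactly the kind of parameters whose selection is at issue: a lower preference tends to produce fewer clusters, a higher one more.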

Method

The main idea behind PoissonAPS is to adjust the clustering result by using a clustering validation measure as the cost function for merging and splitting clusters, so that the tags can be clustered quickly and automatically. The validation index used here is the silhouette index, defined as follows (Rousseeuw, 1987): for a given clustering result C = {c1, c2, …, cT}, where sample xi (1 ≤ i ≤ N) is assigned to the cluster ct, s(i) (i = 1, …, N) is a confidence indicator on the membership of the ith
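For reference, the standard silhouette value from Rousseeuw (1987) is s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean dissimilarity of sample i to the other members of its own cluster and b(i) is the smallest mean dissimilarity of sample i to any other cluster. The Python sketch below computes it from a precomputed dissimilarity matrix. Comparing the mean silhouette of two candidate partitions, as done at the end, is only an assumed stand-in for how a merge or split might be scored; the function names and toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def silhouette_values(D: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Silhouette s(i) for every sample, given pairwise dissimilarities D.

    Requires at least two clusters. Samples in singleton clusters get 0,
    following the usual convention.
    """
    clusters: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)
            continue
        # a(i): mean dissimilarity to the other members of the same cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean dissimilarity to the members of another cluster
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        values.append((b - a) / denominator if denominator > 0 else 0.0)
    return values


def mean_silhouette(D: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average silhouette over all samples; higher means a better partition."""
    values = silhouette_values(D, labels)
    return sum(values) / len(values)


# Toy data (hypothetical): absolute differences between points on a line.
points = [0, 1, 2, 10, 11, 12]
D = [[abs(p - q) for q in points] for p in points]

compact = [0, 0, 0, 1, 1, 1]    # the two natural groups
scrambled = [0, 1, 0, 1, 0, 1]  # groups mixed together
prefer_compact = mean_silhouette(D, compact) > mean_silhouette(D, scrambled)
```

For SAGE tags, `D` would be filled with a Poisson-based dissimilarity rather than absolute differences; the silhouette computation itself does not change.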

Results and discussion

This section evaluates the performance of PoissonAPS on two real-life SAGE datasets: Mouse Retinal SAGE Data and Human Cancer SAGE Data.

Conclusions

Serial analysis of gene expression is a powerful tool for the comprehensive and quantitative measurement of gene expression and for identifying novel genes. Clustering analysis is a valuable technique for the analysis of SAGE data. Most clustering algorithms for SAGE data are parameter-dependent, relying on settings such as a simple threshold or the number of clusters. Affinity Propagation is a powerful new tool for unsupervised clustering with many advantages. However, the question how selection suitable parameters

Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grant No. 60671033. We thank Dr. George Tzanis for helpful discussions about SAGE data.

References (21)

  • P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
  • S.M. Wang. Understanding SAGE data. Trends Genet. (2007)
  • S. Blackshaw et al. Genomic analysis of mouse retinal development. PLoS Biol. (2004)
  • M.J. Brusco et al. Comment on “Clustering by Passing Messages Between Data Points”. Science (2008)
  • P. Buckhaults et al. Identifying tumor origin using a gene expression-based classification map. Cancer Res. (2003)
  • L. Cai et al. Clustering analysis of SAGE data using a Poisson approach. Genome Biol. (2004)
  • B.J. Frey et al. Clustering by passing messages between data points. Science (2007)
  • B.J. Frey et al. Response to Comment on “Clustering by Passing Messages Between Data Points”. Science (2008)
  • H. Huang et al. Clustering analysis of SAGE transcription profiles using a Poisson approach. Methods Mol. Biol. (2008)
  • B. Larsen et al. Fast and effective text mining using linear-time document clustering


