A Poisson-based adaptive affinity propagation clustering for SAGE data

doi:10.1016/j.compbiolchem.2009.11.001

Computational Biology and Chemistry

Volume 34, Issue 1, February 2010, Pages 63-70


Abstract

Serial analysis of gene expression (SAGE) is a powerful tool for obtaining gene expression profiles, and clustering is a valuable technique for analyzing SAGE data. In this paper, we propose an adaptive clustering method for SAGE data analysis, namely PoissonAPS. The method builds on a recently proposed clustering algorithm, Affinity Propagation (AP). Although the AP algorithm has demonstrated good performance on many different data sets, it has several limitations. PoissonAPS overcomes these limitations by using a clustering validation measure as the cost function for merging and splitting clusters; as a result, it can cluster SAGE data automatically, without user-specified parameters. We evaluated PoissonAPS and compared its performance with that of other methods on several real-life SAGE datasets. The experimental results show that PoissonAPS can produce meaningful and interpretable clusters for SAGE data.

Introduction

Serial analysis of gene expression (SAGE) is a powerful tool for obtaining global gene expression profiles at the mRNA level. The original technique was introduced by Velculescu et al. (1995). SAGE has been used to study the transcriptome of a variety of tissue and cell types from a diverse set of organisms (Zuyderduyn, 2007). Unlike microarray techniques, SAGE detects unknown transcripts without requiring prior knowledge of what is present in the sample under analysis (Wang, 2007), and it can provide a statistical description of the mRNA population present in a cell. Wang (2007) reviewed the features of SAGE data, including the specificity of SAGE tags with respect to their original transcripts, the quantitative nature of SAGE data for differentially expressed genes, the reproducibility of SAGE, its comparability with microarrays, and its future potential.

A number of methods have been developed to analyze the large amounts of data that SAGE produces. A SAGE data set can be viewed as an n × d matrix, where n is the number of tags and d is the number of SAGE libraries. Each row of the matrix represents a SAGE tag, and each column represents a library, such as a developmental stage of a biological process or a particular biological condition (Wang et al., 2008). Clustering algorithms (Cai et al., 2004, Huang et al., 2008, Sander et al., 2005, Tzanis and Vlahava, 2007, Wang et al., 2007, Wang et al., 2008) provide a useful tool for exploring potentially novel and significant transcript or gene groups in SAGE data. The basic concept of clustering is to divide patterns into groups (clusters) such that patterns in the same cluster are more similar to each other than to patterns in other clusters. A review of clustering methods for SAGE data can be found in Wang et al. (2008). As noted above, unlike microarray expression technology, SAGE produces profiles consisting of digital output (tag counts) that is quantitative in nature. Therefore, traditional distance or similarity measures, such as the Pearson correlation coefficient and Euclidean distance, may not be suitable for SAGE data analysis, and a clustering method for SAGE data should employ a reliable statistical model (Cai et al., 2004, Lu et al., 2005, Wang et al., 2008, Zuyderduyn, 2007). For this purpose, Cai and colleagues (Cai et al., 2004, Huang et al., 2008) modeled SAGE data with Poisson statistics, developed two Poisson-based distances, and implemented k-means clustering using these two distances. Afterwards, Thygesen and Zwinderman (2006) proposed a hierarchical Poisson model with a gamma prior, together with three different algorithms for estimating the parameters of the model. Furthermore, Wang et al. (2007) proposed two new clustering algorithms, namely PoissonS and PoissonHC, based on the adaptation and improvement of Self-Organizing Maps (SOM) and hierarchical clustering techniques. Lu et al. (2005) employed an overdispersed log-linear model to analyze SAGE libraries. Vencio et al. (2004) proposed a Bayesian mixture model to account for within-class variability.
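To make the idea of a Poisson-based distance on the n × d count matrix concrete, the following is a minimal sketch only. It assumes a model in which the counts of a tag across libraries are Poisson with mean equal to the tag's total count times a shared cluster profile, and it uses a chi-square-style discrepancy between observed and expected counts. The function names `cluster_profile` and `chi2_poisson_distance` are hypothetical, and the exact distances defined by Cai et al. (2004) may differ from this simplified form.

```python
import numpy as np

def cluster_profile(counts):
    """Estimate a shared expression profile for one cluster.

    counts: (m, d) array of tag counts, one row per tag in the cluster,
            one column per SAGE library.
    Returns a length-d vector that sums to 1 (assumed model, see lead-in).
    """
    counts = np.asarray(counts, dtype=float)
    return counts.sum(axis=0) / counts.sum()

def chi2_poisson_distance(tag_counts, profile, eps=1e-12):
    """Chi-square-style discrepancy between a tag and a cluster profile.

    The expected count in each library is the tag's total count scaled by
    the cluster profile; eps only guards against division by zero.
    """
    tag_counts = np.asarray(tag_counts, dtype=float)
    expected = tag_counts.sum() * profile
    return float(np.sum((tag_counts - expected) ** 2 / (expected + eps)))
```

Under this sketch, a tag whose counts are proportional to the cluster profile has distance close to zero regardless of its overall abundance, which is the property that Euclidean distance on raw counts lacks.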

Although a number of clustering methods for SAGE data have been proposed (Cai et al., 2004, Huang et al., 2008, Sander et al., 2005, Tzanis and Vlahava, 2007, Wang et al., 2007, Wang et al., 2008), most of them rely on user-defined parameters, so their results may depend heavily on the values chosen. For example, k-means and k-medoids require the number of clusters to be specified in advance. In addition, many clustering algorithms for SAGE data, such as k-means and PoissonC, start from a random initial selection, so the clustering result may not be reproducible from run to run. Unless we have complete knowledge of the data set to ensure the validity of the chosen parameters, the parameters can only be determined empirically.

Recently, Frey and Dueck (2007) proposed a powerful algorithm based on message passing, named Affinity Propagation (AP), which has attracted increasing attention. However, the question of how to select suitable parameters received little attention in the original literature. Based on AP and Poisson statistics, this paper proposes an adaptive clustering method for SAGE data analysis, namely PoissonAPS, which uses a clustering validation measure as the cost function for merging and splitting clusters. The key characteristics of the proposed method are as follows: (1) it overcomes some limitations of AP; (2) it requires no user-specified parameters.

The organization of this paper is as follows. In Section 2, we give a brief review of AP clustering algorithm and statistical properties of SAGE data. Then, we introduce the proposed method in Section 3. Section 4 presents the experimental results. Discussions and conclusions are given in Section 5.

Section snippets

Description of AP

Traditional clustering methods, such as the popular k-means and k-centers methods, often start from a random initial selection of centers or data points (“exemplars”) and then refine the clustering iteratively. As a result, they are quite sensitive to the initial selection, and the clustering result may not be reproducible. In contrast, AP simultaneously considers all data points as potential exemplars. AP views each data point as a node in a network,
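The message-passing scheme of AP can be sketched as follows. This is a minimal NumPy illustration of the standard responsibility and availability updates described by Frey and Dueck (2007), not the implementation used in this paper. The function name `affinity_propagation`, the damping value, and the fixed iteration count are assumptions made for illustration; the diagonal of the similarity matrix `S` is assumed to hold the "preference" of each point to become an exemplar.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200):
    """Plain AP on an (n, n) similarity matrix S (diagonal = preferences).

    Returns (exemplars, labels); labels[i] is the index of the exemplar
    chosen for point i, or -1 if no exemplar emerged.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    idx = np.arange(n)
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        first = np.argmax(AS, axis=1)
        first_val = AS[idx, first]
        AS[idx, first] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - first_val[:, None]
        R_new[idx, first] = S[idx, first] - second_val
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[idx, idx].copy()
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels
```

Because every point starts as a candidate exemplar and the updates are deterministic, repeated runs on the same similarity matrix give the same result. The number of clusters is still influenced by the preference values on the diagonal, which is the kind of parameter choice the surrounding text refers to.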

Method

The main idea behind PoissonAPS is to adjust the clustering result by using a clustering validation measure as the cost function for merging and splitting clusters, so that the tags can be clustered quickly and automatically. The validation index used here is the silhouette index, defined as follows (Rousseeuw, 1987): for a given clustering result C = {c₁, c₂, …, c_T} in which sample x_i (1 ≤ i ≤ N) is assigned to cluster c_t, s(i) (i = 1, …, N) is a confidence indicator on the membership of the ith
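For reference, the silhouette index in its standard form (Rousseeuw, 1987) can be computed as in the sketch below. It assumes a precomputed symmetric distance matrix `D` with a zero diagonal and follows the textbook definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from sample i to the other members of its own cluster and b(i) is the smallest mean distance from sample i to the members of any other cluster. The function name `silhouette_values` is hypothetical, and the variant used by PoissonAPS with Poisson-based measures may differ in detail.

```python
import numpy as np

def silhouette_values(D, labels):
    """Per-sample silhouette s(i) from a distance matrix and cluster labels.

    D: (N, N) symmetric distance matrix with zeros on the diagonal.
    labels: length-N array of cluster assignments.
    Samples in singleton clusters get s(i) = 0 by convention.
    """
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    if len(clusters) < 2:
        return s
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue
        # Mean distance to the other members of the same cluster
        # (D[i, i] = 0, so dividing by n_own - 1 excludes the sample itself).
        a = D[i, own].sum() / (n_own - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        if denom > 0:
            s[i] = (b - a) / denom
    return s
```

Averaging `silhouette_values(D, labels)` over all samples gives a single score for a clustering, so a candidate merge or split can be accepted when it raises that average. This acceptance rule is one plausible reading of "cost function of merging and splitting" and is stated here as an assumption.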

Results and discussion

This section evaluates the performance of PoissonAPS on two real-life SAGE datasets: the Mouse Retinal SAGE Data and the Human Cancer SAGE Data.

Conclusions

Serial analysis of gene expression is a powerful tool for the comprehensive and quantitative measurement of gene expression and for identifying novel genes. Clustering is a valuable technique for the analysis of SAGE data. Most clustering algorithms for SAGE data are parameter-dependent, relying on, for example, a threshold or the number of clusters. Affinity Propagation is a powerful new tool for unsupervised clustering with many advantages. However, the question of how to select suitable parameters

Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grant No. 60671033. We thank Dr. George Tzanis for helpful discussions about SAGE data.

References (21)

P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
S.M. Wang. Understanding SAGE data. Trends Genet. (2007)
S. Blackshaw et al. Genomic analysis of mouse retinal development. PLoS Biol. (2004)
M.J. Brusco et al. Comment on “Clustering by Passing Messages Between Data Points”. Science (2008)
P. Buckhaults et al. Identifying tumor origin using a gene expression-based classification map. Cancer Res. (2003)
L. Cai et al. Clustering analysis of SAGE data using a Poisson approach. Genome Biol. (2004)
B.J. Frey et al. Clustering by passing messages between data points. Science (2007)
B.J. Frey et al. Response to Comment on “Clustering by Passing Messages Between Data Points”. Science (2008)
H. Huang et al. Clustering analysis of SAGE transcription profiles using a Poisson approach. Methods Mol. Biol. (2008)
B. Larsen et al. Fast and effective text mining using linear-time document clustering




Article type: Brief communication
