A Bayesian mixture model to quantify parameters of spatial clustering
Introduction
We propose a Bayesian mixture model for quantifying spatial clustering in the presence of non-clustering objects. The focus lies on estimating the proportion of points lying in clusters as well as the cluster size and radius. The approach is designed to answer questions arising in the context of cutting-edge biophysical research on nanoclusters of Ras proteins that cannot be answered by standard statistical approaches. It does not depend on prior knowledge of the parameters.
The most widely used clustering approaches are probably hierarchical clustering (Johnson, 1967) and partitioning algorithms such as k-means or partitioning around medoids (PAM) (MacQueen, 1967, Kaufman and Rousseeuw, 1990). These approaches require the analyst to define a cutoff in a dendrogram, pre-define the number of clusters, or infer it by some heuristic. They also do not explicitly consider non-clustering objects, although both approaches can produce clusters of size one. These drawbacks have been tackled by more recent methods. For example, Maitra and Ramler (2009) proposed a generalization of the k-means algorithm that explicitly considers scattered points. Some sophisticated grouping algorithms have been proposed that only require specifying, e.g., a maximal cluster size (Scharl and Leisch, 2006), a minimal cluster size (Manley et al., 2008, relying on point trajectories over time) or both a minimal cluster size and an effective maximal cluster radius (Ester et al., 1996, Ankerst et al., 1999). Although such values appear somewhat easier to specify, they require important prior knowledge of the problem, and their choice may strongly influence the results. Sensitivity analyses for the parameter choices, mandatory in all mentioned algorithms, cannot completely rule out an unwanted bias.
In model-based clustering, the distribution giving rise to the observations in a given parameter space is modeled by a mixture of distributions, usually Gaussian ones (McLachlan and Peel, 2000, Fraley and Raftery, 2007). Arbitrary parameter choices are avoided in this framework. While each cluster is generally represented by one mixture component, proposals have recently been made to model each cluster by several mixture components to give the model more flexibility (Baudry et al., 2010). Non-clustering observations, if considered, are generally viewed as noise rather than as a component of interest (see, e.g., Dasgupta and Raftery, 1998, Hennig and Coretto, 2008). A case can be made for fitting mixture models in a Bayesian framework (Frühwirth-Schnatter, 2006, Fritsch and Ickstadt, 2009). Among the advantages are a smaller susceptibility to problematic likelihood shapes such as local maxima or unboundedness, applicability to small sample sizes, and a straightforward generalization to an infinite number of mixture components (Lo, 1984). However, in the Ras application, fully model-based approaches that directly model the protein locations cannot identify all three parameters of interest (proportion of clustered proteins, mean cluster size and mean cluster radius) because of identifiability problems resulting from, e.g., dependencies between cluster size and cluster radius.
We present GAMMICS (GAMma Mixtures for Inference on Cluster Structures), a novel compromise approach accomplishing inference for all three parameters. The core of the approach is a Bayesian model for the squared distances between proteins and their second nearest neighbors, classifying each point as clustered or non-clustered. Estimates for all parameters are obtained by incorporating some algorithmic aspects, inspired by the idea of density-based clustering originally advocated by the DBSCAN algorithm (Ester et al., 1996).
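The quantity at the core of the approach, the squared distance from each point to its second nearest neighbor, can be illustrated with a minimal Python sketch. The function names, the brute-force search, and the fixed mixture parameters below are illustrative assumptions and not the GAMMICS implementation, which estimates its parameters in a Bayesian model. The gamma form rests on a standard result: in a homogeneous planar Poisson pattern of intensity λ, the squared distance to the k-th nearest neighbor follows a gamma distribution with shape k and rate πλ.

```python
import math


def squared_kth_nn_distances(points, k=2):
    """Squared Euclidean distance from each point to its k-th nearest neighbor.

    Brute-force O(n^2 log n) search; adequate for small illustrative patterns.
    """
    if len(points) <= k:
        raise ValueError("need more than k points")
    result = []
    for i, (xi, yi) in enumerate(points):
        d2 = sorted(
            (xi - xj) ** 2 + (yi - yj) ** 2
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        result.append(d2[k - 1])
    return result


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution with the given shape and rate."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def prob_clustered(d2, weight, rate_clustered, rate_background, k=2):
    """Probability that a point belongs to the dense (clustered) component.

    Two-component gamma mixture with FIXED, hypothetical parameters:
    weight is the clustered proportion, and each rate equals pi * intensity
    of the corresponding component.
    """
    dense = weight * gamma_pdf(d2, k, rate_clustered)
    sparse = (1.0 - weight) * gamma_pdf(d2, k, rate_background)
    return dense / (dense + sparse)


points = [(0, 0), (1, 0), (0, 2), (10, 10)]
print(squared_kth_nn_distances(points))  # [4, 5, 5, 181]
```

Small squared distances receive a high probability of belonging to the dense component, large ones a low probability; in a full analysis the mixture weight and rates would be inferred from the data rather than fixed in advance.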
Our method is designed for the needs of biophysical research on clusters of the small GTPase Ras, a protein playing an important role in signal transduction at the plasma membrane. The spatial patterns of Ras proteins exhibit complex and dynamic clustering behavior at the nanoscale and are important for cell growth and, thus, for the development of tumors (Hancock, 2006). This makes them a target of biomedical research, an essential goal being to assess differences in the clustering behavior between distinct experimental conditions, such as healthy cells and tumor cells. A first step towards this goal is to quantify the parameters of the clustering behavior, as estimates of those parameters may subsequently be compared across different experimental conditions.
In the biophysical literature, the K-function and its derivatives have generally been used to estimate the mean cluster radius (see, e.g., Plowman et al., 2005, Tian et al., 2007, Kiskowski et al., 2009). However, this approach yields an estimate of only this one parameter and is, moreover, susceptible to severe biases. Experimental biophysical arguments based on gold-labeled particles have been used in some cases to derive estimates for the proportion of clustered proteins and the cluster size as well.
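For orientation, a naive estimator of Ripley's K-function and of a derived H-function can be sketched as follows. This is a simplified illustration without edge correction. The definition H(r) = L(r) − r with L(r) = sqrt(K(r)/π) is a common convention and is assumed here, since the excerpt does not state the exact variant used; the function names are hypothetical.

```python
import math


def ripley_k(points, r, area):
    """Naive estimate of Ripley's K at distance r (no edge correction).

    K(r) is estimated as area / (n * (n - 1)) times the number of ordered
    pairs of distinct points that lie within distance r of each other.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= r:
                pairs += 1
    return area * pairs / (n * (n - 1))


def h_function(points, r, area):
    """H(r) = L(r) - r with L(r) = sqrt(K(r) / pi).

    Values near zero are expected under complete spatial randomness;
    positive values indicate clustering at scale r.
    """
    return math.sqrt(ripley_k(points, r, area) / math.pi) - r


# Two tight pairs far apart in a 100 x 100 window: strong clustering at r = 2.
pattern = [(0, 0), (1, 0), (50, 50), (51, 50)]
print(h_function(pattern, r=2.0, area=100.0 * 100.0) > 0)  # True
```

In K-function analyses of this kind, the location of the maximum of H(r) over r is often read as an indicator of the cluster radius, which is one route by which the biases mentioned above can enter.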
In a simulation study, our approach compares favorably to a K-function analysis, the DBSCAN algorithm, a Bayesian model-based cluster approach, and a combination of the latter two, both in terms of the misclassification rate and the accuracy of parameter estimation.
This paper is organized as follows: In Section 2, we give a description of the experimental data as well as the simulation of point patterns via the double Matérn cluster process. In Section 3, we describe the GAMMICS method as well as the competing approaches. In Section 4, results of all approaches are compared, while in Section 5, some remaining challenges for future work are discussed.
Section snippets
Experimental data
Ras proteins are small, measuring only 2–3 nm in diameter. In order to visualize Ras nanoclusters at the plasma membrane, fluorescence microscopy is employed. We focus on cells with an above-normal expression level typical for cancer cells by overexpressing Ras tagged to a fluorescent protein. Due to the high expression level, proteins cannot be visualized and detected as single molecules by conventional fluorescence microscopy techniques. Therefore, photoactivated localization microscopy (PALM)
Methods
In this section we describe GAMMICS as well as the competing approaches of the H-function, Bayesian model-based clustering, DBSCAN, and a Bayesian version of DBSCAN. Here, it is assumed that the observed data form a random point pattern in a finite region, with the points corresponding to proteins in a part of the cell membrane.
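Of the competing approaches, DBSCAN is the simplest to illustrate. The following is a minimal brute-force sketch of the density-based idea, assuming the usual two parameters (a neighborhood radius, here `eps`, and a minimal number of points, here `min_pts`); it is not the implementation used in the comparison.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 for noise, 0, 1, ... for clusters.

    A point is a core point if its eps-neighborhood (including itself)
    contains at least min_pts points.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
print(dbscan(pts, eps=1.5, min_pts=4))  # [-1, -1, -1, -1, -1, -1, -1]
```

The two calls differ only in `min_pts`, yet the second one classifies every point as noise, which illustrates how strongly the result can depend on the chosen parameter values.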
Simulation study
A comprehensive simulation study is conducted to demonstrate the performance of the GAMMICS method in comparison to DBSCAN, model-based clustering, the Bayesian DBSCAN proposed by Argiento et al. (2013), and (w.r.t. estimating the cluster radius) the H-function. The values considered for each parameter of the simulation study can be seen in Table 1. We simulate point patterns in regions of 4000 nm in square. For each setting, five random replicates are generated, and eventually the median
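To give an impression of the kind of pattern involved, the sketch below simulates a simplified stand-in: an ordinary Matérn cluster process superimposed on uniformly scattered singletons. It is not the double Matérn cluster process of Section 2, whose definition is not part of this excerpt. All function names and parameter values are hypothetical, and the side length of 4000 is one reading of the region size stated above.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def uniform_in_disc(rng, cx, cy, radius):
    """Draw a point uniformly from the disc with center (cx, cy)."""
    r = radius * math.sqrt(rng.random())
    phi = 2.0 * math.pi * rng.random()
    return cx + r * math.cos(phi), cy + r * math.sin(phi)


def simulate_pattern(side, mean_singletons, mean_clusters, mean_cluster_size,
                     radius, seed=0):
    """Simulate singletons plus Matern-type clusters in a square window.

    Returns (points, labels), where labels[i] is True for clustered points.
    Cluster centers are drawn inside the window only, and offspring falling
    outside the window are discarded, so clusters near the border are thinned.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(poisson(rng, mean_singletons)):
        points.append((rng.uniform(0, side), rng.uniform(0, side)))
        labels.append(False)
    for _ in range(poisson(rng, mean_clusters)):
        cx, cy = rng.uniform(0, side), rng.uniform(0, side)
        for _ in range(poisson(rng, mean_cluster_size)):
            x, y = uniform_in_disc(rng, cx, cy, radius)
            if 0 <= x <= side and 0 <= y <= side:
                points.append((x, y))
                labels.append(True)
    return points, labels


points, labels = simulate_pattern(4000.0, 100, 20, 5, 30.0, seed=1)
print(len(points), sum(labels) / len(points))  # pattern size, clustered proportion
```

Because the true labels are returned alongside the points, a pattern simulated this way can be used to compute misclassification rates and the error of an estimated clustered proportion.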
Discussion
We have introduced GAMMICS, a novel approach to estimate parameters of spatial clustering in the presence of singletons, specifically the proportion of clustered points, the mean cluster size, and the mean cluster radius. It builds on frequentist modeling ideas proposed by Byers and Raftery (1998) for distinguishing large objects from noise and employs them for an analysis of small clusters in a Bayesian framework, relying on a gamma mixture model for the squared distances of points to their second nearest neighbors. It
Acknowledgments
Financial support of the Deutsche Forschungsgemeinschaft (DFG; Research Training Group 1032/2 ‘Statistical Modeling’, grant no. IC 5/3-1 to K.I., grant no. VE 579/3-1 to P.J.V., grant no. SCHW 1508/3-1 to H.S.) and the German Ministry of Education and Research (BMBF; FORSYS initiative, grant no. 0315257 to P.J.V.) is gratefully acknowledged. We thank Alan E. Gelfand for helpful discussions on concepts and the presentation of the method as well as the editors and three anonymous reviewers for
References (38)
- et al. Ultra-high resolution imaging by fluorescence photoactivation localization microscopy. Biophys. J. (2006)
- et al. On the use of Ripley’s K-function and its derivatives to analyze domain size. Biophys. J. (2009)
- et al. Lighting up the cell surface with evanescent wave microscopy. Trends Cell Biol. (2001)
- et al. OPTICS: Ordering points to identify the clustering structure
- et al. A ‘density-based’ algorithm for cluster analysis using species sampling Gaussian mixture models. J. Comput. Graph. Statist. (2013)
- et al. Non- and semiparametric estimation of interaction in inhomogeneous point patterns. Stat. Neerl. (2000)
- et al. Combining mixture components for clustering. J. Comput. Graph. Statist. (2010)
- et al. Imaging intracellular fluorescent proteins at nanometer resolution. Science (2006)
- Bayesian cluster analysis. Biometrika (1978)
- et al. Nearest-neighbour clutter removal for estimating features in spatial point processes. J. Amer. Statist. Assoc. (1998)