Abstract
In this paper, we introduce a new method for the cluster analysis of longitudinal data that focuses on determining uncertainty levels for cluster memberships. The method uses the Dirichlet-t distribution, which exploits the robustness of the Student-t distribution within a Bayesian semi-parametric framework and, together with robust clustering of subjects, evaluates the uncertainty level of each subject's membership in its cluster. We treat the number of clusters and the uncertainty levels as unknown when fitting Dirichlet process mixture models. Two simulation studies are conducted to demonstrate the proposed methodology. The method is then applied to cluster a real data set from gene expression studies.
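To make concrete why a Dirichlet process prior lets the number of clusters remain unknown, the following minimal sketch draws a random partition of subjects from the Chinese restaurant process, a standard representation of the Dirichlet process. It is an illustration only and not the sampler used in the paper; the function name `crp_partition`, the precision value `alpha=1.0`, and the sample size are assumptions chosen for the example.

```python
import random

def crp_partition(n_subjects, alpha, seed=0):
    """Draw cluster labels for n_subjects from a Chinese restaurant process.

    alpha is the precision parameter of the Dirichlet process (assumed value
    in the example below). Returns the labels and the cluster sizes.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of subjects currently in cluster k
    for _ in range(n_subjects):
        # An existing cluster k is chosen with weight counts[k];
        # a new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, counts

# Hypothetical example: 50 subjects, precision parameter 1.0.
labels, counts = crp_partition(n_subjects=50, alpha=1.0, seed=1)
n_clusters = len(counts)  # random; not fixed in advance
```

The number of occupied clusters is an outcome of the draw rather than an input, and larger values of `alpha` tend to produce more clusters.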
Notes
The probability density function of a Dirichlet random vector \(\mathbf {u}=(u_1,\ldots ,u_k)^\prime \) with \(u_1,\ldots ,u_k\ge 0\) and \(\sum _{i=1}^k u_i=1\) is given by \(f(\mathbf {u})\propto \prod _{i=1}^k u_i^{\alpha _i-1}\), where \(\alpha _1,\ldots ,\alpha _k>0\) (see, e.g., Sorensen and Gianola 2002).
The probability density function of an inverse-Wishart random matrix \(\mathbf {U}\) of order T is given by \(f(\mathbf {U})\propto |\mathbf {U}|^{-\frac{\tau +T+1}{2} } e^{-\frac{1}{2}\mathrm {tr}\left( \varPsi \mathbf {U}^{-1}\right) }\), where \(\tau \) and \(\varPsi \) are the shape and the scale parameters, respectively (see, e.g., Sorensen and Gianola 2002).
The probability density function of an inverse-gamma random variable u is given by \(f(u)\propto u^{-(a+1) } e^{-b/u}\), where a and b are the shape and the scale parameters, respectively (see, e.g., Sorensen and Gianola 2002).
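The kernels in the notes above can be checked numerically. The sketch below uses only the Python standard library and adds the standard normalizing constants for the Dirichlet and inverse-gamma densities, which the notes omit. It also checks that, for T = 1, the inverse-Wishart kernel coincides up to a constant with an inverse-gamma kernel with shape \(\tau /2\) and scale \(\varPsi /2\). The function names and numerical values are assumptions made for illustration, and the Dirichlet function is restricted to strictly positive components to avoid taking the logarithm of zero.

```python
import math

def dirichlet_logpdf(u, alpha):
    """Normalized log density of a Dirichlet(alpha) vector u (all u_i > 0)."""
    if any(x <= 0 for x in u) or abs(sum(u) - 1.0) > 1e-9:
        raise ValueError("u must be strictly positive and sum to 1")
    # log of Gamma(sum alpha_i) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, u))

def inv_gamma_logpdf(u, a, b):
    """Normalized log density of an inverse-gamma(a, b) variable, u > 0."""
    return a * math.log(b) - math.lgamma(a) - (a + 1.0) * math.log(u) - b / u

def inv_wishart_log_kernel_1d(u, tau, psi):
    """Log of the unnormalized inverse-Wishart kernel when T = 1 (scalar u > 0)."""
    T = 1
    return -0.5 * (tau + T + 1) * math.log(u) - 0.5 * psi / u

# A flat Dirichlet(1, 1, 1) has constant density Gamma(3) = 2 on the simplex.
flat_density = math.exp(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))

# For T = 1 the inverse-Wishart kernel differs from the inverse-gamma(tau/2, psi/2)
# log density only by a constant that does not depend on u.
tau, psi = 4.0, 6.0
diff_a = inv_wishart_log_kernel_1d(0.7, tau, psi) - inv_gamma_logpdf(0.7, tau / 2, psi / 2)
diff_b = inv_wishart_log_kernel_1d(2.5, tau, psi) - inv_gamma_logpdf(2.5, tau / 2, psi / 2)
```

Working on the log scale avoids numerical underflow when these densities are multiplied together inside a sampler.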
References
Andrews JL, McNicholas PD (2011a) Extending mixtures of multivariate \(t\)-factor analyzers. Stat Comput 21(3):361–373
Andrews JL, McNicholas PD (2011b) Mixtures of modified \(t\)-factor analyzers for model-based clustering, classification, and discriminant analysis. J Stat Plan Inference 141(4):1479–1486
Baek J, McLachlan GJ (2011) Mixtures of common \(t\)-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27(9):1269–1276
Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. J Stat Comput Simul 86(4):771–787
Chen L, Brown SD (2014) Bayesian estimation of membership uncertainty in model-based clustering. J Chemometr 28(5):358–369
Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown P, Herskowitz I (1998) The transcriptional program of sporulation in budding yeast. Science 282:699–705
Damien P, Wakefield J, Walker S (1999) Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. J R Stat Soc B 61:331–344
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B 39:1–38
Dorazio RM (2009) On selecting a prior for the precision parameter of Dirichlet process mixture models. J Stat Plan Inference 139:3384–3390
Escobar MD (1994) Estimating normal means with a Dirichlet process prior. J Am Stat Assoc 89(425):268–277
Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1:209–230
Finegold M, Drton M (2014) Robust Bayesian graphical modeling using Dirichlet \(t\)-distributions. Bayesian Anal 9(3):521–550
Fraley C, Raftery AE (1999) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588
Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
Gilks WR, Wild P (1992) Adaptive rejection sampling for Gibbs sampling. Appl Stat 41(2):337–348
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
Heinzl F, Tutz G (2013) Clustering in linear mixed models with approximate Dirichlet process mixtures using EM algorithm. Stat Model 13:41–67
Heinzl F, Fahrmeir L, Kneib T (2012) Additive mixed models with Dirichlet process mixture and P-spline priors. Adv Stat Anal 96:47–68
Ishwaran H, James LF (2001) Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 96(453):161–173
Ishwaran H, James LF (2002) Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. J Comput Graph Stat 11:508–532
Ismail MMB, Frigui H (2010) Possibilistic clustering based on robust modeling of finite generalized Dirichlet mixture. In: The 20th international conference on pattern recognition, pp 573–576
Ismail MMB, Frigui H (2014) Unsupervised clustering and feature weighting based on generalized Dirichlet mixture modeling. Inf Sci 274:35–54
Laird NM, Ware JH (1982) Random effects models for longitudinal data. Biometrics 38:963–974
Li Y, Müller P, Lin X (2011) Center-adjusted inference for a nonparametric Bayesian random effect distribution. Stat Sinica 21(3):1201–1223
Lin TI (2014) Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Comput Stat Data Anal 71:183–195
Lin TI, Ho HJ, Chen CL (2009) Analysis of multivariate skew normal models with incomplete data. J Multivar Anal 100:2337–2351
Lin TI, McNicholas PD, Hsiu JH (2014) Capturing patterns via parsimonious t mixture models. Stat Probab Lett 88:80–87
Lunn D, Spiegelhalter D, Thomas A, Best N (2009) The BUGS project: evolution, critique and future directions (with discussion). Stat Med 28:3049–3082
MacEachern SN (1994) Estimating normal means with a conjugate style Dirichlet process prior. Commun Stat 23:727–741
McNicholas PD (2013) Model-based clustering and classification via mixtures of multivariate \(t\)-distributions. In: Giudici P, Ingrassia S, Vichi M (eds) Statistical models for data analysis, studies in classification, data analysis, and knowledge organization. Springer International Publishing, Heidelberg
McNicholas PD, Subedi S (2012) Clustering gene expression time course data using mixtures of multivariate \(t\)-distributions. J Stat Plan Inference 142:1114–1127
Morris K, McNicholas PD, Scrucca L (2013) Dimension reduction for model-based clustering via mixtures of multivariate \(t\)-distributions. Adv Data Anal Classif 7(3):321–338
Munoz A, Carey V, Schouten JP, Segal M, Rosner B (1992) A parametric family of correlation structures for the analysis of longitudinal data. Biometrics 48(3):733–742
Rasmussen CE, de la Cruz BJ, Ghahramani Z, Wild DL (2009) Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Trans Comput Biol Bioinform 6:615–627
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Sethuraman J (1994) A constructive definition of Dirichlet priors. Stat Sinica 4:639–650
Sorensen D, Gianola D (2002) Likelihood, Bayesian and MCMC methods in quantitative genetics. Springer, New York
Steane MA, McNicholas PD, Yada R (2012) Model-based classification via mixtures of multivariate \(t\)-factor analyzers. Commun Stat Simul Comput 41(4):510–523
Wakefield JC, Zhou C, Self SG (2003) Modelling gene expression over time: curve clustering with informative prior distributions. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian statistics, vol 7. Oxford University Press, Oxford, pp 721–732
Wang WL (2013) Multivariate t linear mixed models for irregularly observed multiple repeated measures with missing outcomes. Biometr J 55:554–571
Wang WL, Fan TH (2011) Estimation in multivariate t linear mixed models for multiple longitudinal data. Stat Sinica 21:1857–1880
Wang WL, Lin TI (2014) Multivariate t nonlinear mixed-effects models for multi-outcome longitudinal data with missing values. Stat Med 33:3029–3046
Wang WL, Lin TI (2015) Robust model-based clustering via mixtures of skew-t distributions with missing information. Adv Data Anal Classif 9(4):423–445
Wang L, Wang X (2013) Hierarchical Dirichlet process model for gene expression clustering. EURASIP J Bioinform Syst Biol 2013:5
Acknowledgments
The authors gratefully acknowledge two reviewers for their valuable comments and suggestions.
Cite this article
Rikhtehgaran, R., Kazemi, I. The determination of uncertainty levels in robust clustering of subjects with longitudinal observations using the Dirichlet process mixture. Adv Data Anal Classif 10, 541–562 (2016). https://doi.org/10.1007/s11634-016-0262-x
DOI: https://doi.org/10.1007/s11634-016-0262-x