The determination of uncertainty levels in robust clustering of subjects with longitudinal observations using the Dirichlet process mixture

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

In this paper we introduce a new method for the cluster analysis of longitudinal data that focuses on determining uncertainty levels for cluster memberships. The method uses the Dirichlet-t distribution, which exploits the robustness of the Student-t distribution within a Bayesian semi-parametric framework and, together with robust clustering of subjects, evaluates the uncertainty level of each subject's membership in its cluster. We treat the number of clusters and the uncertainty levels as unknown when fitting Dirichlet process mixture models. Two simulation studies are conducted to demonstrate the proposed methodology. The method is then applied to cluster a real data set taken from gene expression studies.

Notes

  1. The probability density function of a Dirichlet random vector \(\mathbf {u}=(u_1,\ldots ,u_k)^\prime \) with \(u_1,\ldots ,u_k\ge 0\) and \(\sum _{i=1}^k u_i=1\) is given by \(f(\mathbf {u})\propto \prod _{i=1}^k u_i^{\alpha _i-1}\), where \(\alpha _1,\ldots ,\alpha _k>0\) (see, e.g., Sorensen and Gianola 2002).

  2. The probability density function of an inverse-Wishart random matrix \(\mathbf {U}\) of order \(T\) is given by \(f(\mathbf {U})\propto |\mathbf {U}|^{-\frac{\tau +T+1}{2} } e^{-\frac{1}{2}\mathrm {tr}\left( \varPsi \mathbf {U}^{-1}\right) }\), where \(\tau \) and \(\varPsi \) are the shape and the scale parameters, respectively (see, e.g., Sorensen and Gianola 2002).

  3. The probability density function of an inverse-gamma random variable \(u\) is given by \(f(u)\propto u^{-(a+1) } e^{-b/u}\), where \(a\) and \(b\) are the shape and the scale parameters, respectively (see, e.g., Sorensen and Gianola 2002).
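
As a rough illustration of Notes 1 and 3, the sketch below evaluates the Dirichlet and inverse-gamma log-densities in plain Python. The notes give the densities only up to proportionality; the normalizing constants used here, \(\Gamma (\sum _i \alpha _i)/\prod _i \Gamma (\alpha _i)\) and \(b^a/\Gamma (a)\), are the standard ones and are not quoted from the article. The function names `dirichlet_logpdf` and `inverse_gamma_logpdf` are hypothetical and chosen only for this example.

```python
import math


def dirichlet_logpdf(u, alpha):
    """Log-density of a Dirichlet(alpha) vector u (Note 1).

    The kernel prod_i u_i^(alpha_i - 1) is taken from Note 1; the
    normalizing constant is the standard one and is an addition here.
    """
    if len(u) != len(alpha):
        raise ValueError("u and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_i must be positive")
    # Simplification: boundary points (u_i = 0) are treated as outside the
    # support so that log(u_i) is always defined.
    if any(x <= 0 for x in u) or abs(sum(u) - 1.0) > 1e-9:
        return -math.inf
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(x) for a, x in zip(alpha, u))
    return log_norm + log_kernel


def inverse_gamma_logpdf(u, a, b):
    """Log-density of an inverse-gamma variable with shape a and scale b (Note 3).

    The kernel u^-(a+1) * exp(-b/u) is taken from Note 3; the constant
    b^a / Gamma(a) is the standard normalization and is an addition here.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    if u <= 0:
        return -math.inf
    return a * math.log(b) - math.lgamma(a) - (a + 1.0) * math.log(u) - b / u


if __name__ == "__main__":
    # Dirichlet(1, 1, 1) is uniform on the simplex, so the log-density is
    # the same at every interior point.
    print(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
    print(inverse_gamma_logpdf(1.0, 2.0, 1.0))
```

Working on the log scale avoids underflow when the \(\alpha _i\) or \(a\) are large, which is usually what a sampler needs when comparing candidate values. The inverse-Wishart density of Note 2 is left out because it requires a matrix determinant and trace, which go beyond the standard library.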

References

  • Andrews JL, McNicholas PD (2011a) Extending mixtures of multivariate \(t\)-factor analyzers. Stat Comput 21(3):361–373

  • Andrews JL, McNicholas PD (2011b) Mixtures of modified \(t\)-factor analyzers for model-based clustering, classification, and discriminant analysis. J Stat Plan Inference 141(4):1479–1486

  • Baek J, McLachlan GJ (2011) Mixtures of common \(t\)-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27(9):1269–1276

  • Bai X, Chen K, Yao W (2016) Mixture of linear mixed models using multivariate t distribution. J Stat Comput Simul 86(4):771–787

  • Chen L, Brown SD (2014) Bayesian estimation of membership uncertainty in model-based clustering. J Chemometr 28(5):358–369

  • Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown P, Herskowitz I (1998) The transcriptional program of sporulation in budding yeast. Science 282:699–705

  • Damien P, Wakefield J, Walker S (1999) Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. J R Stat Soc B 61:331–344

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B 39:1–38

  • Dorazio RM (2009) On selecting a prior for the precision parameter of Dirichlet process mixture models. J Stat Plan Inference 139:3384–3390

  • Escobar MD (1994) Estimating normal means with a Dirichlet process prior. J Am Stat Assoc 89(425):268–277

  • Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1:209–230

  • Finegold M, Drton M (2014) Robust Bayesian graphical modeling using Dirichlet t-distributions. Bayesian Anal 9(3):521–550

  • Fraley C, Raftery AE (1999) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588

  • Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741

  • Gilks WR, Wild P (1992) Adaptive rejection sampling for Gibbs sampling. Appl Stat 41(2):337–348

  • Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109

  • Heinzl F, Tutz G (2013) Clustering in linear mixed models with approximate Dirichlet process mixtures using EM algorithm. Stat Model 13:41–67

  • Heinzl F, Fahrmeir L, Kneib T (2012) Additive mixed models with Dirichlet process mixture and P-spline priors. Adv Stat Anal 96:47–68

  • Ishwaran H, James LF (2001) Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 96(453):161–173

  • Ishwaran H, James LF (2002) Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. J Comput Graph Stat 11:508–532

  • Ismail MMB, Frigui H (2010) Possibilistic clustering based on robust modeling of finite generalized Dirichlet mixture. In: The 20th international conference on pattern recognition, pp 573–576

  • Ismail MMB, Frigui H (2014) Unsupervised clustering and feature weighting based on generalized Dirichlet mixture modeling. Inf Sci 274:35–54

  • Laird NM, Ware JH (1982) Random effects models for longitudinal data. Biometrics 38:963–974

  • Li Y, Müller P, Lin X (2011) Center-adjusted inference for a nonparametric Bayesian random effect distribution. Stat Sinica 21(3):1201–1223

  • Lin TI (2014) Learning from incomplete data via parameterized t mixture models through eigenvalue decomposition. Comput Stat Data Anal 71:183–195

  • Lin TI, Ho HJ, Chen CL (2009) Analysis of multivariate skew normal models with incomplete data. J Multivar Anal 100:2337–2351

  • Lin TI, McNicholas PD, Ho HJ (2014) Capturing patterns via parsimonious t mixture models. Stat Probab Lett 88:80–87

  • Lunn D, Spiegelhalter D, Thomas A, Best N (2009) The BUGS project: evolution, critique and future directions (with discussion). Stat Med 28:3049–3082

  • MacEachern SN (1994) Estimating normal means with a conjugate style Dirichlet process prior. Commun Stat 23:727–741

  • McNicholas PD (2013) Model-based clustering and classification via mixtures of multivariate \(t\)-distributions. In: Giudici P, Ingrassia S, Vichi M (eds) Statistical models for data analysis, studies in classification, data analysis, and knowledge organization. Springer International Publishing, Heidelberg

  • McNicholas PD, Subedi S (2012) Clustering gene expression time course data using mixtures of multivariate \(t\)-distributions. J Stat Plan Inference 142:1114–1127

  • Morris K, McNicholas PD, Scrucca L (2013) Dimension reduction for model-based clustering via mixtures of multivariate \(t\)-distributions. Adv Data Anal Classif 7(3):321–338

  • Munoz A, Carey V, Schouten JP, Segal M, Rosner B (1992) A parametric family of correlation structures for the analysis of longitudinal data. Biometrics 48(3):733–742

  • Rasmussen CE, de la Cruz BJ, Ghahramani Z, Wild DL (2009) Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Trans Comput Biol Bioinform 6:615–627

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464

  • Sethuraman J (1994) A constructive definition of Dirichlet priors. Stat Sinica 4:639–650

  • Sorensen D, Gianola D (2002) Likelihood, Bayesian and MCMC methods in quantitative genetics. Springer, New York

  • Steane MA, McNicholas PD, Yada R (2012) Model-based classification via mixtures of multivariate \(t\)-factor analyzers. Commun Stat Simul Comput 41(4):510–523

  • Wakefield JC, Zhou C, Self SG (2003) Modelling gene expression over time: curve clustering with informative prior distributions. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian statistics, vol 7. Oxford University Press, Oxford, pp 721–732

  • Wang WL (2013) Multivariate t linear mixed models for irregularly observed multiple repeated measures with missing outcomes. Biometr J 55:554–571

  • Wang WL, Fan TH (2011) Estimation in multivariate t linear mixed models for multiple longitudinal data. Stat Sinica 21:1857–1880

  • Wang WL, Lin TI (2014) Multivariate t nonlinear mixed-effects models for multi-outcome longitudinal data with missing values. Stat Med 33:3029–3046

  • Wang WL, Lin TI (2015) Robust model-based clustering via mixtures of skew-t distributions with missing information. Adv Data Anal Classif 9(4):423–445

  • Wang L, Wang X (2013) Hierarchical Dirichlet process model for gene expression clustering. EURASIP J Bioinform Syst Biol 2013:5

Acknowledgments

The authors gratefully acknowledge two reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Reyhaneh Rikhtehgaran.

Cite this article

Rikhtehgaran, R., Kazemi, I. The determination of uncertainty levels in robust clustering of subjects with longitudinal observations using the Dirichlet process mixture. Adv Data Anal Classif 10, 541–562 (2016). https://doi.org/10.1007/s11634-016-0262-x

  • DOI: https://doi.org/10.1007/s11634-016-0262-x
