Abstract
In the framework of model-based cluster analysis, finite mixtures of Gaussian components are an important class of statistical models for quantitative variables. Within this class, we propose novel Gaussian parsimonious clustering models defined through constraints on the component-specific variance matrices. Specifically, the proposed models assume that the variables can be partitioned into groups that are conditionally independent within components, so that the component-specific variance matrices have a block diagonal structure. This approach extends the methods for model-based cluster analysis and makes them more flexible and versatile. In this paper, Gaussian mixture models are studied under this assumption. Identifiability conditions are proved, and the model parameters are estimated by maximum likelihood using the Expectation-Maximization (EM) algorithm. The Bayesian information criterion (BIC) is proposed for selecting the partition of the variables into conditionally independent groups, and the consistency of this criterion is proved under regularity conditions. A hierarchical algorithm is suggested for examining and comparing models with different partitions of the set of variables. A wide class of parsimonious Gaussian models is also presented by parameterizing the component-variance matrices according to their spectral decomposition. The effectiveness and usefulness of the proposed methodology are illustrated with two examples based on real datasets.
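As a concrete illustration of the block diagonal structure described in the abstract, the following sketch (in Python with NumPy; the variable partition and covariance values are hypothetical, not taken from the paper's examples) builds a component variance matrix as a direct sum of group-specific blocks and checks that the component density factorizes over the conditionally independent groups:

```python
import numpy as np

def block_diag(*blocks):
    """Direct sum: assemble a component covariance from group blocks."""
    P = sum(b.shape[0] for b in blocks)
    S = np.zeros((P, P))
    i = 0
    for b in blocks:
        d = b.shape[0]
        S[i:i + d, i:i + d] = b
        i += d
    return S

def log_gauss(x, mu, Sigma):
    """Log-density of a multivariate Gaussian."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(Sigma, d))

# Two groups of variables, conditionally independent within the component:
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])   # group S_1 (P_1 = 2)
S2 = np.array([[1.5]])                    # group S_2 (P_2 = 1)
Sigma = block_diag(S1, S2)                # Sigma_k = S1 (+) S2
mu = np.array([0.0, 1.0, -1.0])

x = np.array([0.3, 0.7, -0.5])
joint = log_gauss(x, mu, Sigma)
# Conditional independence: the log-density splits across the groups
factored = log_gauss(x[:2], mu[:2], S1) + log_gauss(x[2:], mu[2:], S2)
assert np.isclose(joint, factored)
```

This factorization is what makes the models parsimonious: each component needs only \(\sum_{g} P_{g}(P_{g}+1)/2\) covariance parameters instead of \(P(P+1)/2\).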
References
Baek, J., McLachlan, G.J.: Mixtures of factor analyzers with common factor loadings for the clustering and visualisation of high-dimensional data. Technical report NI08018-SCH, Preprint, Series of the Isaac Newton Institute for Mathematical Sciences, Cambridge (2008)
Baek, J., McLachlan, G.J., Flack, L.: Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualisation of high-dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1298–1309 (2010)
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
Bartholomew, D., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach, 3rd edn. Wiley, Chichester (2011)
Basso, R.M., Lachos, V.H., Barbosa Cabral, C.R., Ghosh, P.: Robust mixture modeling based on scale mixtures of skew-normal distributions. Comput. Stat. Data Anal. 54, 2926–2941 (2010)
Biernacki, C., Govaert, G.: Choosing models in model-based clustering and discriminant analysis. J. Stat. Comput. Simul. 64, 49–71 (1999)
Biernacki, C., Celeux, G., Govaert, G.: Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput. Stat. Data Anal. 41, 561–575 (2003)
Biernacki, C., Celeux, G., Govaert, G., Langrognet, F.: Model-based cluster and discriminant analysis with the MIXMOD software. Comput. Stat. Data Anal. 51, 587–600 (2006)
Böhning, D., Seidel, W.: Editorial: recent developments in mixture models. Comput. Stat. Data Anal. 41, 349–357 (2003)
Böhning, D., Seidel, W., Alfò, M., Garel, B., Patilea, V., Walther, G.: Advances in mixture models. Comput. Stat. Data Anal. 51, 5205–5210 (2007)
Bouveyron, C., Girard, S., Schmid, C.: High-dimensional data clustering. Comput. Stat. Data Anal. 52, 502–519 (2007)
Branco, M.D., Dey, D.K.: A general class of multivariate skew-elliptical distributions. J. Multivar. Anal. 79, 99–113 (2001)
Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28, 781–793 (1995)
Cook, R.D., Weisberg, S.: An Introduction to Regression Graphics. Wiley, New York (1994)
Coretto, P., Hennig, C.: Maximum likelihood estimation of heterogeneous mixtures of Gaussian and uniform distributions. J. Stat. Plan. Inference 141, 462–473 (2011)
Cutler, A., Windham, M.P.: Information-based validity functionals for mixture analysis. In: Bozdogan, H. (ed.) Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, pp. 149–170. Kluwer Academic, Dordrecht (1994)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood for incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–22 (1977)
Dias, J.G.: Latent class analysis and model selection. In: Spilopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds.) From Data and Information Analysis to Knowledge Engineering, pp. 95–102. Springer, Berlin (2006)
Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41, 578–588 (1998)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Fraley, C., Raftery, A.E.: Enhanced software for model-based clustering. J. Classif. 20, 263–286 (2003)
Fraley, C., Raftery, A.E.: MCLUST version 3 for R: normal mixture modeling and model-based clustering. Technical report No. 504, Department of Statistics, University of Washington (2006)
Frank, A., Asuncion, A.: UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA (2010). http://archive.ics.uci.edu/ml
Galimberti, G., Soffritti, G.: Model-based methods to identify multiple cluster structures in a data set. Comput. Stat. Data Anal. 52, 520–536 (2007)
Galimberti, G., Montanari, A., Viroli, C.: Penalized factor mixture analysis for variable selection in clustered data. Comput. Stat. Data Anal. 53, 4301–4310 (2009)
Ghahramani, Z., Hinton, G.E.: The EM algorithm for factor analyzers. Technical report CRG-TR-96-1, University of Toronto (1997)
Gordon, A.D.: Classification, 2nd edn. Chapman & Hall, Boca Raton (1999)
Karlis, D., Santourian, A.: Model-based clustering with non-elliptically contoured distributions. Stat. Comput. 19, 73–83 (2009)
Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)
Keribin, C.: Consistent estimation of the order of mixture models. Sankhyā Ser. A 62, 49–66 (2000)
Lin, T.I.: Maximum likelihood estimation for multivariate skew normal mixture models. J. Multivar. Anal. 100, 257–265 (2009)
Lin, T.I.: Robust mixture modeling using multivariate skew t distributions. Stat. Comput. 20, 343–356 (2010)
Lin, T.I., Lee, J.C., Hsieh, W.J.: Robust mixture modeling using the skew t distribution. Stat. Comput. 17, 81–92 (2007a)
Lin, T.I., Lee, J.C., Yen, S.Y., Shu, Y.: Finite mixture modelling using the skew normal distribution. Stat. Sin. 17, 909–927 (2007b)
Lütkepohl, H.: Handbook of Matrices. Wiley, Chichester (1996)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Le Cam, L.M., Neyman, J. (eds.) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection for clustering with Gaussian mixture models. Technical report RR-6211, Inria, France (2007)
Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal. 53, 3872–3882 (2009a)
Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection for clustering with Gaussian mixture models. Biometrics 65, 701–709 (2009b)
McColl, J.H.: Multivariate Probability. Arnold, London (2004)
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, Chichester (2008)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, Chichester (2000a)
McLachlan, G.J., Peel, D.: Mixtures of factor analyzers. In: Langley, P. (ed.) Proceedings of the Seventeenth International Conference on Machine Learning, pp. 599–606. Morgan Kaufmann, San Francisco (2000b)
McLachlan, G.J., Peel, D., Basford, K.E., Adams, P.: The EMMIX software for the fitting of mixtures of normal and t-components. J. Stat. Softw. 4, 2 (1999)
McLachlan, G.J., Peel, D., Bean, R.W.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41, 379–388 (2003)
McLachlan, G.J., Bean, R.W., Ben-Tovim Jones, L.: Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Comput. Stat. Data Anal. 51, 5327–5338 (2007)
McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18, 285–296 (2008)
McNicholas, P.D., Murphy, T.B., McDaid, A.F., Frost, D.: Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput. Stat. Data Anal. 54, 711–723 (2010)
Melnykov, V., Maitra, R.: Finite mixture models and model-based clustering. Stat. Surv. 4, 80–116 (2010)
Melnykov, V., Melnykov, I.: Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput. Stat. Data Anal. (2011). doi:10.1016/j.csda.2011.11.002
Miloslavsky, M., van der Laan, M.J.: Fitting of mixtures with unspecified number of components using cross validation distance estimate. Comput. Stat. Data Anal. 41, 413–428 (2003)
Montanari, A., Viroli, C.: Heteroscedastic factor mixture analysis. Stat. Model. 10, 441–460 (2010a)
Montanari, A., Viroli, C.: The independent factor analysis approach to latent variable modelling. Statistics 44, 397–416 (2010b)
Peel, D., McLachlan, G.J.: Robust mixture modeling using the t-distribution. Stat. Comput. 10, 339–348 (2000)
R Development Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2010). http://www.R-project.org
Raftery, A.E., Dean, N.: Variable selection for model-based cluster analysis. J. Am. Stat. Assoc. 101, 168–178 (2006)
Ray, S., Lindsay, B.G.: Model selection in high dimensions: a quadratic-risk-based approach. J. R. Stat. Soc. Ser. B 70, 95–118 (2008)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Symposium on Computer Applications and Medical Care, pp. 261–265. IEEE Computer Society, Los Alamitos (1988)
Teicher, H.: Identifiability of mixture models. Ann. Math. Stat. 34, 1265–1269 (1963)
Tipping, M.E., Bishop, C.M.: Mixture of probabilistic principal component analysers. Neural Comput. 11, 443–482 (1999)
Titterington, D.M., Smith, A.F.M., Makov, U.E.: Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester (1985)
Viroli, C.: Dimensionally reduced model-based clustering through mixtures of factor mixture analyzers. J. Classif. 27, 363–388 (2010)
Wang, K., Ng, S.-K., McLachlan, G.J.: Multivariate skew t mixture models: applications to fluorescence-activated cell sorting data. In: Shi, H., Zhang, Y., Bottema, M.J., Lovell, B.C., Maeder, A.J. (eds.) Proceedings of the 2009 Conference of Digital Image Computing: Techniques and Applications, pp. 526–531. IEEE Computer Society, Los Alamitos (2009)
Yakowitz, S.J., Spragins, J.D.: On the identifiability of finite mixtures. Ann. Math. Stat. 39, 209–214 (1968)
Yang, C.C.: Evaluating latent class analysis models in qualitative phenotype identification. Comput. Stat. Data Anal. 50, 1090–1104 (2006)
Yoshida, R., Higuchi, T., Imoto, S.: A mixed factors model for dimension reduction and extraction of a group structure in gene expression data. In: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pp. 161–172 (2004)
Appendix A
A.1 Proof of Theorem 1
In order to prove Theorem 1, one can exploit arguments similar to those used in Maugis et al. (2009b) to prove the identifiability of Gaussian mixture models with irrelevant variables.
Given Eq. (3), both f(.|θ M ) and \(f(.|\boldsymbol{\theta}^{*}_{M^{*}})\) can be written as Gaussian mixture models. Namely,
\[f(\mathbf{x}_{i}|\boldsymbol{\theta}_{M})=\sum_{k=1}^{K}\pi_{k}\,\phi_{P}(\mathbf{\tilde{x}}_{i};\boldsymbol{\mu}_{k},\boldsymbol{\varSigma}_{k}),\]
where \(\phi_{P}\) denotes the P-dimensional Gaussian density, \(\mathbf{\tilde{x}}_{i}=(\mathbf{x}_{i}^{S_{1}\top}, \ldots, \mathbf{x}_{i}^{S_{g}\top}, \ldots, \mathbf{x}_{i}^{S_{G}\top})^{\top}\) is obtained by permuting vector x i so that the P variables are listed according to the order established by (S 1,…,S G ), \(\boldsymbol{\mu}_{k}=(\boldsymbol{\mu}_{k1}^{\top}, \ldots, \boldsymbol{\mu}_{kg}^{\top}, \ldots, \boldsymbol{\mu}_{kG}^{\top})^{\top}\), and \(\boldsymbol{\varSigma}_{k}=\bigoplus_{g=1}^{G} \boldsymbol{\varSigma}_{kg}\). Analogously,
\[f(\mathbf{x}_{i}|\boldsymbol{\theta}^{*}_{M^{*}})=\sum_{k=1}^{K^{*}}\pi^{*}_{k}\,\phi_{P}(\mathbf{\tilde{x}}^{*}_{i};\boldsymbol{\mu}^{*}_{k},\boldsymbol{\varSigma}^{*}_{k}),\]
with \(\mathbf{\tilde{x}}^{*}_{i}\), \(\boldsymbol{\mu}^{*}_{k}\) and \(\boldsymbol{\varSigma}^{*}_{k}\) defined with respect to the partition \((S_{1}^{*}, \ldots, S_{G^{*}}^{*})\).
Then, since the pairs (μ k ,Σ k ), k=1,…,K, are distinct, as are the pairs \((\boldsymbol{\mu}_{k}^{*},\boldsymbol{\varSigma}_{k}^{*})\), k=1,…,K ∗, the identifiability of Gaussian mixture models implies that K=K ∗ and, up to a permutation of the mixture components and of the elements of x i , \(\pi_{k}=\pi_{k}^{*}\), \(\boldsymbol{\mu}_{k}=\boldsymbol{\mu}_{k}^{*}\) and \(\boldsymbol{\varSigma}_{k}=\boldsymbol{\varSigma}_{k}^{*}\) (see, for example, Yakowitz and Spragins 1968).
To complete the proof, it is now proven by contradiction that, under constraint (5), G=G ∗ and each element of (S 1,…,S G ) coincides with exactly one element of \((S_{1}^{*}, \ldots, S_{G^{*}}^{*})\).
Consider S g ∈(S 1,…,S G ). Since both (S 1,…,S G ) and \((S_{1}^{*}, \ldots, S_{G^{*}}^{*})\) are partitions of the variable index set \(\mathcal{I}\), there exists at least one \(S_{h}^{*} \in (S_{1}^{*}, \ldots, S_{G^{*}}^{*})\) such that \(S_{g} \cap S_{h}^{*} \neq \emptyset\). Let \(s=S_{g} \cap S_{h}^{*}\), \(t=S_{g}^{c} \cap S_{h}^{*}\), and \(\bar{s}=S_{g} \cap S_{h}^{*c}\).
Suppose that t≠∅. Since t∩S g =∅, model M=(G,K,S 1,…,S G ) implies that Σ k,ts =0 ts ∀k. Due to the identifiability of Gaussian mixture models, this in turn implies that \(\boldsymbol{\varSigma}^{*}_{k,ts}=\boldsymbol{\varSigma}^{*}_{kh,ts}=\mathbf{0}_{ts}\) ∀k, which contradicts constraint (5). Thus, t=∅.
Analogously, suppose now that \(\bar{s} \neq \emptyset\). Since \(\bar{s} \cap S_{h}^{*} = \emptyset\), model \(M^{*}=(G^{*}, K^{*}, S_{1}^{*}, \ldots, S_{G^{*}}^{*})\) implies that \(\boldsymbol{\varSigma}^{*}_{k,s\bar{s}}=\mathbf{0}_{s\bar{s}}\) ∀k. Together with the identifiability of Gaussian mixture models, this also implies that \(\boldsymbol{\varSigma}_{k,s\bar{s}}=\boldsymbol{\varSigma}_{kg,s\bar{s}}=\mathbf{0}_{s\bar{s}}\) ∀k, which again contradicts constraint (5). Hence, \(\bar{s} = \emptyset\).
These two results imply that, under the constraint (5), \(S_{g} \cap S_{h}^{*} \neq \emptyset\) ⇒ \(S_{g} = S_{h}^{*}\). Thus, since S g and \(S_{h}^{*}\) belong to two partitions of the variable index set \(\mathcal{I}\), there exists a one-to-one correspondence between the two partitions, and G=G ∗.
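The argument above shows that, under constraint (5), the partition is determined by the common zero pattern of the component variance matrices. As an informal numerical companion (the matrices are hypothetical and `finest_partition` is an illustrative helper, not part of the paper), the finest admissible partition can be recovered as the connected components of the union of the components' nonzero patterns:

```python
import numpy as np

def finest_partition(Sigmas, tol=1e-12):
    """Finest variable partition (S_1,...,S_G) such that every component
    covariance is block diagonal with respect to it: the connected
    components of the union of the nonzero patterns."""
    P = Sigmas[0].shape[0]
    adj = np.zeros((P, P), dtype=bool)
    for S in Sigmas:
        adj |= np.abs(S) > tol
    # connected components via a simple depth-first search
    seen, groups = set(), []
    for v in range(P):
        if v in seen:
            continue
        comp, stack = [], [v]
        seen.add(v)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in np.flatnonzero(adj[u]):
                if int(w) not in seen:
                    seen.add(int(w))
                    stack.append(int(w))
        groups.append(sorted(comp))
    return groups

# Two components sharing the block structure {1,2},{3} (hypothetical values):
S_a = np.array([[1.0, 0.4, 0.0], [0.4, 2.0, 0.0], [0.0, 0.0, 1.0]])
S_b = np.array([[2.0, -0.3, 0.0], [-0.3, 1.0, 0.0], [0.0, 0.0, 3.0]])
print(finest_partition([S_a, S_b]))  # → [[0, 1], [2]]
```

Any coarser partition also yields block diagonal matrices; constraint (5) rules the coarser ones out, which is what makes the partition unique.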
A.2 Proof of Corollary 1
In order to prove Corollary 1 it suffices to note that, under constraint (5), the parameter space of models in \(\mathcal{M}_{I}\) is a subset of \(\varTheta_{(G, K, S_{1}, \ldots, S_{G})}\); hence, Theorem 1 also holds for this subclass of models.
A.3 Proof of Theorem 2
A proof of Theorem 2 can be obtained by exploiting some results from Maugis et al. (2007) and by suitably modifying the proof of the consistency of the BIC criterion in selecting relevant variables for clustering with Gaussian mixture models (Maugis et al. 2009b).
Since the true number of mixture components K 0 is assumed to be known, by adapting the notation previously introduced to this assumption, let now M=(G,K 0,S 1,…,S G ), \(M^{0} = (G^{0}, K^{0}, S^{0}_{1}, \ldots, S^{0}_{G^{0}})\), and
\[\hat{M}=\mathop{\arg\max}_{M \in \mathcal{M}'} \mathit{BIC}(M),\]
where \(\mathcal{M}' \subset \mathcal{M}\) is the subclass of the models obtainable from Eq. (4) with K=K 0. Furthermore, consider Δ BIC(M)=BIC(M 0)−BIC(M); after some straightforward algebra, using \(\mathit{BIC}(M)=2 l_{n}(\boldsymbol{\hat{\theta}}_{M})-\lambda_{M}\ln(n)\) with \(l_{n}(\boldsymbol{\hat{\theta}}_{M})\) the maximized log-likelihood of model M, it is possible to write
\[\Delta_{\mathit{BIC}(M)}=2n \bigl(\mathbb{D}_{nM^{0}}-\mathbb{D}_{nM}\bigr)+\gamma_{M}\ln(n), \tag{8}\]
where \(\gamma_{M}=\lambda_{M}-\lambda_{M^{0}}\), with λ M and \(\lambda_{M^{0}}\) denoting the number of free parameters of models M and M 0, respectively, and
\[\mathbb{D}_{nM}=\frac{1}{n}\sum_{i=1}^{n}\ln\frac{f(\mathbf{x}_{i}|\boldsymbol{\hat{\theta}}_{M})}{h(\mathbf{x}_{i})},\]
with h denoting the true density of the data.
Since \(P (\hat{M}=M^{0} ) = P (\Delta_{\mathit{BIC}(M)} \geq 0, \forall M \in \mathcal{M}' )\) and \(\mathcal{M}'\) is finite, in order to prove Theorem 2 we have to show that
\[P \bigl(\Delta_{\mathit{BIC}(M)} < 0 \bigr) \mathop{\longrightarrow}_{n \rightarrow \infty} 0 \quad \forall M \in \mathcal{M}'. \tag{9}\]
Note that, when M=M 0, we have \(\Delta_{\mathit{BIC}(M^{0})}=0\), thus \(P (\Delta_{\mathit{BIC}(M^{0})} < 0 )=0\) ∀n.
Consider now M≠M 0 and, to ease the reading of this proof, let \(D_{M}=-KL[h,f(\cdot|\boldsymbol{\breve{\theta}}_{M})]\), and \(T_{nM}= \mathbb{D}_{nM^{0}}-\mathbb{D}_{nM}+\frac{\gamma_{M}\ln(n)}{2n}\). Given Eq. (8), it is possible to write P(Δ BIC(M)<0)= P(T nM <0). This probability is also equal to
\[P \Bigl( \bigl(\mathbb{D}_{nM^{0}}-D_{M^{0}}\bigr)-\bigl(\mathbb{D}_{nM}-D_{M}\bigr)+\frac{\gamma_{M}\ln(n)}{2n} < D_{M}-D_{M^{0}} \Bigr).\]
According to Lemma 5 in Maugis et al. (2007), the following holds ∀ϵ>0:
\[P \bigl( \bigl|\mathbb{D}_{nM^{0}}-D_{M^{0}}\bigr| \geq \epsilon \bigr) \mathop{\longrightarrow}_{n \rightarrow \infty} 0.\]
According to Proposition 1 (see below), we also have \(\mathbb{D}_{nM} \stackrel{P}{\rightarrow} D_{M}\) \(\forall M \in \mathcal{M}'\). Thus, ∀ϵ>0,
\[P \bigl( \bigl|\mathbb{D}_{nM}-D_{M}\bigr| \geq \epsilon \bigr) \mathop{\longrightarrow}_{n \rightarrow \infty} 0.\]
Furthermore, according to the assumption (H1), \(D_{M^{0}}=0\), and −D M >0 since M≠M 0. Then, for every ϵ>0 such that \(-D_{M}-2\epsilon+\frac{\gamma_{M}\ln(n)}{2n} \geq 0\),
\[P (T_{nM}<0 ) \leq P \bigl( \bigl|\mathbb{D}_{nM^{0}}-D_{M^{0}}\bigr| \geq \epsilon \bigr) + P \bigl( \bigl|\mathbb{D}_{nM}-D_{M}\bigr| \geq \epsilon \bigr).\]
Taking \(\epsilon = \frac{-D_{M}}{4}\), since \(\frac{\gamma_{M}\ln(n)}{2n} \operatorname{\longrightarrow}\limits_{n \rightarrow \infty} 0\) we also obtain
\[P (T_{nM}<0 ) \mathop{\longrightarrow}_{n \rightarrow \infty} 0.\]
These results imply (9), thus proving the theorem.
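A minimal numerical sketch of this consistency result for the simplest case K 0=1 with P=2, comparing the candidate partitions {1,2} and ({1},{2}). It assumes the convention \(\mathit{BIC}(M)=2 l_{n}(\boldsymbol{\hat{\theta}}_{M})-\lambda_{M}\ln(n)\), and the data are whitened so that the sample cross-correlation is exactly zero (i.e. they match the block model M 0 perfectly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, P = 20_000, 2
X = rng.standard_normal((n, P))
X -= X.mean(axis=0)
# whiten so the sample covariance is exactly the identity: the data then
# carry no cross-group correlation, matching M0 = ({1},{2}) perfectly
L = np.linalg.cholesky(np.cov(X, rowvar=False, bias=True))
X = X @ np.linalg.inv(L).T

def gauss_bic(X, Sigma_hat, n_par):
    """BIC = 2*loglik - n_par*ln(n) for a single Gaussian at its MLE."""
    n, P = X.shape
    _, logdet = np.linalg.slogdet(Sigma_hat)
    # at the (restricted) MLE the quadratic term reduces to n*P
    loglik = -0.5 * n * (P * np.log(2 * np.pi) + logdet + P)
    return 2 * loglik - n_par * np.log(n)

S_full = np.cov(X, rowvar=False, bias=True)     # model M:  one group {1,2}
S_block = np.diag(np.diag(S_full))              # model M0: groups {1},{2}
bic_full = gauss_bic(X, S_full, n_par=2 + 3)    # 2 means + 3 covariance terms
bic_block = gauss_bic(X, S_block, n_par=2 + 2)  # 2 means + 2 variances
delta = bic_block - bic_full
assert np.isclose(delta, np.log(n)) and delta > 0
```

Here the log-likelihood gain of the unrestricted model is numerically zero, so Δ BIC(M) reduces to the penalty term γ M ln(n)=ln(n)>0 and BIC selects the true block diagonal structure.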
A.4 Proposition 1
Under assumptions (H1) and (H2) the following convergence holds \(\forall M \in \mathcal{M}'\):
\[\mathbb{D}_{nM} \stackrel{P}{\longrightarrow} D_{M}. \tag{10}\]
Proof
According to (H2), \(\varTheta'_{M}\) is a compact metric space, and ln[f(x|θ M )] is a continuous function of θ M ∀x∈ℝP. Furthermore, it is possible to show that there exists an envelope function H for the family \(\mathcal{H}_{M}=\{\ln[f(\cdot|\boldsymbol{\theta}_{M})]; \boldsymbol{\theta}_{M} \in \varTheta'_{M}\}\) which is h-integrable. This latter result can be proved as follows.
Since Σ kg is positive definite, \(\|\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg}\|^{2}_{\boldsymbol{\varSigma}_{kg}^{-1}}\geq 0\), where \(\|\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg}\|^{2}_{\boldsymbol{\varSigma}_{kg}^{-1}}= (\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg} )^{\top}\boldsymbol{\varSigma}_{kg}^{-1} (\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg} )\). Furthermore, \(|\boldsymbol{\varSigma}_{kg}|^{-\frac{1}{2}} \leq a^{-\frac{P_{g}}{2}}\) (see Maugis et al. 2007, Lemma 3). Writing f(x|θ M )= \(\sum_{k=1}^{K} \pi_{k} g(\mathbf{x}|\boldsymbol{\vartheta}_{k})\), where
\[g(\mathbf{x}|\boldsymbol{\vartheta}_{k})=\prod_{g=1}^{G}(2\pi)^{-\frac{P_{g}}{2}}|\boldsymbol{\varSigma}_{kg}|^{-\frac{1}{2}}\exp\Bigl(-\tfrac{1}{2}\|\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg}\|^{2}_{\boldsymbol{\varSigma}_{kg}^{-1}}\Bigr),\]
with ϑ k =(μ k1,…,μ kG ,Σ k1,…,Σ kG ), and recalling that \(\sum_{k=1}^{K} \pi_{k} =1\), the following upper bound of ln[f(x|θ M )] holds:
\[\ln[f(\mathbf{x}|\boldsymbol{\theta}_{M})] \leq -\frac{P}{2}\ln(2\pi)-\frac{P}{2}\ln(a).\]
To shorten the following equations, let \(d^{2}_{kg}=\|\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg}\|^{2}_{\boldsymbol{\varSigma}_{kg}^{-1}}\). Using the concavity of the logarithm function we obtain:
\[\ln[f(\mathbf{x}|\boldsymbol{\theta}_{M})] \geq \sum_{k=1}^{K}\pi_{k}\ln[g(\mathbf{x}|\boldsymbol{\vartheta}_{k})] = \sum_{k=1}^{K}\pi_{k}\sum_{g=1}^{G}\Bigl[-\frac{P_{g}}{2}\ln(2\pi)-\frac{1}{2}\ln|\boldsymbol{\varSigma}_{kg}|-\frac{d^{2}_{kg}}{2}\Bigr].\]
Since \(\boldsymbol{\mu}_{kg} \in \mathcal{B}(\eta,P_{g})\), using Lemma 3 in Maugis et al. (2007) it is possible to write:
\[d^{2}_{kg} \leq \frac{1}{a}\|\mathbf{x}^{S_{g}}-\boldsymbol{\mu}_{kg}\|^{2} \leq \frac{2}{a}\bigl(\|\mathbf{x}^{S_{g}}\|^{2}+\eta^{2}\bigr).\]
Furthermore, since \(|\boldsymbol{\varSigma}_{kg}| \leq b^{P_{g}}\) (see Maugis et al. 2007, Lemma 3), the lower bound of ln[f(x|θ M )] is given by:
\[\ln[f(\mathbf{x}|\boldsymbol{\theta}_{M})] \geq -\frac{P}{2}\ln(2\pi)-\frac{P}{2}\ln(b)-\frac{1}{a}\bigl(\|\mathbf{x}\|^{2}+G\eta^{2}\bigr).\]
Thus, each function of the family \(\mathcal{H}_{M}\) is bounded, for all \(\boldsymbol{\theta}_{M} \in \varTheta'_{M}\) and all x∈ℝP, by
\[\bigl|\ln[f(\mathbf{x}|\boldsymbol{\theta}_{M})]\bigr| \leq C_{1}(a,b,P,G,\eta)+C_{2}(a)\|\mathbf{x}\|^{2}=H(\mathbf{x}),\]
defining the envelope function H, where C 1(a,b,P,G,η) and C 2(a) are two positive constants.
The h-integrability of this function can be proved by showing that ∫∥x∥2 h(x)d x<∞, where the required inequalities are obtained using Lemmas 3 and 4 in Maugis et al. (2007) and assumption (H2).
Hence, according to Proposition 2 in Maugis et al. (2007),
\[\sup_{\boldsymbol{\theta}_{M} \in \varTheta'_{M}}\Bigl|\frac{1}{n}\sum_{i=1}^{n}\ln[f(\mathbf{X}_{i}|\boldsymbol{\theta}_{M})]-\mathbb{E}_{\mathbf{X}}\bigl\{\ln[f(\mathbf{X}|\boldsymbol{\theta}_{M})]\bigr\}\Bigr| \stackrel{P}{\longrightarrow} 0. \tag{11}\]
Then, since \(\ln(h) \in \mathcal{H}_{M^{0}}\), we have \(\mathbb{E}_{\mathbf{X}}[|\ln h(\mathbf{X})|] \leq \mathbb{E}_{\mathbf{X}}[H(\mathbf{X})]<\infty\). Thus, according to the law of large numbers,
\[\frac{1}{n}\sum_{i=1}^{n}\ln[h(\mathbf{X}_{i})] \stackrel{a.s.}{\longrightarrow} \mathbb{E}_{\mathbf{X}}[\ln h(\mathbf{X})]. \tag{12}\]
Convergences (11) and (12) imply (10), thus proving the proposition. □
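The envelope bound at the core of this proof can be checked numerically. In the sketch below the mixture parameters are hypothetical, chosen so that the eigenvalues of each covariance block lie in [a,b] and the means lie in the ball of radius η; the explicit constants C1 and C2 are one admissible choice derived from the bounds above (the proof only asserts their existence):

```python
import numpy as np

# Hypothetical mixture respecting (H2): eigenvalues of each Sigma_kg in
# [a, b], component means inside the ball B(eta, P_g).
a, b, eta = 0.5, 4.0, 2.0
P, G = 2, 1                                   # one group of two variables
pis = np.array([0.3, 0.7])
mus = [np.array([1.0, -1.0]), np.array([0.0, 0.5])]      # norms <= eta
Sigmas = [np.diag([0.5, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]])]

def log_f(x):
    """Exact mixture log-density ln f(x | theta_M)."""
    dens = 0.0
    for pi, mu, S in zip(pis, mus, Sigmas):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)
        dens += pi * np.exp(-0.5 * (P * np.log(2 * np.pi) + logdet
                                    + d @ np.linalg.solve(S, d)))
    return np.log(dens)

# Envelope from the proof: |ln f(x)| <= C1 + C2 * ||x||^2
C1 = (0.5 * P * np.log(2 * np.pi)
      + 0.5 * P * max(abs(np.log(a)), abs(np.log(b)))
      + G * eta ** 2 / a)
C2 = 1.0 / a
for x in np.random.default_rng(1).uniform(-10, 10, size=(200, P)):
    assert abs(log_f(x)) <= C1 + C2 * (x @ x)
```

The quadratic growth of the envelope in ‖x‖ is what makes the second-moment condition ∫∥x∥2 h(x)d x<∞ sufficient for h-integrability.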
A.5 Proposition 2
Let Σ 1,…,Σ g ,…,Σ G be G real, symmetric, positive definite matrices, whose dimensions are P g ×P g for g=1,…,G, and let \(\boldsymbol{\varSigma} = \bigoplus_{g=1}^{G} \boldsymbol{\varSigma}_{g}\). Furthermore, let \(\boldsymbol{\varSigma}_{g}= \lambda_{g} \mathbf{D}_{g}\mathbf{A}_{g}\mathbf{D}_{g}^{\top}\), where \(\lambda_{g}=|\boldsymbol{\varSigma}_{g}|^{1/P_{g}}\), D g is the matrix of orthonormal eigenvectors of Σ g , and A g is the diagonal matrix containing the eigenvalues of Σ g (normalized in such a way that |A g |=1). Then, Σ=λ DAD ⊤, where \(\lambda= \prod_{g=1}^{G}\lambda_{g}^{\frac{P_{g}}{P}}\), \(\mathbf{D}=\bigoplus_{g=1}^{G} \mathbf{D}_{g}\), and \(\mathbf{A}= \bigoplus_{g=1}^{G} \frac{\lambda_{g}}{\prod_{g=1}^{G}\lambda_{g}^{\frac{P_{g}}{P}}}\mathbf{A}_{g}\).
Proof
Consider the spectral decomposition of Σ g : \(\boldsymbol{\varSigma}_{g}=\mathbf{D}_{g}\mathbf{L}_{g}\mathbf{D}_{g}^{\top}\), where L g is the diagonal matrix containing the eigenvalues of Σ g , for g=1,…,G. Then, L g =λ g A g .
By properties of the direct sum operator, the following results hold:
1. \(|\boldsymbol{\varSigma}|=\prod_{g=1}^{G}|\boldsymbol{\varSigma}_{g}|=\prod_{g=1}^{G}\lambda_{g}^{P_{g}}\) (see, for example, Lütkepohl 1996, p. 22);
2. Σ=DLD ⊤, where \(\mathbf{D}=\bigoplus_{g=1}^{G} \mathbf{D}_{g}\) and \(\mathbf{L}=\bigoplus_{g=1}^{G} \mathbf{L}_{g}\) (see Lütkepohl 1996, p. 66).
Hence, \(|\boldsymbol{\varSigma}|^{\frac{1}{P}}=\prod_{g=1}^{G}\lambda_{g}^{\frac{P_{g}}{P}}=\lambda\), \(\frac{1}{|\boldsymbol{\varSigma}|^{\frac{1}{P}}}\mathbf{L}=\) \(\frac{1}{|\boldsymbol{\varSigma}|^{\frac{1}{P}}}\bigoplus_{g=1}^{G} \mathbf{L}_{g}\) \(=\bigoplus_{g=1}^{G} \frac{\lambda_{g}}{\prod_{g=1}^{G}\lambda_{g}^{\frac{P_{g}}{P}}}\mathbf{A}_{g}=\mathbf{A}\), thus proving the proposition. □
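Proposition 2 lends itself to a direct numerical check. The sketch below (with hypothetical blocks Σ 1 and Σ 2) computes the normalized spectral decomposition of each block, assembles λ, D and A as in the statement, and verifies that Σ=λ DAD ⊤ with |A|=1:

```python
import numpy as np

def normalized_spectral(S):
    """Return (lambda_g, D_g, A_g) with S = lambda_g * D_g A_g D_g^T, |A_g|=1."""
    evals, D = np.linalg.eigh(S)
    lam = np.prod(evals) ** (1.0 / len(evals))   # |Sigma_g|^(1/P_g)
    A = np.diag(evals / lam)                     # |A_g| = 1 by construction
    return lam, D, A

def direct_sum(*blocks):
    P = sum(b.shape[0] for b in blocks)
    out = np.zeros((P, P))
    i = 0
    for b in blocks:
        d = b.shape[0]
        out[i:i + d, i:i + d] = b
        i += d
    return out

# Hypothetical blocks (P_1 = 2, P_2 = 1):
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[3.0]])
Ps = np.array([2, 1])
P = Ps.sum()

(l1, D1, A1), (l2, D2, A2) = normalized_spectral(S1), normalized_spectral(S2)
lam = l1 ** (Ps[0] / P) * l2 ** (Ps[1] / P)      # lambda = prod lambda_g^(P_g/P)
D = direct_sum(D1, D2)
A = direct_sum((l1 / lam) * A1, (l2 / lam) * A2)

Sigma = direct_sum(S1, S2)
assert np.allclose(Sigma, lam * D @ A @ D.T)     # Sigma = lambda D A D^T
assert np.isclose(np.linalg.det(A), 1.0)         # |A| = 1
```

This decomposition is what allows the volume, orientation and shape constraints of Celeux and Govaert (1995) to be imposed blockwise on the component variance matrices.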
Galimberti, G., Soffritti, G. Using conditional independence for parsimonious model-based Gaussian clustering. Stat Comput 23, 625–638 (2013). https://doi.org/10.1007/s11222-012-9336-6