Abstract
The Gaussian mixture model (GMM) is a popular tool for multivariate analysis, in particular, cluster analysis. The expectation–maximization (EM) algorithm is generally used to perform maximum likelihood (ML) estimation for GMMs due to the M-step existing in closed form and its desirable numerical properties, such as monotonicity. However, the EM algorithm has been criticized as being slow to converge and thus computationally expensive in some situations. In this article, we introduce the linear regression characterization (LRC) of the GMM. We show that the parameters of an LRC of the GMM can be mapped back to the natural parameters, and that a minorization–maximization (MM) algorithm can be constructed, which retains the desirable numerical properties of the EM algorithm, without the use of matrix operations. We prove that the ML estimators of the LRC parameters are consistent and asymptotically normal, like their natural counterparts. Furthermore, we show that the LRC allows for simple handling of singularities in the ML estimation of GMMs. Using numerical simulations in the R programming environment, we then demonstrate that the MM algorithm can be faster than the EM algorithm in various large data situations, where sample sizes range in the tens to hundreds of thousands and for estimating models with up to 16 mixture components on multivariate data with up to 16 variables.
Similar content being viewed by others
References
Amemiya T (1985) Advanced econometrics. Harvard University Press, Cambridge
Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley, New York
Andrews JL, McNicholas PD (2013) Using evolutionary algorithms for model-based clustering. Pattern Recognit Lett 34:987–992
Atienza N, Garcia-Heras J, Munoz-Pichardo JM, Villa R (2007) On the consistency of MLE in finite mixture models of exponential families. J Stat Plan Inference 137:496–505
Becker MP, Yang I, Lange K (1997) EM algorithms without missing data. Stat Methods Med Res 6:38–54
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Botev Z, Kroese DP (2004) Global likelihood optimization via the cross-entropy method with an application to mixture models. In: Proceedings of the 36th conference on winter simulation
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332
Clarke B, Fokoue E, Zhang HH (2009) Principles and theory for data mining and machine learning. Springer, New York
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York
Ganesalingam S, McLachlan GJ (1980) A comparison of the mixture and classification approaches to cluster analysis. Commun Stat Theory Methods 9:923–933
Greselin F, Ingrassia S (2008) A note on constrained EM algorithms for mixtures of elliptical distributions. Advances in data analysis, data handling and business intelligence In: Proceedings of the 32nd annual conference of the German classification society. vol 53
Hartigan JA (1985) Statistical theory in clustering. J Classif 2:63–76
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New York
Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800
Hunter DR, Lange K (2004) A tutorial on MM algorithms. Am Stat 58:30–37
Ingrassia S (1991) Mixture decomposition via the simulated annealing algorithm. Appl Stoch Models Data Anal 7:317–325
Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13:151–166
Ingrassia S, Rocci R (2007) Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Comput Stat Data Anal 51:5339–5351
Ingrassia S, Rocci R (2011) Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Comput Stat Data Anal 55:1714–1725
Ingrassia S, Minotti SC, Vittadini G (2012) Local statistical modeling via a cluster-weighted approach with elliptical distributions. J Classif 29:363–401
Ingrassia S, Minotti SC, Punzo A (2014) Model-based clustering via linear cluster-weighted models. Comput Stat Data Anal 71:159–182
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31:651–666
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323
Jennrich RI (1969) Asymptotic properties of non-linear least squares estimators. Ann Math Stat 40:633–643
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkley symposium on mathematical statistics and probability, University of California press, 281–297
McLachlan GJ (1982) The classification and mixture maximum likelihood approaches to cluster analysis. In: Krishnaiah PR, Kanal L (eds) Handbook of statistics, vol 2. North-Holland, Amsterdam
McLachlan GJ, Basford KE (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions. Wiley, New York
Pernkopf F, Bouchaffra D (2005) Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Trans Pattern Anal Mach Intell 27:1344–1348
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Razaviyayn M, Hong M, Luo ZQ (2013) A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J Optim 23:1126–1153
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239
Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
Seber GAF (2008) A matrix handbook for statisticians. Wiley, New York
Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New York
Zhou H, Lange K (2010) Mm algorithms for some discrete multivariate distributions. J Comput Graph Stat 19:645–665
Author information
Authors and Affiliations
Corresponding author
Appendix
Appendix
1.1 Proof of Theorem 1
We shall show the result by construction. Firstly, set
followed by
and
for each \(k=2,\ldots ,d\), in order, to get
and \(\sigma _{k}^{2}=\varSigma _{k|1:k-1}\).
Now, by Lemma 1, and by definition of conditional densities,
for all \({\varvec{x}}\in \mathbb {R}^{d}\), which implies \(\lambda ({\varvec{x}}; {\varvec{\gamma }}, {\varvec{\sigma }}^{2})=\phi _{d}({\varvec{x}}; {\varvec{\mu }}, {\varvec{\varSigma }})\) by application of the mappings (22)–(25). Note that \({\varvec{\mu }}\) and \(\hbox {vech}({\varvec{\varSigma }})\), and \({\varvec{\gamma }}\) and \({\varvec{\sigma }}^{2}\) have equal numbers of elements, and (22)–(25) are unique for each k. Thus, there is an injective mapping between the LRC and the natural parameters. The inverse mapping can also be constructed by setting
followed by
and
for each \(k=2,\ldots ,d\), in order. The mappings (26)–(29) are also unique for each k, and thus constitutes a surjective mapping.
1.2 Proof of Theorem 2
The first and last inequalities of (19) and (20) are due to the definition of minorization [i.e. (11) and (14) are of forms (9) and (10), respectively]. The middle inequality of (19) is due to the concavity of \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\). This can be shown by firstly noting that
is concave in \(\log (\pi _{i})\) since \(1-\sum _{i=1}^{g-1}\exp [\log (\pi _{i})]\) is concave and \(\log \) is an increasing concave function. Secondly, note that
is negative, and thus \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is concave with respect to each \(\beta _{i,k,l}\) for each i, k and \(l=0,\ldots ,k-1\). Thus, \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) is the additive composition of concave functions and is therefore concave with respect to a bijection of \({\varvec{\psi }}_{1}\). Furthermore, the system of equations
for \(i=1,\ldots ,g-1\), has a unique root that is equivalent to update (16), which always satisfies the positivity restrictions on each \(\pi _{i}\).
The middle inequality of (20) is due to the concavity of \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\). This can be shown by noting that
is concave in \(\log \sigma _{i,k}^{2}\) for each i and k, since the inverse of \(\exp (x)\) is convex. Thus, \(Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})\) is concave with respect to a bijection of \({\varvec{\psi }}_{2}.\) Furthermore, the system of equations
has a unique root that is equivalent to update (18).
1.3 Proof of Theorem 3
This result follows from part (a) of Theorem 2 from Razaviyayn et al. (2013), which assumes that \(Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})\) and \(Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})\) both satisfy the definition of a minorizer, and are quasi-concave and have unique critical points, with respect to the parameters \({\varvec{\psi }}_{1}\) and \({\varvec{\psi }}_{2}\), respectively.
Firstly, the definition of a minorizer is satisfied via construction [i.e. (11) and (14) are of forms (9) and (10), respectively]. Secondly, from the proof of Theorem 2, both functions are concave with respect to some bijective mappings, and are therefore quasi-concave under said mappings [see Section 3.4 of Boyd and Vandenberghe (2004) regarding quasi-concavity]. Finally, since both functions are concave with respect to some bijective mapping, the critical points obtained must be unique.
1.4 Proof of Theorem 4
We show this result via induction. Firstly, using (26), we see that \(\sigma _{1}^{2}=\det (\varSigma _{1,1})>0\) is the first leading principal minor of \({\varvec{\varSigma }}\), and is positive. Now, by definition of (23), \(\sigma _{2}^{2}\) is the Schur complement of \({\varvec{\varSigma }}_{1:k,1:k}\), for \(k=2\), where
Since \(\sigma _{2}^{2}\) is positive and \(\varSigma _{1,1}\) is positive definite, we have the result that
via the partitioning of the determinant. Thus, \({\varvec{\varSigma }}_{1:2,1:2}\) is also positive definite because both the first and second leading principal minors are positive.
Now, for each \(k=3,\ldots ,d\), we assume that \({\varvec{\varSigma }}_{1:k-1,1:k-1}\) is positive-definite. Since \(\sigma _{k}^{2}>0\) is the Schur complement of the partitioning (30), we have the result that
Thus, the \(k\hbox {th}\) leading principal minor is positive, for all k. The result follows by the property of positive-definite matrices; see Chapters 10 and 14 of Seber (2008) for all relevant matrix results.
1.5 Proof of Theorem 5
Theorem 5 can be established from Theorem 4.1.2 of Amemiya (1985), which requires the validation of the assumptions,
-
A1
The parameter space \(\varPsi \) is an open subset of some Euclidean space.
-
A2
The log-likelihood \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\) is a measurable function for all \({\varvec{\psi }}\in \varPsi \), \(\partial (\log \mathcal {L}_{R,n}({\varvec{\psi }}))/\partial {\varvec{\psi }}\) exist and is continuous in an open neighborhood \(N_{1}({\varvec{\psi }}^{0})\) of \({\varvec{\psi }}^{0}\).
-
A3
There exists an open neighborhood \(N_{2}({\varvec{\psi }}^{0})\) of \({\varvec{\psi }}^{0}\), where \(n^{-1}\log \mathcal {L}_{R,n}({\varvec{\psi }})\) converges to \(\mathbb {E}[\log f_{R}({\varvec{X}}; {\varvec{\psi }})]\) in probability uniformly in \({\varvec{\psi }}\) in any compact subset of \(N_{2}({\varvec{\psi }}^{0})\).
Assumptions A1, and A2 are fulfilled by noting that the parameter space \(\varPsi =(0,1)^{g-1}\times \mathbb {R}^{g(d^{2}+d)/2+gd}\) is an open subset of \(\mathbb {R}^{(g-1)+g(d^{2}+d)/2+gd}\), and that \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\) is smooth with respect to the parameters \({\varvec{\psi }}\). Using Theorem 2 of Jennrich (1969), we can show that A3 holds by verifying that
where \(\bar{N}\) is a compact subset of \(N_{2}({\varvec{\psi }}^{0})\). Since \(f_{R}({\varvec{X}}; {\varvec{\psi }})\) is smooth, this is equivalent to showing that \(\mathbb {E}|f_{R}({\varvec{X}}; {\varvec{\psi }})|<\infty \), for any fixed \({\varvec{\psi }}\in \bar{N}\). This is achieved by noting that
The inequality on line 3 of (32) is due to Lemma 1 of Atienza et al. (2007). Considering that \(\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})\) is a polynomial function of Gaussian random variables, we have \(\mathbb {E}|\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})|<\infty \) for each i and k. The result then follows.
1.6 Proof of Theorem 6
Theorem 6 can be established from Theorem 4.2.4 of Amemiya (1985), which requires the validation of the assumptions,
-
B1
The Hessian \(\partial ^{2}(\log \mathcal {L}_{R,n}({\varvec{\psi }}))/ \partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}\) exists and is continuous in an open neighborhood of \({\varvec{\psi }}^{0}\).
-
B2
The equations
$$\begin{aligned} \int \frac{\partial \log f_{R}({\varvec{\psi }})}{\partial {\varvec{\psi }}}\hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$and
$$\begin{aligned} \int \frac{\partial ^{2}\log f_{R}({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}} \hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$hold, for any \({\varvec{\psi }}\in \varPsi \).
-
B3
The averaged Hessian satisfies
$$\begin{aligned} \frac{1}{n}\frac{\partial ^{2}\log \mathcal {L}_{R,n} ({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}}\overset{P}{\rightarrow }\mathbb {E} \left[ \frac{\partial ^{2}\log f_{R}({\varvec{X}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\right] , \end{aligned}$$uniformly in \({\varvec{\psi }}\), in all compact subsets of an open neighborhood of \({\varvec{\psi }}^{0}\).
-
B4
The Fisher information
$$\begin{aligned} -\mathbb {E}\left[ \frac{\partial ^{2}\log f_{R} ({\varvec{x}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\bigg |_{{\varvec{\psi }}= {\varvec{\psi }}^{0}}\right] ^{-1}, \end{aligned}$$is positive-definite.
Assumption B1 is validated via the smoothness of \(\log \mathcal {L}_{R,n}({\varvec{\psi }})\), and it is mechanical to check the validity of B2. Assumption B3 can be shown via Theorem 2 of Jennrich (1969). Unlike the others, B4 must be taken as given.
Rights and permissions
About this article
Cite this article
Nguyen, H.D., McLachlan, G.J. Maximum likelihood estimation of Gaussian mixture models without matrix operations. Adv Data Anal Classif 9, 371–394 (2015). https://doi.org/10.1007/s11634-015-0209-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11634-015-0209-7