Maximum likelihood estimation of Gaussian mixture models without matrix operations

Nguyen, Hien D.; McLachlan, Geoffrey J.

doi:10.1007/s11634-015-0209-7

Regular Article
Published: 05 June 2015

Volume 9, pages 371–394, (2015)
Advances in Data Analysis and Classification

Hien D. Nguyen¹ &
Geoffrey J. McLachlan¹

Abstract

The Gaussian mixture model (GMM) is a popular tool for multivariate analysis, in particular, cluster analysis. The expectation–maximization (EM) algorithm is generally used to perform maximum likelihood (ML) estimation for GMMs due to the M-step existing in closed form and its desirable numerical properties, such as monotonicity. However, the EM algorithm has been criticized as being slow to converge and thus computationally expensive in some situations. In this article, we introduce the linear regression characterization (LRC) of the GMM. We show that the parameters of an LRC of the GMM can be mapped back to the natural parameters, and that a minorization–maximization (MM) algorithm can be constructed, which retains the desirable numerical properties of the EM algorithm, without the use of matrix operations. We prove that the ML estimators of the LRC parameters are consistent and asymptotically normal, like their natural counterparts. Furthermore, we show that the LRC allows for simple handling of singularities in the ML estimation of GMMs. Using numerical simulations in the R programming environment, we then demonstrate that the MM algorithm can be faster than the EM algorithm in various large data situations, where sample sizes range in the tens to hundreds of thousands and for estimating models with up to 16 mixture components on multivariate data with up to 16 variables.

References

Amemiya T (1985) Advanced econometrics. Harvard University Press, Cambridge
Anderson TW (2003) An introduction to multivariate statistical analysis. Wiley, New York
Andrews JL, McNicholas PD (2013) Using evolutionary algorithms for model-based clustering. Pattern Recognit Lett 34:987–992
Atienza N, Garcia-Heras J, Munoz-Pichardo JM, Villa R (2007) On the consistency of MLE in finite mixture models of exponential families. J Stat Plan Inference 137:496–505
Becker MP, Yang I, Lange K (1997) EM algorithms without missing data. Stat Methods Med Res 6:38–54
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Botev Z, Kroese DP (2004) Global likelihood optimization via the cross-entropy method with an application to mixture models. In: Proceedings of the 36th conference on winter simulation
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14:315–332
Clarke B, Fokoue E, Zhang HH (2009) Principles and theory for data mining and machine learning. Springer, New York
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley, New York
Ganesalingam S, McLachlan GJ (1980) A comparison of the mixture and classification approaches to cluster analysis. Commun Stat Theory Methods 9:923–933
Greselin F, Ingrassia S (2008) A note on constrained EM algorithms for mixtures of elliptical distributions. In: Advances in data analysis, data handling and business intelligence. Proceedings of the 32nd annual conference of the German classification society, vol 53
Hartigan JA (1985) Statistical theory in clustering. J Classif 2:63–76
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer, New York
Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800
Hunter DR, Lange K (2004) A tutorial on MM algorithms. Am Stat 58:30–37
Ingrassia S (1991) Mixture decomposition via the simulated annealing algorithm. Appl Stoch Models Data Anal 7:317–325
Ingrassia S (2004) A likelihood-based constrained algorithm for multivariate normal mixture models. Stat Methods Appl 13:151–166
Ingrassia S, Rocci R (2007) Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Comput Stat Data Anal 51:5339–5351
Ingrassia S, Rocci R (2011) Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Comput Stat Data Anal 55:1714–1725
Ingrassia S, Minotti SC, Vittadini G (2012) Local statistical modeling via a cluster-weighted approach with elliptical distributions. J Classif 29:363–401
Ingrassia S, Minotti SC, Punzo A (2014) Model-based clustering via linear cluster-weighted models. Comput Stat Data Anal 71:159–182
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31:651–666
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323
Jennrich RI (1969) Asymptotic properties of non-linear least squares estimators. Ann Math Stat 40:633–643
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, pp 281–297
McLachlan GJ (1982) The classification and mixture maximum likelihood approaches to cluster analysis. In: Krishnaiah PR, Kanal L (eds) Handbook of statistics, vol 2. North-Holland, Amsterdam
McLachlan GJ, Basford KE (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions. Wiley, New York
Pernkopf F, Bouchaffra D (2005) Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Trans Pattern Anal Mach Intell 27:1344–1348
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Razaviyayn M, Hong M, Luo ZQ (2013) A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J Optim 23:1126–1153
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239
Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
Seber GAF (2008) A matrix handbook for statisticians. Wiley, New York
Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New York
Zhou H, Lange K (2010) MM algorithms for some discrete multivariate distributions. J Comput Graph Stat 19:645–665

Author information

Authors and Affiliations

Department of Mathematics, School of Mathematics and Physics, University of Queensland, St. Lucia, 4072, Australia
Hien D. Nguyen & Geoffrey J. McLachlan

Corresponding author

Correspondence to Hien D. Nguyen.

Appendix

1.1 Proof of Theorem 1

We shall show the result by construction. Firstly, set

$$\begin{aligned} \beta _{1,0}=\mu _{1}\quad \hbox {and}\quad \sigma _{1}^{2}=\varSigma _{1,1}, \end{aligned}$$

(22)

followed by

$$\begin{aligned}&\beta _{k,0}=\mu _{k}-{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}{\varvec{\mu }}_{1:k-1},\end{aligned}$$

(23)

$$\begin{aligned}&(\beta _{k,1},\ldots ,\beta _{k,k-1})={\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}, \end{aligned}$$

(24)

and

$$\begin{aligned} \sigma _{k}^{2}=\varSigma _{k,k}-{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} {\varvec{\varSigma }}_{k,1:k-1}^{T}, \end{aligned}$$

(25)

for each $k=2,\ldots ,d$, in order, to get

$$\begin{aligned} {\varvec{\beta }}_{k}^{T}\tilde{{\varvec{x}}}_{k}= & {} \beta _{k,0}+(\beta _{k,1},\ldots ,\beta _{k,k-1}){\varvec{x}}_{1:k-1}\\= & {} \mu _{k}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1}({\varvec{x}}_{1:k-1}- {\varvec{\mu }}_{1:k-1})\\= & {} \mu _{k|1:k-1}({\varvec{x}}_{1:k-1}), \end{aligned}$$

and $\sigma _{k}^{2}=\varSigma _{k|1:k-1}$.

Now, by Lemma 1, and by definition of conditional densities,

$$\begin{aligned} \phi _{1}(x_{1};\mu _{1},\varSigma _{1,1}) \prod _{k=2}^{d}\phi _{1}(x_{k};\mu _{k|1:k-1}( {\varvec{x}}_{1:k-1}),\varSigma _{k|1:k-1})= \phi _{d}({\varvec{x}}; {\varvec{\mu }},{\varvec{\varSigma }}), \end{aligned}$$

for all ${\varvec{x}}\in \mathbb {R}^{d}$, which implies $\lambda ({\varvec{x}}; {\varvec{\gamma }}, {\varvec{\sigma }}^{2})=\phi _{d}({\varvec{x}}; {\varvec{\mu }}, {\varvec{\varSigma }})$ by application of the mappings (22)–(25). Note that ${\varvec{\mu }}$ and $\hbox {vech}({\varvec{\varSigma }})$, and ${\varvec{\gamma }}$ and ${\varvec{\sigma }}^{2}$ have equal numbers of elements, and (22)–(25) are unique for each k. Thus, there is an injective mapping between the LRC and the natural parameters. The inverse mapping can also be constructed by setting

$$\begin{aligned} \mu _{1}=\beta _{1,0}\quad \hbox {and}\quad {\varSigma _{1,1}=\sigma _{1}^{2}}, \end{aligned}$$

(26)

followed by

$$\begin{aligned}&{\varvec{\varSigma }}_{k,1:k-1}=(\beta _{k,1},\ldots , \beta _{k,k-1}){\varvec{\varSigma }}_{1:k-1,1:k-1},\end{aligned}$$

(27)

$$\begin{aligned}&\varSigma _{k,k}=\sigma _{k}^{2}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} {\varvec{\varSigma }}_{k,1:k-1}^{T}, \end{aligned}$$

(28)

and

$$\begin{aligned} \mu _{k}=\beta _{k,0}+{\varvec{\varSigma }}_{k,1:k-1} {\varvec{\varSigma }}_{1:k-1,1:k-1}^{-1} {\varvec{\mu }}_{1:k-1}, \end{aligned}$$

(29)

for each $k=2,\ldots ,d$, in order. The mappings (26)–(29) are also unique for each k, and thus constitute a surjective mapping.
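The two mappings can be illustrated numerically. The following is a minimal sketch in plain Python, not the authors' implementation (their simulations use R). The function names `solve`, `natural_to_lrc` and `lrc_to_natural` are hypothetical, indices are zero-based, and the small linear solve is used only to evaluate the forward mapping (23)–(25). The backward mapping recovers each covariance row by solving (24) for ${\varvec{\varSigma }}_{k,1:k-1}$.

```python
def solve(A, b):
    """Solve A x = b for a small square A by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def natural_to_lrc(mu, Sigma):
    """Forward mapping: (mu, Sigma) -> regression coefficients and variances."""
    betas = [[mu[0]]]            # intercept only for the first coordinate, as in (22)
    sigma2 = [Sigma[0][0]]
    for k in range(1, len(mu)):
        S = [row[:k] for row in Sigma[:k]]   # leading k-by-k block
        s = Sigma[k][:k]                     # covariances with earlier coordinates
        b = solve(S, s)                      # slopes, as in (24); S is symmetric
        b0 = mu[k] - sum(bi * mi for bi, mi in zip(b, mu[:k]))              # (23)
        betas.append([b0] + b)
        sigma2.append(Sigma[k][k] - sum(bi * si for bi, si in zip(b, s)))   # (25)
    return betas, sigma2


def lrc_to_natural(betas, sigma2):
    """Backward mapping: regression coefficients and variances -> (mu, Sigma)."""
    mu = [betas[0][0]]
    Sigma = [[sigma2[0]]]
    for k in range(1, len(sigma2)):
        b0, b = betas[k][0], betas[k][1:]
        # Covariance row obtained by solving (24) for Sigma_{k,1:k-1}.
        s = [sum(b[i] * Sigma[i][j] for i in range(k)) for j in range(k)]
        skk = sigma2[k] + sum(bi * si for bi, si in zip(b, s))              # (28)
        mu.append(b0 + sum(bi * mi for bi, mi in zip(b, mu)))               # (29)
        for i in range(k):
            Sigma[i].append(s[i])
        Sigma.append(s + [skk])
    return mu, Sigma


# Illustrative values only (not taken from the article).
mu = [1.0, -2.0, 0.5]
Sigma = [[4.0, 2.0, 0.6], [2.0, 3.0, 0.4], [0.6, 0.4, 2.0]]
betas, sigma2 = natural_to_lrc(mu, Sigma)
mu_back, Sigma_back = lrc_to_natural(betas, sigma2)
```

Applying the forward mapping and then the backward mapping should reproduce the original mean vector and covariance matrix up to rounding error, which is the numerical counterpart of the one-to-one correspondence argued in the proof.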

1.2 Proof of Theorem 2

The first and last inequalities of (19) and (20) are due to the definition of minorization [i.e. (11) and (14) are of forms (9) and (10), respectively]. The middle inequality of (19) is due to the concavity of $Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})$. This can be shown by firstly noting that

$$\begin{aligned} \sum _{i=1}^{g-1}\sum _{j=1}^{n}\tau _{i} ({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}) \log (\pi _{i})+\sum _{j=1}^{n}\tau _{g}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})\log \left( 1-\sum _{i=1}^{g-1}\exp [\log (\pi _{i})]\right) \end{aligned}$$

is concave in $\log (\pi _{i})$ since $1-\sum _{i=1}^{g-1}\exp [\log (\pi _{i})]$ is concave and $\log $ is an increasing concave function. Secondly, note that

$$\begin{aligned} \frac{\partial ^{2}Q^{\prime }({\varvec{\psi }}_{1}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)})}{\partial \beta _{i,k,l}^{2}}=-\frac{1}{2\sigma _{i,k}^{(m)2}} \sum _{j=1}^{n}\frac{x_{j,l}^{2}\tau _{i}( {\varvec{x}}_{j};{\varvec{\psi }}^{(m)})}{\alpha _{j,l}} \end{aligned}$$

is negative, and thus $Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})$ is concave with respect to each $\beta _{i,k,l}$ for each i, k and $l=0,\ldots ,k-1$. Thus, $Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})$ is the additive composition of concave functions and is therefore concave with respect to a bijection of ${\varvec{\psi }}_{1}$. Furthermore, the system of equations

$$\begin{aligned} \frac{\partial Q^{\prime }({\varvec{\psi }}_{1}, {\varvec{\psi }}_{2}^{(m)};{\varvec{\psi }}^{(m)})}{\partial \log (\pi _{i})}=0, \end{aligned}$$

for $i=1,\ldots ,g-1$, has a unique root that is equivalent to update (16), which always satisfies the positivity restrictions on each $\pi _{i}$.
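As a sketch of why this root is feasible, write $n_{i}=\sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})$ and $u_{i}=\log (\pi _{i})$; both shorthands are introduced here for illustration only. Assuming that the $\tau _{i}$ are posterior probabilities that sum to one over i for each j, differentiating the displayed expression with respect to $u_{i}$ and setting the result to zero gives

$$\begin{aligned} n_{i}-n_{g}\frac{\exp (u_{i})}{1-\sum _{l=1}^{g-1}\exp (u_{l})}=0 \quad \Longrightarrow \quad \pi _{i}=\frac{n_{i}}{n_{g}}\pi _{g} \quad \Longrightarrow \quad \pi _{i}=\frac{n_{i}}{\sum _{l=1}^{g}n_{l}}=\frac{1}{n}\sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)}), \end{aligned}$$

where the last step uses $\sum _{i=1}^{g}\pi _{i}=1$. Each such $\pi _{i}$ lies strictly between 0 and 1 whenever every $n_{i}$ is positive. Update (16) is not reproduced in this appendix, so its agreement with this closed form is presumed rather than quoted.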

The middle inequality of (20) is due to the concavity of $Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})$. This can be shown by noting that

$$\begin{aligned} -\frac{1}{2}\log \sigma _{i,k}^{2}\sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})- \frac{1}{2\exp [\log \sigma _{i,k}^{2}]} \sum _{j=1}^{n}\tau _{i}({\varvec{x}}_{j}; {\varvec{\psi }}^{(m)})Q_{2,i,j,k} ({\varvec{\beta }}_{i,k};{\varvec{\beta }}_{i,k}^{(m)}) \end{aligned}$$

is concave in $\log \sigma _{i,k}^{2}$ for each i and k, since the first term is linear in $\log \sigma _{i,k}^{2}$ and the reciprocal $1/\exp (x)=\exp (-x)$ is convex. Thus, $Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})$ is concave with respect to a bijection of ${\varvec{\psi }}_{2}.$ Furthermore, the system of equations

$$\begin{aligned} \frac{\partial Q({\varvec{\psi }}_{1}^{(m)}, {\varvec{\psi }}_{2};{\varvec{\psi }}^{(m)})}{\partial \log \sigma _{i,k}^{2}}=0 \end{aligned}$$

has a unique root that is equivalent to update (18).
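The concavity argument and the resulting root can be checked numerically. The sketch below is an illustration under stated assumptions, not the article's code: `tau` and `q` are hypothetical stand-ins for the weights $\tau _{i}({\varvec{x}}_{j};{\varvec{\psi }}^{(m)})$ and the values $Q_{2,i,j,k}$, both assumed non-negative, and the closed form is what the first-order condition in $u=\log \sigma _{i,k}^{2}$ yields. Update (18) itself is not reproduced in this appendix.

```python
import math


def objective(u, tau, q):
    """Value of the displayed expression as a function of u = log(sigma^2)."""
    total_weight = sum(tau)
    weighted_q = sum(t * v for t, v in zip(tau, q))
    return -0.5 * u * total_weight - 0.5 * math.exp(-u) * weighted_q


def variance_root(tau, q):
    """Root of d(objective)/du = 0: sigma^2 = sum(tau * q) / sum(tau)."""
    return sum(t * v for t, v in zip(tau, q)) / sum(tau)


# Illustrative weights and quadratic terms (not taken from the article).
tau = [0.2, 0.9, 0.5]
q = [1.0, 0.25, 4.0]
sigma2_hat = variance_root(tau, q)
u_hat = math.log(sigma2_hat)
```

Because the variance is obtained by exponentiating the root in u, it is automatically positive, which mirrors the remark about positivity made for the mixing proportions.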

1.3 Proof of Theorem 3

This result follows from part (a) of Theorem 2 from Razaviyayn et al. (2013), which assumes that $Q^{\prime }({\varvec{\psi }}_{1},{\varvec{\psi }}_{2}^{(m)}; {\varvec{\psi }}^{(m)})$ and $Q({\varvec{\psi }}_{1}^{(m)},{\varvec{\psi }}_{2}; {\varvec{\psi }}^{(m)})$ both satisfy the definition of a minorizer, and are quasi-concave and have unique critical points, with respect to the parameters ${\varvec{\psi }}_{1}$ and ${\varvec{\psi }}_{2}$, respectively.

Firstly, the definition of a minorizer is satisfied via construction [i.e. (11) and (14) are of forms (9) and (10), respectively]. Secondly, from the proof of Theorem 2, both functions are concave with respect to some bijective mappings, and are therefore quasi-concave under said mappings [see Section 3.4 of Boyd and Vandenberghe (2004) regarding quasi-concavity]. Finally, since both functions are concave with respect to some bijective mapping, the critical points obtained must be unique.

1.4 Proof of Theorem 4

We show this result via induction. Firstly, using (26), we see that $\sigma _{1}^{2}=\det (\varSigma _{1,1})>0$ is the first leading principal minor of ${\varvec{\varSigma }}$. Now, by definition of (28), $\sigma _{2}^{2}$ is the Schur complement of ${\varvec{\varSigma }}_{1:k-1,1:k-1}$ in ${\varvec{\varSigma }}_{1:k,1:k}$, for $k=2$, where

$$\begin{aligned} {\varvec{\varSigma }}_{1:k,1:k}=\left[ \begin{array}{c@{\quad }c} {\varvec{\varSigma }}_{1:k-1,1:k-1} &{} {\varvec{\varSigma }}_{k,1:k-1}^{T}\\ {\varvec{\varSigma }}_{k,1:k-1} &{} \varSigma _{k,k} \end{array}\right] . \end{aligned}$$

(30)

Since $\sigma _{2}^{2}$ is positive and $\varSigma _{1,1}$ is positive definite, we have the result that

$$\begin{aligned} \det ({\varvec{\varSigma }}_{1:2,1:2})=\det (\varSigma _{1,1}) \sigma _{2}^{2}>0 \end{aligned}$$

via the partitioning of the determinant. Thus, ${\varvec{\varSigma }}_{1:2,1:2}$ is also positive definite because both the first and second leading principal minors are positive.

Now, for each $k=3,\ldots ,d$, we assume that ${\varvec{\varSigma }}_{1:k-1,1:k-1}$ is positive-definite. Since $\sigma _{k}^{2}>0$ is the Schur complement of the partitioning (30), we have the result that

$$\begin{aligned} \det ({\varvec{\varSigma }}_{1:k,1:k})=\det ({\varvec{\varSigma }}_{1:k-1,1:k-1}) \sigma _{k}^{2}>0. \end{aligned}$$

Thus, the $k\hbox {th}$ leading principal minor is positive, for all k. The result follows by the property of positive-definite matrices; see Chapters 10 and 14 of Seber (2008) for all relevant matrix results.
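The determinant identity used in the induction can be checked on a small example. The sketch below is illustrative only: `lrc_to_covariance`, `det` and `leading_minors` are hypothetical helper names, the slopes and variances are arbitrary, and the covariance matrix is built from the LRC parameters as in the inverse mapping of Theorem 1. Each leading principal minor should equal the running product of the $\sigma _{k}^{2}$.

```python
def lrc_to_covariance(slopes, sigma2):
    """Build Sigma from regression slopes and positive conditional variances."""
    Sigma = [[sigma2[0]]]
    for k in range(1, len(sigma2)):
        b = slopes[k]   # (beta_{k,1}, ..., beta_{k,k-1}); slopes[0] is unused
        s = [sum(b[i] * Sigma[i][j] for i in range(k)) for j in range(k)]
        skk = sigma2[k] + sum(bi * si for bi, si in zip(b, s))
        for i in range(k):
            Sigma[i].append(s[i])
        Sigma.append(s + [skk])
    return Sigma


def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in A]
    n, sign, out = len(M), 1.0, 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[p][c] == 0.0:
            return 0.0
        if p != c:
            M[c], M[p] = M[p], M[c]
            sign = -sign
        out *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= f * M[c][j]
    return sign * out


def leading_minors(A):
    """Determinants of the leading principal submatrices of A."""
    return [det([row[:k] for row in A[:k]]) for k in range(1, len(A) + 1)]


# Illustrative LRC parameters (not taken from the article).
slopes = [[], [0.5], [-1.0, 2.0]]
sigma2 = [4.0, 2.0, 0.5]
Sigma = lrc_to_covariance(slopes, sigma2)
minors = leading_minors(Sigma)
```

Since every $\sigma _{k}^{2}$ in the example is positive, all running products are positive, so the constructed matrix passes the leading-principal-minor test for positive definiteness regardless of the slopes chosen.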

1.5 Proof of Theorem 5

Theorem 5 can be established from Theorem 4.1.2 of Amemiya (1985), which requires the validation of the assumptions,

A1
The parameter space $\varPsi $ is an open subset of some Euclidean space.
A2
The log-likelihood $\log \mathcal {L}_{R,n}({\varvec{\psi }})$ is a measurable function for all ${\varvec{\psi }}\in \varPsi $, and $\partial (\log \mathcal {L}_{R,n}({\varvec{\psi }}))/\partial {\varvec{\psi }}$ exists and is continuous in an open neighborhood $N_{1}({\varvec{\psi }}^{0})$ of ${\varvec{\psi }}^{0}$.
A3
There exists an open neighborhood $N_{2}({\varvec{\psi }}^{0})$ of ${\varvec{\psi }}^{0}$, where $n^{-1}\log \mathcal {L}_{R,n}({\varvec{\psi }})$ converges to $\mathbb {E}[\log f_{R}({\varvec{X}}; {\varvec{\psi }})]$ in probability uniformly in ${\varvec{\psi }}$ in any compact subset of $N_{2}({\varvec{\psi }}^{0})$.

Assumptions A1 and A2 are fulfilled by noting that the parameter space $\varPsi =(0,1)^{g-1}\times \mathbb {R}^{g(d^{2}+d)/2+gd}$ is an open subset of $\mathbb {R}^{(g-1)+g(d^{2}+d)/2+gd}$, and that $\log \mathcal {L}_{R,n}({\varvec{\psi }})$ is smooth with respect to the parameters ${\varvec{\psi }}$. Using Theorem 2 of Jennrich (1969), we can show that A3 holds by verifying that

$$\begin{aligned} \mathbb {E}\sup _{{\varvec{\psi }}\in \bar{N}}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|<\infty , \end{aligned}$$

(31)

where $\bar{N}$ is a compact subset of $N_{2}({\varvec{\psi }}^{0})$. Since $f_{R}({\varvec{X}}; {\varvec{\psi }})$ is smooth, this is equivalent to showing that $\mathbb {E}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|<\infty $, for any fixed ${\varvec{\psi }}\in \bar{N}$. This is achieved by noting that

$$\begin{aligned} \mathbb {E}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|= & {} \mathbb {E}|\log f_{R}({\varvec{X}}; {\varvec{\psi }})|\nonumber \\= & {} \mathbb {E}\left| \log \sum _{i=1}^{g}\pi _{i}\lambda ({\varvec{x}}; {\varvec{\gamma }}_{i}, {\varvec{\sigma }}_{i}^{2})\right| \nonumber \\\le & {} \sum _{i=1}^{g}\mathbb {E}|\log \lambda ({\varvec{x}}; {\varvec{\gamma }}_{i},{\varvec{\sigma }}_{i}^{2})|\nonumber \\= & {} \sum _{i=1}^{g}\mathbb {E}\left| \sum _{k=1}^{d}\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})\right| \nonumber \\\le & {} \sum _{i=1}^{g}\sum _{k=1}^{d}\mathbb {E} |\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})|. \end{aligned}$$

(32)

The inequality on line 3 of (32) is due to Lemma 1 of Atienza et al. (2007). Considering that $\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})$ is a polynomial function of Gaussian random variables, we have $\mathbb {E}|\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T} \tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})|<\infty $ for each i and k. The result then follows.
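The finiteness claim can be spelled out in one more step. This is a sketch under the assumption that $\phi _{1}(x;m,s^{2})$ denotes the univariate normal density with mean m and variance $s^{2}$, so that its logarithm is a quadratic in the observations:

$$\begin{aligned} \log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T}\tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})= & {} -\frac{1}{2}\log (2\pi \sigma _{i,k}^{2})-\frac{(x_{k}-{\varvec{\beta }}_{i,k}^{T}\tilde{{\varvec{x}}}_{k})^{2}}{2\sigma _{i,k}^{2}},\\ \mathbb {E}|\log \phi _{1}(x_{k};{\varvec{\beta }}_{i,k}^{T}\tilde{{\varvec{x}}}_{k},\sigma _{i,k}^{2})|\le & {} \frac{1}{2}|\log (2\pi \sigma _{i,k}^{2})|+\frac{\mathbb {E}[(x_{k}-{\varvec{\beta }}_{i,k}^{T}\tilde{{\varvec{x}}}_{k})^{2}]}{2\sigma _{i,k}^{2}}. \end{aligned}$$

The right-hand side is finite because the squared residual is a quadratic form in the observation vector, and a Gaussian mixture has finite second moments.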

1.6 Proof of Theorem 6

Theorem 6 can be established from Theorem 4.2.4 of Amemiya (1985), which requires the validation of the assumptions,

B1
The Hessian $\partial ^{2}(\log \mathcal {L}_{R,n}({\varvec{\psi }}))/ \partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}$ exists and is continuous in an open neighborhood of ${\varvec{\psi }}^{0}$.
B2
The equations
$$\begin{aligned} \int \frac{\partial \log f_{R}({\varvec{\psi }})}{\partial {\varvec{\psi }}}\hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$
and
$$\begin{aligned} \int \frac{\partial ^{2}\log f_{R}({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}} \hbox {d}{\varvec{x}}=\varvec{0}, \end{aligned}$$
hold, for any ${\varvec{\psi }}\in \varPsi $.
B3
The averaged Hessian satisfies
$$\begin{aligned} \frac{1}{n}\frac{\partial ^{2}\log \mathcal {L}_{R,n} ({\varvec{\psi }})}{\partial {\varvec{\psi }}\partial {\varvec{\psi }}^{T}}\overset{P}{\rightarrow }\mathbb {E} \left[ \frac{\partial ^{2}\log f_{R}({\varvec{X}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\right] , \end{aligned}$$
uniformly in ${\varvec{\psi }}$, in all compact subsets of an open neighborhood of ${\varvec{\psi }}^{0}$.
B4
The Fisher information
$$\begin{aligned} -\mathbb {E}\left[ \frac{\partial ^{2}\log f_{R} ({\varvec{x}}; {\varvec{\psi }})}{\partial {\varvec{\psi }} \partial {\varvec{\psi }}^{T}}\bigg |_{{\varvec{\psi }}= {\varvec{\psi }}^{0}}\right] , \end{aligned}$$
is positive-definite.

Assumption B1 is validated via the smoothness of $\log \mathcal {L}_{R,n}({\varvec{\psi }})$, and it is mechanical to check the validity of B2. Assumption B3 can be shown via Theorem 2 of Jennrich (1969). Unlike the others, B4 must be taken as given.

About this article

Cite this article

Nguyen, H.D., McLachlan, G.J. Maximum likelihood estimation of Gaussian mixture models without matrix operations. Adv Data Anal Classif 9, 371–394 (2015). https://doi.org/10.1007/s11634-015-0209-7

Received: 28 September 2014
Revised: 03 May 2015
Accepted: 22 May 2015
Published: 05 June 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s11634-015-0209-7
