Abstract
In this paper, we present a theoretical and an experimental comparison of EM and SEM algorithms for different mixture models. The SEM algorithm is a stochastic variant of the EM algorithm. The qualitative intuition behind the SEM algorithm is simple: if the number of observations is large enough, then we expect an update step of the stochastic SEM algorithm to be similar to the corresponding update step of the deterministic EM algorithm. In this paper, we quantify this intuition. We show that, with high probability, the update equations of any EM-like algorithm and its stochastic variant are similar, given that the input set satisfies certain properties. For instance, this result applies to the well-known EM and SEM algorithms for Gaussian mixture models and to EM-like and SEM-like heuristics for multivariate power exponential distributions. Our experiments confirm that our theoretical results also hold for a large number of successive update steps. Thereby, we complement the known asymptotic results for the SEM algorithm. We also show that, for multivariate Gaussian and multivariate Laplacian mixture models, an update step of SEM runs nearly twice as fast as an EM update step.
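For readers who want a concrete picture of the two update rules, the following minimal Python sketch (our own illustration, not the implementation evaluated in this paper; all names are ours) contrasts a single EM update, which weights each point by its posterior responsibility, with a single SEM update, which first draws one hard assignment per point from that posterior and then updates on the resulting partition:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated one-dimensional Gaussian clusters.
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])

def posteriors(x, w, mu, sigma):
    # E-step: responsibility p_nk of component k for point n.
    dens = np.stack([
        w[k] * np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
        / (np.sqrt(2 * np.pi) * sigma[k])
        for k in range(2)
    ])
    return dens / dens.sum(axis=0)

def em_update(x, w, mu, sigma):
    p = posteriors(x, w, mu, sigma)                 # soft assignments
    r = p.sum(axis=1)                               # expected masses r_k
    new_mu = (p * x).sum(axis=1) / r
    return r / x.size, new_mu

def sem_update(x, w, mu, sigma, rng):
    p = posteriors(x, w, mu, sigma)
    # S-step: one hard assignment per point, drawn from its posterior.
    z = (rng.random(x.size) < p[1]).astype(int)
    R = np.array([(z == 0).sum(), (z == 1).sum()])  # sampled masses R_k
    new_mu = np.array([x[z == 0].mean(), x[z == 1].mean()])
    return R / x.size, new_mu

w0, mu0, s0 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([3.0, 3.0])
w_em, mu_em = em_update(x, w0, mu0, s0)
w_sem, mu_sem = sem_update(x, w0, mu0, s0, rng)
```

With well-separated clusters and many points, the sampled masses and means land close to their deterministic counterparts, which is exactly the intuition quantified below.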
References
Bilmes J (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical report, Computer Science Division, Department of Electrical Engineering and Computer Science, U.C. Berkeley
Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer, New York
Blömer J, Bujna K, Kuntze D (2014) A theoretical and experimental comparison of the EM and SEM algorithm. In: 2014 22nd international conference on pattern recognition, pp 1419–1424. https://doi.org/10.1109/icpr.2014.253
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the em algorithm for the mixture problem. Comput Stat Q 2:73–82
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332. https://doi.org/10.1016/0167-9473(92)90042-E
Celeux G, Chauveau D, Diebolt J (1995) On stochastic versions of the EM algorithm. Research report RR-2514, INRIA Paris-Rocquencourt. https://hal.inria.fr/inria-00074164. Accessed 4 July 2019
Celeux G, Chauveau D, Diebolt J (1996) Stochastic versions of the EM algorithm: an experimental study in the mixture case. J Stat Comput Simul 55(4):287–314. https://doi.org/10.1080/00949659608811772
Dang UJ, Browne RP, McNicholas PD (2015) Mixtures of multivariate power exponential distributions. Biometrics 71(4):1081–1089. https://doi.org/10.1111/biom.12351
Dasgupta S, Schulman L (2007) A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J Mach Learn Res 8:203–226
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Stat Methodol 39(1):1–38
Dias JG, Wedel M (2004) An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Stat Comput 14(4):323–332. https://doi.org/10.1023/B:STCO.0000039481.32211.5a
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 4 July 2019
Gómez E, Gómez-Villegas MA, Marín JM (1998) A multivariate generalization of the power exponential family of distributions. Commun Stat Theory Methods 27(3):589–600. https://doi.org/10.1080/03610929808832115
Ip EHS (1994) A stochastic EM estimator in the presence of missing data—theory and applications. PhD thesis, Stanford University
ISO (2012) ISO/IEC 14882:2011 information technology—programming languages—C++. International Organization for Standardization, Geneva, Switzerland
McDiarmid C (1998) Concentration. In: Habib M, McDiarmid C, Ramirez-Alfonsin J, Reed B (eds) Probabilistic methods for algorithmic discrete mathematics. Springer, Berlin, pp 195–248. https://doi.org/10.1007/978-3-662-12788-9_6
McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions (Wiley series in probability and statistics). Wiley, Hoboken. https://doi.org/10.1002/9780470191613
Nielsen SF (2000a) On simulated EM algorithms. J Econom 96(2):267–292. https://doi.org/10.1016/S0304-4076(99)00060-3
Nielsen SF (2000b) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6(3):457–489. https://doi.org/10.2307/3318671
Zhang J, Liang F (2010) Robust clustering using exponential power mixtures. Biometrics 66(4):1078–1086. https://doi.org/10.1111/j.1541-0420.2010.01389.x
Acknowledgements
Funding was provided by Deutsche Forschungsgemeinschaft (DE) (Grant No. BL 314/8-1).
Full proof of the main theorem
In the following, we provide the full proof of Theorem 1. Key tools in our proofs are the following Chernoff-type bounds.
Lemma 2
(McDiarmid (1998)) Let \(X_1,\ldots ,X_n\) be independent random variables in [0, 1] and \(Y=\sum _{i=1}^n X_i\). Then, for all \(\lambda \in [0,1]\) we have
$$\begin{aligned} \Pr \left[ \left| Y-{{\,\mathrm{E}\,}}[Y]\right| \ge \lambda \cdot {{\,\mathrm{E}\,}}[Y]\right] \le 2\exp \left( -\frac{\lambda ^2\,{{\,\mathrm{E}\,}}[Y]}{3}\right) . \end{aligned}$$
Corollary 3
Let \(X_1,\ldots ,X_n\) be independent random variables in [0, 1] and \(Y=\sum _{i=1}^n X_i\). Let \(\delta \in (0,1)\). If we have
$$\begin{aligned} {{\,\mathrm{E}\,}}[Y]\ge 3\ln (2/\delta ), \end{aligned}$$
then with probability at least \(1-\delta \) we have
$$\begin{aligned} \left| Y-{{\,\mathrm{E}\,}}[Y]\right| \le \sqrt{3\ln (2/\delta )\,{{\,\mathrm{E}\,}}[Y]}. \end{aligned}$$
Proof
Due to \({{\,\mathrm{E}\,}}[Y]\ge 3\ln (2/\delta )\), we have \(\lambda :=\sqrt{3\ln (2/\delta )/{{\,\mathrm{E}\,}}[Y]}\in [0,1]\). Applying Lemma 2 yields the claim. \(\square \)
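The concentration behavior behind Corollary 3 is easy to check empirically. The sketch below (our own illustration, assuming the deviation bound has the form \(\sqrt{3\ln (2/\delta )\,{{\,\mathrm{E}\,}}[Y]}\), which matches the choice of \(\lambda \) in the proof) estimates the failure frequency for sums of Bernoulli variables:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta, trials = 2000, 0.3, 0.05, 1000

EY = n * p                                   # E[Y] = 600 >> 3 ln(2/delta),
                                             # so the precondition holds
bound = np.sqrt(3 * np.log(2 / delta) * EY)  # deviation allowed by Corollary 3

# Y is a sum of n independent [0,1]-valued (here Bernoulli) variables.
Y = rng.binomial(n, p, size=trials)
fail_rate = np.mean(np.abs(Y - EY) >= bound)
# fail_rate stays below delta (here it is far smaller).
```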
Lemma 4
(McDiarmid (1998)) Let \(C\ge 0\) be some constant, \(X_1,\ldots ,X_n\) be independent random variables, such that \(\forall i\in [n]: {{\,\mathrm{E}\,}}[X_i] = 0\) and \(\left| X_i \right| \le C\), and let \(Y=\sum _{i=1}^n X_i\). Then, for all \(\lambda \ge 0\) we have
$$\begin{aligned} \Pr \left[ Y\ge \lambda \right] \le \exp \left( -\frac{\lambda ^2}{2\left( \text {Var}(Y)+C\lambda /3\right) }\right) . \end{aligned}$$
Corollary 5
Let \(C\ge 0\) be some constant, \(X_1,\ldots ,X_n\) be independent random variables, such that \(\forall i\in [n]: {{\,\mathrm{E}\,}}[X_i] = 0\) and \(\left| X_i \right| \le C\), and let \(Y=\sum _{i=1}^n X_i\). Then, for all \(\delta \in (0,1)\) and all \(\lambda \) with
$$\begin{aligned} \sqrt{4\ln (2/\delta )}\le \lambda \le \frac{3\sqrt{\text {Var}(Y)}}{C} \end{aligned}$$
we have
$$\begin{aligned} \Pr \left[ \left| Y\right| \le \lambda \sqrt{\text {Var}(Y)}\right] \ge 1-\delta . \end{aligned}$$
Proof
We obtain the upper and the lower bound by applying Lemma 4 with \(t = \lambda \sqrt{\text {Var}(Y)}\) to the variables \(Y = \sum _{i=1}^n X_i\) and \(Y' = \sum _{i=1}^n -X_i\), respectively. Then, applying the union bound yields the claim. \(\square \)
1.1 Proximity of the parameter updates
In the following we fix some solution \(\theta ^{old}\) and compare a single update step of the \(\hbox {EM}^*\) algorithm with a single update step of the \(\hbox {SEM}^*\) algorithm, both given \(\theta ^{old}\). To this end, we make extensive use of the variables defined in Algorithms 3 and 4, the quantities defined in Sect. 3, and the following notation.
Notation
Throughout the rest of this section, we fix the following parameters. We let
We fix some component \(k\in [K]\), dimensions \(d,i,j\in [D]\), and a probability of success \(\delta \in (0,1)\). Moreover, as in Theorem 1, we use
First, we bound the difference in the weights and the \(\zeta \)-responsibilities.
Theorem 6
(Proximity of Masses) If we have
$$\begin{aligned} r_k\ge 3\ln (2/\delta ), \end{aligned}$$
then with probability at least \(1-\delta \)
$$\begin{aligned} \left| R_k-r_k\right| \le \sqrt{3\ln (2/\delta )\,r_k}. \end{aligned}$$
Proof
Recall that \({{\,\mathrm{E}\,}}[R_k]=r_k\) and \(R_k=\sum _{n=1}^N Z_{nk}\) with \(Z_{nk}\in \{0,1\}\). Applying Corollary 3 yields the claim. \(\square \)
Remark 7
(Proximity of Mixture Weights) If Eq. (6) of Theorem 6 is satisfied, then by definition
Theorem 8
(Proximity of \(\zeta \)-Masses) If we have
$$\begin{aligned} u_k\ge 3\,\zeta ^{max}_k\ln (2/\delta ), \end{aligned}$$
then with probability at least \(1-\delta \)
$$\begin{aligned} \left| U_k-u_k\right| \le \sqrt{3\ln (2/\delta )\,\zeta ^{max}_k\,u_k}. \end{aligned}$$
Proof
Recall that \({{\,\mathrm{E}\,}}[U_k]=u_k\) and \(U_k=\sum _{n=1}^N \zeta _{nk}Z_{nk}\) with \(\zeta _{nk}Z_{nk}\in [0,\zeta ^{max}_k]\). Applying Corollary 3 to the (scaled) random variables \(\zeta _{nk}Z_{nk}/ \zeta ^{max}_k\in [0,1]\) yields the claim. \(\square \)
By using the standard deviations \(\tau _{kd}\) and \(\rho _{kij}\), one can bound the difference between the numerators of the mean and scale updates, respectively.
Theorem 9
With probability at least \(1-\delta \), we have
for any \(\lambda ^{(\mu )}_{ kd}\) with
Proof
For each \(n\in [N]\), define the real random variable
Since \({{\,\mathrm{E}\,}}[Z_{nk}]=p_{nk}\), we get that \({{\,\mathrm{E}\,}}\left[ \tilde{M}_{kdn}\right] =0\) and
Furthermore, since each \(\mu _k\) is a convex combination of \(x_1,\ldots ,x_N\), we obtain
Note that by definition of \(\mu _k\), we have \(\sum _{n=1}^N p_{nk}\zeta _{nk}\left( x_n-\mu _k\right) _d=0\). Thus, for the random variable
we get \({{\,\mathrm{E}\,}}[\tilde{M}_{kd}]=0\) and
Applying Corollary 5 yields
\(\square \)
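Theorem 9's concentration argument can be checked by simulation. The sketch below is our own illustration with hypothetical responsibilities \(p_n\), weights \(\zeta _n\), and centered coordinates; it assumes the centered terms have the form \(\tilde{M}_{kdn}=(Z_{nk}-p_{nk})\,\zeta _{nk}\,(x_n-\mu _k)_d\), which is consistent with the stated mean and variance, and estimates how often the sum exceeds \(\lambda \tau \) for \(\lambda =\sqrt{4\ln (2/\delta )}\):

```python
import numpy as np

rng = np.random.default_rng(2)
N, delta, trials = 5000, 0.05, 400

# Hypothetical instance for one component k and one dimension d:
# responsibilities p_n, weights zeta_n, centered coordinates (x_n - mu_k)_d.
p = rng.uniform(0.05, 0.95, N)
zeta = rng.uniform(0.5, 1.0, N)
xc = rng.normal(0.0, 2.0, N)

tau = np.sqrt(np.sum(p * (1 - p) * (zeta * xc) ** 2))  # sd of the sum
lam = np.sqrt(4 * np.log(2 / delta))                   # lambda from the proof

# Sum of (Z_n - p_n) * zeta_n * (x_n - mu_k)_d with Z_n ~ Bernoulli(p_n).
Z = rng.random((trials, N)) < p
M = ((Z - p) * zeta * xc).sum(axis=1)
fail_rate = np.mean(np.abs(M) >= lam * tau)
# The deviation bound lam * tau is violated far less often than delta.
```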
Lemma 10
We have
where \(y_{nk} = (x_n-\mu _k)(x_n-\mu _k)^T\) and \(\nu _k= (\mu _k-M_k)(\mu _k-M_k)^T\).
Proof
Observe that
where the second to last equality is due to the fact that
This yields the claim. \(\square \)
Theorem 11
With probability at least \(1-\delta \), we have
where
and for any \(\lambda ^{(\varSigma )}_{ kij}\) with
Proof
For each \(n\in [N]\), define the real random variable
Using the definitions, we obtain \(|\tilde{S}_{kijn}|\le \zeta _k^{max} \varDelta _i\varDelta _j\), \({{\,\mathrm{E}\,}}[\tilde{S}_{kijn}]=0\), and \(\text {Var}(\tilde{S}_{kijn})=p_{nk}(1-p_{nk})\left( \zeta _{nk} y_{nk}-\varSigma _k\right) _{ij}^2\). Then, for the random variable
we get \({{\,\mathrm{E}\,}}[\tilde{S}_{kij}]=0\) and \(\text {Var}(\tilde{S}_{kij}) = \rho ^2_{kij}\). Hence, we can apply Corollary 5 to obtain
\(\square \)
Finally, we can combine the theorems using the union bound to obtain bounds on the proximity of all update equations.
Proof
(Theorem 1) We combine Theorems 6 through 11 by taking the union bound. Hence, with probability at least \(1-K\cdot (2+ D + D^2)\cdot \delta \), Eqs. (6) through (10) hold for all \(d,i,j\in [D]\) and \(k\in [K]\).
Fix some \(d,i,j\in [D]\) and \(k\in [K]\). To prove Eq. (2), just observe that due to Eq. (7) and Eq. (8) we have
Recall that \(\nu _k= (\mu _k-M_k)(\mu _k-M_k)^T\) and \(y_{nk} = (x_n-\mu _k)(x_n-\mu _k)^T\). By Lemma 10 and by the triangle inequality, we can conclude
Due to Eq. (11), we have
Moreover, due to Eqs. (6) and (7), we know
By combining all these inequalities with Eqs. (6) and (10), we obtain
These equations hold, if \(\lambda \in \{\lambda ^{(\mu )}_{ki},\lambda ^{(\varSigma )}_{kij}\}\) fulfills
for \(C\in \{\zeta _k^{max}\varDelta _d,\zeta _k^{max}\varDelta _i\varDelta _j\}\) and \(\text {Var}(Y)\in \{\tau _{ki}^2,\rho _{kij}^2\}\), respectively. Assume that
Observe that \(\tau _{ki}^2\) and \(\rho _{kij}^2\) grow as the number of points increases. Recall that the EM algorithm should be applied under the assumption that there is a process generating data according to a mixture of the family of distributions the EM algorithm estimates. In that case, \(\zeta _k^{max}\varDelta _d\) and \(\zeta _k^{max}\varDelta _i\varDelta _j\) do not necessarily increase with the number of points: these quantities depend on the relative position of the points, not on their number. While they might increase as additional points are generated, asymptotically we expect them to remain constant. Hence, we expect that (12) holds for all sufficiently large data sets. We set \(\lambda = \sqrt{4\ln (2/\delta )}\) and obtain
This yields the claim. \(\square \)
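The scaling argument in the proof can be illustrated numerically. The sketch below is our own illustration, assuming the admissible range for \(\lambda \) has the Bernstein-type form \(\lambda \le 3\sqrt{\text {Var}(Y)}/C\) and an arbitrary per-term variance of 0.25: with \(C\) fixed, \(\text {Var}(Y)\) grows linearly in the number of points \(N\), so the admissible upper limit outgrows the required \(\sqrt{4\ln (2/\delta )}\) like \(\sqrt{N}\).

```python
import math

delta = 0.05
C = 1.0                                            # per-term bound, fixed as N grows
lam_required = math.sqrt(4 * math.log(2 / delta))  # lambda set in the proof

margins = []
for N in (10, 1_000, 100_000):
    var_Y = 0.25 * N                    # variance of the sum grows linearly in N
    lam_max = 3 * math.sqrt(var_Y) / C  # assumed admissible upper limit
    margins.append(lam_max / lam_required)
# margins grow like sqrt(N), so the condition holds for all large data sets.
```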
Blömer, J., Brauer, S., Bujna, K. et al. How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models. Adv Data Anal Classif 14, 147–173 (2020). https://doi.org/10.1007/s11634-019-00366-7