Abstract
In this paper, we present a theoretical and an experimental comparison of EM and SEM algorithms for different mixture models. The SEM algorithm is a stochastic variant of the EM algorithm. The qualitative intuition behind the SEM algorithm is simple: if the number of observations is large enough, then we expect an update step of the stochastic SEM algorithm to be similar to the corresponding update step of the deterministic EM algorithm. In this paper, we quantify this intuition. We show that, with high probability, the update equations of any EM-like algorithm and its stochastic variant are similar, given that the input set satisfies certain properties. For instance, this result applies to the well-known EM and SEM algorithms for Gaussian mixture models and to EM-like and SEM-like heuristics for multivariate power exponential distributions. Our experiments confirm that our theoretical results also hold for a large number of successive update steps. Thereby, we complement the known asymptotic results for the SEM algorithm. We also show that, for multivariate Gaussian and multivariate Laplacian mixture models, an update step of SEM runs nearly twice as fast as an EM update step.
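For readers who want a concrete picture of the two update rules, the following minimal Python sketch (our own illustration, not the implementation evaluated in this paper; all names are ours) contrasts a single EM update, which weights each point by its posterior responsibility, with a single SEM update, which first draws one hard assignment per point from that posterior and then updates on the resulting partition:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated one-dimensional Gaussian clusters.
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])

def posteriors(x, w, mu, sigma):
    # E-step: responsibility p_nk of component k for point n.
    dens = np.stack([
        w[k] * np.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
        / (np.sqrt(2 * np.pi) * sigma[k])
        for k in range(2)
    ])
    return dens / dens.sum(axis=0)

def em_update(x, w, mu, sigma):
    p = posteriors(x, w, mu, sigma)                 # soft assignments
    r = p.sum(axis=1)                               # expected masses r_k
    new_mu = (p * x).sum(axis=1) / r
    return r / x.size, new_mu

def sem_update(x, w, mu, sigma, rng):
    p = posteriors(x, w, mu, sigma)
    # S-step: one hard assignment per point, drawn from its posterior.
    z = (rng.random(x.size) < p[1]).astype(int)
    R = np.array([(z == 0).sum(), (z == 1).sum()])  # sampled masses R_k
    new_mu = np.array([x[z == 0].mean(), x[z == 1].mean()])
    return R / x.size, new_mu

w0, mu0, s0 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([3.0, 3.0])
w_em, mu_em = em_update(x, w0, mu0, s0)
w_sem, mu_sem = sem_update(x, w0, mu0, s0, rng)
```

With well-separated clusters and many points, the sampled masses and means land close to their deterministic counterparts, which is exactly the intuition quantified below.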
References
Bilmes J (1998) A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical report, Computer Science Division, Department of Electrical Engineering and Computer Science, U.C. Berkeley
Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer, New York
Blömer J, Bujna K, Kuntze D (2014) A theoretical and experimental comparison of the EM and SEM algorithm. In: 2014 22nd international conference on pattern recognition, pp 1419–1424. https://doi.org/10.1109/icpr.2014.253
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the em algorithm for the mixture problem. Comput Stat Q 2:73–82
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332. https://doi.org/10.1016/0167-9473(92)90042-E
Celeux G, Chauveau D, Diebolt J (1995) On stochastic versions of the EM algorithm. Research report RR-2514, INRIA Paris-Rocquencourt. https://hal.inria.fr/inria-00074164. Accessed 4 July 2019
Celeux G, Chauveau D, Diebolt J (1996) Stochastic versions of the EM algorithm: an experimental study in the mixture case. J Stat Comput Simul 55(4):287–314. https://doi.org/10.1080/00949659608811772
Dang UJ, Browne RP, McNicholas PD (2015) Mixtures of multivariate power exponential distributions. Biometrics 71(4):1081–1089. https://doi.org/10.1111/biom.12351
Dasgupta S, Schulman L (2007) A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J Mach Learn Res 8:203–226
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Stat Methodol 39(1):1–38
Dias JG, Wedel M (2004) An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Stat Comput 14(4):323–332. https://doi.org/10.1023/B:STCO.0000039481.32211.5a
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 4 July 2019
Gómez E, Gómez-Villegas MA, Marín JM (1998) A multivariate generalization of the power exponential family of distributions. Commun Stat Theory Methods 27(3):589–600. https://doi.org/10.1080/03610929808832115
Ip EHS (1994) A stochastic EM estimator in the presence of missing data—theory and applications. PhD thesis, Stanford University
ISO (2012) ISO/IEC 14882:2011 information technology—programming languages—C++. International Organization for Standardization, Geneva, Switzerland
McDiarmid C (1998) Concentration. In: Habib M, McDiarmid C, Ramirez-Alfonsin J, Reed B (eds) Probabilistic methods for algorithmic discrete mathematics. Springer, Berlin, pp 195–248. https://doi.org/10.1007/978-3-662-12788-9_6
McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions (Wiley series in probability and statistics). Wiley, Hoboken. https://doi.org/10.1002/9780470191613
Nielsen SF (2000a) On simulated EM algorithms. J Econom 96(2):267–292. https://doi.org/10.1016/S0304-4076(99)00060-3
Nielsen SF (2000b) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6(3):457–489. https://doi.org/10.2307/3318671
Zhang J, Liang F (2010) Robust clustering using exponential power mixtures. Biometrics 66(4):1078–1086. https://doi.org/10.1111/j.1541-0420.2010.01389.x
Acknowledgements
Funding was provided by Deutsche Forschungsgemeinschaft (DE) (Grant No. BL 314/8-1).
Full proof of the main theorem
In the following, we provide the full proof of Theorem 1. Key tools in our proofs are the following Chernoff-type bounds.
Lemma 2
(McDiarmid (1998)) Let \(X_1,\ldots ,X_n\) be independent random variables in [0, 1] and \(Y=\sum _{i=1}^n X_i\). Then, for all \(\lambda \in [0,1]\) we have
$$\begin{aligned} \Pr \left[ \left| Y-{{\,\mathrm{E}\,}}[Y]\right| \ge \lambda \cdot {{\,\mathrm{E}\,}}[Y]\right] \le 2\exp \left( -\frac{\lambda ^2\,{{\,\mathrm{E}\,}}[Y]}{3}\right) . \end{aligned}$$
Corollary 3
Let \(X_1,\ldots ,X_n\) be independent random variables in [0, 1] and \(Y=\sum _{i=1}^n X_i\). Let \(\delta \in (0,1)\). If we have
$$\begin{aligned} {{\,\mathrm{E}\,}}[Y]\ge 3\ln (2/\delta ), \end{aligned}$$
then with probability at least \(1-\delta \) we have
$$\begin{aligned} \left| Y-{{\,\mathrm{E}\,}}[Y]\right| \le \sqrt{3\ln (2/\delta )\,{{\,\mathrm{E}\,}}[Y]}. \end{aligned}$$
Proof
Due to \({{\,\mathrm{E}\,}}[Y]\ge 3\ln (2/\delta )\), we have \(\lambda :=\sqrt{3\ln (2/\delta )/{{\,\mathrm{E}\,}}[Y]}\in [0,1]\). Applying Lemma 2 yields the claim. \(\square \)
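The concentration behavior behind Corollary 3 is easy to check empirically. The sketch below (our own illustration, assuming the deviation bound has the form \(\sqrt{3\ln (2/\delta )\,{{\,\mathrm{E}\,}}[Y]}\), which matches the choice of \(\lambda \) in the proof) estimates the failure frequency for sums of Bernoulli variables:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta, trials = 2000, 0.3, 0.05, 1000

EY = n * p                                   # E[Y] = 600 >> 3 ln(2/delta),
                                             # so the precondition holds
bound = np.sqrt(3 * np.log(2 / delta) * EY)  # deviation allowed by Corollary 3

# Y is a sum of n independent [0,1]-valued (here Bernoulli) variables.
Y = rng.binomial(n, p, size=trials)
fail_rate = np.mean(np.abs(Y - EY) >= bound)
# fail_rate stays below delta (here it is far smaller).
```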
Lemma 4
(McDiarmid (1998)) Let \(C\ge 0\) be some constant, \(X_1,\ldots ,X_n\) be independent random variables, such that \(\forall i\in [n]: {{\,\mathrm{E}\,}}[X_i] = 0\) and \(\left| X_i \right| \le C\), and let \(Y=\sum _{i=1}^n X_i\). Then, for all \(\lambda \ge 0\) we have
$$\begin{aligned} \Pr \left[ Y\ge \lambda \right] \le \exp \left( -\frac{\lambda ^2}{2\left( \text {Var}(Y)+C\lambda /3\right) }\right) . \end{aligned}$$
Corollary 5
Let \(C\ge 0\) be some constant, \(X_1,\ldots ,X_n\) be independent random variables, such that \(\forall i\in [n]: {{\,\mathrm{E}\,}}[X_i] = 0\) and \(\left| X_i \right| \le C\), and let \(Y=\sum _{i=1}^n X_i\). Then, for all \(\delta \in (0,1)\) and all \(\lambda \) with
$$\begin{aligned} \sqrt{4\ln (2/\delta )}\le \lambda \le \frac{3\sqrt{\text {Var}(Y)}}{C} \end{aligned}$$
we have
$$\begin{aligned} \Pr \left[ \left| Y\right| \le \lambda \sqrt{\text {Var}(Y)}\right] \ge 1-\delta . \end{aligned}$$
Proof
We obtain the upper and the lower bound by applying Lemma 4 with \(t = \lambda \sqrt{\text {Var}(Y)}\) to the variables \(Y = \sum _{i=1}^n X_i\) and \(Y' = \sum _{i=1}^n -X_i\), respectively. Then, applying the union bound yields the claim. \(\square \)
1.1 Proximity of the parameter updates
In the following we fix some solution \(\theta ^{old}\) and compare a single update step of the \(\hbox {EM}^*\) algorithm with a single update step of the \(\hbox {SEM}^*\) algorithm, both given \(\theta ^{old}\). To this end, we make extensive use of the variables defined in Algorithms 3 and 4, the quantities defined in Sect. 3, and the following notation.
Notation
Throughout the rest of this section, we fix the following parameters. We let
We fix some component \(k\in [K]\), dimensions \(d,i,j\in [D]\), and a probability of success \(\delta \in (0,1)\). Moreover, as in Theorem 1, we use
First, we bound the difference in the weights and the \(\zeta \)-responsibilities.
Theorem 6
(Proximity of Masses) If we have
$$\begin{aligned} r_k\ge 3\ln (2/\delta ), \end{aligned}$$
then with probability at least \(1-\delta \)
$$\begin{aligned} \left| R_k-r_k\right| \le \sqrt{3\ln (2/\delta )\,r_k}. \end{aligned}$$
Proof
Recall that \({{\,\mathrm{E}\,}}[R_k]=r_k\) and \(R_k=\sum _{n=1}^N Z_{nk}\) with \(Z_{nk}\in \{0,1\}\). Applying Corollary 3 yields the claim. \(\square \)
Remark 7
(Proximity of Mixture Weights) If Eq. (6) of Theorem 6 is satisfied, then by definition
Theorem 8
(Proximity of \(\zeta \)-Masses) If we have
$$\begin{aligned} u_k\ge 3\,\zeta ^{max}_k\ln (2/\delta ), \end{aligned}$$
then with probability at least \(1-\delta \)
$$\begin{aligned} \left| U_k-u_k\right| \le \sqrt{3\ln (2/\delta )\,\zeta ^{max}_k\,u_k}. \end{aligned}$$
Proof
Recall that \({{\,\mathrm{E}\,}}[U_k]=u_k\) and \(U_k=\sum _{n=1}^N \zeta _{nk}Z_{nk}\) with \(\zeta _{nk}Z_{nk}\in [0,\zeta ^{max}_k]\). Applying Corollary 3 to the (scaled) random variables \(\zeta _{nk}Z_{nk}/ \zeta ^{max}_k\in [0,1]\) yields the claim. \(\square \)
By using the standard deviations \(\tau _{kd}\) and \(\rho _{kij}\), one can bound the difference between the numerators of the mean and scale updates, respectively.
Theorem 9
With probability at least \(1-\delta \), we have
for any \(\lambda ^{(\mu )}_{ kd}\) with
Proof
For each \(n\in [N]\), define the real random variable
Since \({{\,\mathrm{E}\,}}[Z_{nk}]=p_{nk}\), we get that \({{\,\mathrm{E}\,}}\left[ \tilde{M}_{kdn}\right] =0\) and
Furthermore, since each \(\mu _k\) is a convex combination of \(x_1,\ldots ,x_N\), we obtain
Note that by definition of \(\mu _k\), we have \(\sum _{n=1}^N p_{nk}\zeta _{nk}\left( x_n-\mu _k\right) _d=0\). Thus, for the random variable
we get \({{\,\mathrm{E}\,}}[\tilde{M}_{kd}]=0\) and
Applying Corollary 5 yields
\(\square \)
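Theorem 9's concentration argument can be checked by simulation. The sketch below is our own illustration with hypothetical responsibilities \(p_n\), weights \(\zeta _n\), and centered coordinates; it assumes the centered terms have the form \(\tilde{M}_{kdn}=(Z_{nk}-p_{nk})\,\zeta _{nk}\,(x_n-\mu _k)_d\), which is consistent with the stated mean and variance, and estimates how often the sum exceeds \(\lambda \tau \) for \(\lambda =\sqrt{4\ln (2/\delta )}\):

```python
import numpy as np

rng = np.random.default_rng(2)
N, delta, trials = 5000, 0.05, 400

# Hypothetical instance for one component k and one dimension d:
# responsibilities p_n, weights zeta_n, centered coordinates (x_n - mu_k)_d.
p = rng.uniform(0.05, 0.95, N)
zeta = rng.uniform(0.5, 1.0, N)
xc = rng.normal(0.0, 2.0, N)

tau = np.sqrt(np.sum(p * (1 - p) * (zeta * xc) ** 2))  # sd of the sum
lam = np.sqrt(4 * np.log(2 / delta))                   # lambda from the proof

# Sum of (Z_n - p_n) * zeta_n * (x_n - mu_k)_d with Z_n ~ Bernoulli(p_n).
Z = rng.random((trials, N)) < p
M = ((Z - p) * zeta * xc).sum(axis=1)
fail_rate = np.mean(np.abs(M) >= lam * tau)
# The deviation bound lam * tau is violated far less often than delta.
```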
Lemma 10
We have
where \(y_{nk} = (x_n-\mu _k)(x_n-\mu _k)^T\) and \(\nu _k= (\mu _k-M_k)(\mu _k-M_k)^T\).
Proof
Observe that
where the second to last equality is due to the fact that
This yields the claim. \(\square \)
Theorem 11
With probability at least \(1-\delta \), we have
where
and for any \(\lambda ^{(\varSigma )}_{ kij}\) with
Proof
For each \(n\in [N]\), define the real random variable
Using the definitions, we obtain \(|\tilde{S}_{kijn}|\le \zeta _k^{max} \varDelta _i\varDelta _j\), \({{\,\mathrm{E}\,}}[\tilde{S}_{kijn}]=0\), and \(\text {Var}(\tilde{S}_{kijn})=p_{nk}(1-p_{nk})\left( \zeta _{nk} y_{nk}-\varSigma _k\right) _{ij}^2\). Then, for the random variable
we get \({{\,\mathrm{E}\,}}[\tilde{S}_{kij}]=0\) and \(\text {Var}(\tilde{S}_{kij}) = \rho ^2_{kij}\). Hence, we can apply Corollary 5 to obtain
\(\square \)
Finally, we can combine the theorems using the union bound to obtain bounds on the proximity of all update equations.
Proof
(Theorem 1) We combine Theorems 6 through 11 by taking the union bound. Hence, with probability at least \(1-K\cdot (2+ D + D^2)\cdot \delta \), Eqs. (6) through (10) hold for all \(d,i,j\in [D]\) and \(k\in [K]\).
Fix some \(d,i,j\in [D]\) and \(k\in [K]\). To prove Eq. (2), just observe that due to Eq. (7) and Eq. (8) we have
Recall that \(\nu _k= (\mu _k-M_k)(\mu _k-M_k)^T\) and \(y_{nk} = (x_n-\mu _k)(x_n-\mu _k)^T\). By Lemma 10 and by the triangle inequality, we can conclude
Due to Eq. (11), we have
Moreover, due to Eqs. (6) and (7), we know
By combining all these inequalities with Eqs. (6) and (10), we obtain
These equations hold, if \(\lambda \in \{\lambda ^{(\mu )}_{ki},\lambda ^{(\varSigma )}_{kij}\}\) fulfills
for \(C\in \{\zeta _k^{max}\varDelta _d,\zeta _k^{max}\varDelta _i\varDelta _j\}\) and \(\text {Var}(Y)\in \{\tau _{ki}^2,\rho _{kij}^2\}\), respectively. Assume that
Observe that \(\tau _{ki}^2\) and \(\rho _{kij}^2\) grow as the number of points increases. Recall that the EM algorithm should be applied under the assumption that there is a process generating data according to a mixture of the family of distributions the EM algorithm estimates. In that case, \(\zeta _k^{max}\varDelta _d\) and \(\zeta _k^{max}\varDelta _i\varDelta _j\) do not necessarily increase with the number of points: these quantities depend on the relative position of the points, not on their number. While they might increase as additional points are generated, asymptotically we expect them to remain constant. Hence, we expect that (12) holds for all sufficiently large data sets. We set \(\lambda = \sqrt{4\ln (2/\delta )}\) and obtain
This yields the claim. \(\square \)
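The scaling argument in the proof can be illustrated numerically. The sketch below is our own illustration, assuming the admissible range for \(\lambda \) has the Bernstein-type form \(\lambda \le 3\sqrt{\text {Var}(Y)}/C\) and an arbitrary per-term variance of 0.25: with \(C\) fixed, \(\text {Var}(Y)\) grows linearly in the number of points \(N\), so the admissible upper limit outgrows the required \(\sqrt{4\ln (2/\delta )}\) like \(\sqrt{N}\).

```python
import math

delta = 0.05
C = 1.0                                            # per-term bound, fixed as N grows
lam_required = math.sqrt(4 * math.log(2 / delta))  # lambda set in the proof

margins = []
for N in (10, 1_000, 100_000):
    var_Y = 0.25 * N                    # variance of the sum grows linearly in N
    lam_max = 3 * math.sqrt(var_Y) / C  # assumed admissible upper limit
    margins.append(lam_max / lam_required)
# margins grow like sqrt(N), so the condition holds for all large data sets.
```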
Blömer, J., Brauer, S., Bujna, K. et al. How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models. Adv Data Anal Classif 14, 147–173 (2020). https://doi.org/10.1007/s11634-019-00366-7