
How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

In this paper, we present a theoretical and an experimental comparison of EM and SEM algorithms for different mixture models. The SEM algorithm is a stochastic variant of the EM algorithm. The qualitative intuition behind the SEM algorithm is simple: If the number of observations is large enough, then we expect that an update step of the stochastic SEM algorithm is similar to the corresponding update step of the deterministic EM algorithm. In this paper, we quantify this intuition. We show that with high probability the update equations of any EM-like algorithm and its stochastic variant are similar, given that the input set satisfies certain properties. For instance, this result applies to the well-known EM and SEM algorithm for Gaussian mixture models and EM-like and SEM-like heuristics for multivariate power exponential distributions. Our experiments confirm that our theoretical results also hold for a large number of successive update steps. Thereby we complement the known asymptotic results for the SEM algorithm. We also show that, for multivariate Gaussian and multivariate Laplacian mixture models, an update step of SEM runs nearly twice as fast as an EM update set.
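
To make the stochastic-versus-deterministic contrast concrete, the following is a minimal sketch of one EM update and one SEM update for a Gaussian mixture, written against our notation; it is an illustration using standard NumPy/SciPy, not the implementation used in our experiments, and the function names are ours. Both updates compute the same responsibilities; SEM then replaces the soft weighting of EM by one sampled hard assignment per point.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """E-step: p_nk proportional to w_k * N(x_n | mu_k, Sigma_k)."""
    resp = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return resp / resp.sum(axis=1, keepdims=True)

def em_update(X, weights, means, covs):
    """One deterministic EM update: every point is weighted by p_nk."""
    K = len(weights)
    resp = responsibilities(X, weights, means, covs)
    r = resp.sum(axis=0)                              # soft masses r_k
    means = [(resp[:, k] @ X) / r[k] for k in range(K)]
    covs = [(resp[:, k, None] * (X - means[k])).T @ (X - means[k]) / r[k]
            for k in range(K)]
    return r / len(X), means, covs

def sem_update(X, weights, means, covs, rng):
    """One stochastic SEM update: draw Z_nk ~ p_nk, update on the partition."""
    K = len(weights)
    resp = responsibilities(X, weights, means, covs)
    z = np.array([rng.choice(K, p=p) for p in resp])  # S-step: hard assignments
    # M-step touches each point exactly once (assumes no component is empty)
    new_weights = np.array([(z == k).mean() for k in range(K)])
    means = [X[z == k].mean(axis=0) for k in range(K)]
    covs = [np.cov(X[z == k].T, bias=True) for k in range(K)]
    return new_weights, means, covs
```

The speed difference is visible in the M-steps: EM accumulates all \(N\cdot K\) weighted point–component pairs, while SEM accumulates each point once into its sampled component, which is consistent with the roughly twofold speedup reported above.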



Acknowledgements

Funding was provided by Deutsche Forschungsgemeinschaft (DE) (Grant No. BL 314/8-1).

Author information

Correspondence to Sascha Brauer.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (bz2 57 KB)

Full proof of the main theorem

In the following, we provide the full proof of Theorem 1. The key tools in our proofs are the following Chernoff-type bounds.

Lemma 2

(McDiarmid (1998)) Let \(X_1,\ldots ,X_n\) be independent random variables in [0, 1] and \(Y=\sum _{i=1}^n X_i\). Then, for all \(\lambda \in [0,1]\) we have

$$\begin{aligned} \text {Pr}\left( \left| Y-{{\,\mathrm{E}\,}}[Y] \right| \ge \lambda \cdot {{\,\mathrm{E}\,}}[Y]\right) \le 2e^{-{{\,\mathrm{E}\,}}[Y]\frac{\lambda ^2}{3}}. \end{aligned}$$

Corollary 3

Let \(X_1,\ldots ,X_n\) be independent random variables in [0, 1] and \(Y=\sum _{i=1}^n X_i\). Let \(\delta \in (0,1)\). If we have

$$\begin{aligned} {{\,\mathrm{E}\,}}[Y]\ge 3\ln (2/\delta )\ , \end{aligned}$$

then we have

$$\begin{aligned} \text {Pr}\left( |Y-{{\,\mathrm{E}\,}}[Y]|\ge \sqrt{3\ln (2/\delta )} \sqrt{{{\,\mathrm{E}\,}}[Y]} \right) \le \delta \ . \end{aligned}$$

Proof

Due to \({{\,\mathrm{E}\,}}[Y]\ge 3\ln (2/\delta )\), we have \(\lambda :=\sqrt{3\ln (2/\delta )/E[Y]}\in [0,1]\). Applying Lemma 2 yields the claim. \(\square \)
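
As a quick plausibility check of Corollary 3 (a Monte Carlo sketch with illustrative values, not part of the proof), one can estimate the failure probability for a sum of Bernoulli variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta, trials = 10_000, 0.3, 0.05, 20_000
EY = n * p                                  # E[Y] for Y = sum of Bernoulli(p)
a = np.sqrt(3 * np.log(2 / delta))          # the constant of Corollary 3
assert EY >= a**2                           # precondition E[Y] >= 3 ln(2/delta)

Y = rng.binomial(n, p, size=trials)         # independent copies of Y
rate = (np.abs(Y - EY) >= a * np.sqrt(EY)).mean()
print(rate, "<=", delta)                    # empirical failure rate vs delta
```

The observed rate is far below \(\delta \), as expected from a Chernoff-type bound that is not tight for moderate deviations.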

Lemma 4

(McDiarmid (1998)) Let \(C\ge 0\) be some constant, \(X_1,\ldots ,X_n\) be independent random variables such that \(\forall i\in [n]: {{\,\mathrm{E}\,}}[X_i] = 0\) and \(\left| X_i \right| \le C\), and let \(Y=\sum _{i=1}^n X_i\). Then, for all \(t \ge 0\) we have

$$\begin{aligned} \text {Pr}\left( Y \ge t \right) \le \exp \left( -\frac{t^2}{2\text {Var}(Y)\left( 1+\frac{tC}{3\text {Var}(Y)}\right) }\right) \ . \end{aligned}$$

Corollary 5

Let \(C\ge 0\) be some constant, \(X_1,\ldots ,X_n\) be independent random variables such that \(\forall i\in [n]: {{\,\mathrm{E}\,}}[X_i] = 0\) and \(\left| X_i \right| \le C\), and let \(Y=\sum _{i=1}^n X_i\). Then, for all \(\delta \in (0,1)\) and all \(\lambda \) with

$$\begin{aligned} \frac{\lambda ^2}{1+\lambda /3 \cdot C/\sqrt{\text {Var}(Y)}} \ge 2\ln (2/\delta ) \end{aligned}$$

we have

$$\begin{aligned} \text {Pr}\left( \left| Y \right| \ge \lambda \sqrt{\text {Var}(Y)} \right) \le \delta \ . \end{aligned}$$

Proof

We obtain the upper and the lower bound by applying Lemma 4 with \(t = \lambda \sqrt{\text {Var}(Y)}\) to the variables \(Y = \sum _{i=1}^n X_i\) and \(Y' = \sum _{i=1}^n -X_i\), respectively. Then, applying the union bound yields the claim. \(\square \)
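
Corollary 5 admits the same kind of check; the sketch below (illustrative values) uses centered Bernoulli summands, for which \(C=1\) and \(\text {Var}(Y)=\sum _{i} p_i(1-p_i)\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, trials = 1_000, 0.05, 5_000
p = rng.uniform(0.1, 0.9, size=n)           # means of the Bernoulli variables
C = 1.0                                     # |Z_i - p_i| <= 1
var_Y = np.sum(p * (1 - p))

lam = np.sqrt(4 * np.log(2 / delta))        # the choice used later for Theorem 1
assert lam**2 / (1 + lam / 3 * C / np.sqrt(var_Y)) >= 2 * np.log(2 / delta)

Z = rng.random((trials, n)) < p             # Z_i ~ Bernoulli(p_i)
Y = (Z - p).sum(axis=1)                     # centered sums, E[Y] = 0
rate = (np.abs(Y) >= lam * np.sqrt(var_Y)).mean()
print(rate, "<=", delta)
```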

1.1 Proximity of the parameter updates

In the following we fix some solution \(\theta ^{old}\) and compare a single update step of the \(\hbox {EM}^*\) algorithm with a single update step of the \(\hbox {SEM}^*\) algorithm, both given \(\theta ^{old}\). To this end, we make extensive use of the variables defined in Algorithms 3 and 4, the quantities defined in Sect. 3, and the following notation.

Notation

Throughout the rest of this section, we fix the following parameters. We let

$$\begin{aligned} \theta ^{old}=\left\{ \left( w^{old}_k,\mu ^{old}_k,\varSigma ^{old}_k\right) \right\} _{k\in [K]}\ . \end{aligned}$$

We fix some component \(k\in [K]\), dimensions \(d,i,j\in [D]\), and a failure probability \(\delta \in (0,1)\). Moreover, as in Theorem 1, we use

$$\begin{aligned} a:=\sqrt{3\ln (2/\delta )} \quad \text{ and } \quad b:=\sqrt{2e\ln (2/\delta )}\ . \end{aligned}$$
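
For concreteness, at \(\delta = 0.05\) these constants evaluate to \(a \approx 3.33\) and \(b \approx 4.48\):

```python
import math
delta = 0.05
a = math.sqrt(3 * math.log(2 / delta))           # ~3.33
b = math.sqrt(2 * math.e * math.log(2 / delta))  # ~4.48
print(a, b)
```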

First, we bound the deviations of the component masses, the mixture weights, and the \(\zeta \)-masses.

Theorem 6

(Proximity of Masses) If we have

$$\begin{aligned} r_k \ge a^2 \ , \end{aligned}$$

then with probability at least \(1-\delta \)

$$\begin{aligned} \left| R_k - r_k \right| \le a\cdot \sqrt{r_k}\ . \end{aligned}$$
(6)

Proof

Recall that \({{\,\mathrm{E}\,}}[R_k]=r_k\) and \(R_k=\sum _{n=1}^N Z_{nk}\) with \(Z_{nk}\in \{0,1\}\). Applying Corollary 3 yields the claim. \(\square \)

Remark 7

(Proximity of Mixture Weights) If Eq. (6) of Theorem 6 is satisfied, then by definition

$$\begin{aligned} \left| W_k - w_k \right| \le \frac{a}{\sqrt{r_k}} \cdot w_k\ . \end{aligned}$$
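
To spell out this step (using that, as in Algorithms 3 and 4, \(W_k = R_k/N\) and \(w_k = r_k/N\)), Eq. (6) gives

$$\begin{aligned} \left| W_k - w_k \right| = \frac{\left| R_k - r_k \right| }{N} \le \frac{a\sqrt{r_k}}{N} = \frac{a}{\sqrt{r_k}}\cdot \frac{r_k}{N} = \frac{a}{\sqrt{r_k}}\cdot w_k\ . \end{aligned}$$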

Theorem 8

(Proximity of \(\zeta \)-Masses) If we have

$$\begin{aligned} u_k \ge a^2 \zeta ^{max}_k \ , \end{aligned}$$

then with probability at least \(1-\delta \)

$$\begin{aligned} \left|u_k - U_k \right|\le a\sqrt{\zeta _k^{max}} \cdot \sqrt{u_k}\ . \end{aligned}$$
(7)

Proof

Recall that \({{\,\mathrm{E}\,}}[U_k]=u_k\) and \(U_k=\sum _{n=1}^N \zeta _{nk}Z_{nk}\) with \(\zeta _{nk}Z_{nk}\in [0,\zeta ^{max}_k]\). Applying Corollary 3 to the (scaled) random variables \(\zeta _{nk}Z_{nk}/ \zeta ^{max}_k\in [0,1]\) yields the claim. \(\square \)
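
Spelled out, Corollary 3 applied to \(Y'=U_k/\zeta ^{max}_k\) with \({{\,\mathrm{E}\,}}[Y']=u_k/\zeta ^{max}_k\ge a^2\) yields

$$\begin{aligned} \left| \frac{U_k}{\zeta ^{max}_k} - \frac{u_k}{\zeta ^{max}_k} \right| \le a\sqrt{\frac{u_k}{\zeta ^{max}_k}}\ , \end{aligned}$$

and multiplying both sides by \(\zeta ^{max}_k\) gives Eq. (7).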

Using the standard deviations \(\tau _{kd}\) and \(\rho _{kij}\), one can bound the deviations of the numerators of the mean and scale updates, respectively.

Theorem 9

With probability at least \(1-\delta \), we have

$$\begin{aligned} \left| \sum _{n=1}^NZ_{nk}\zeta _{nk}\left( x_n-\mu _k\right) _d \right| \le \lambda ^{(\mu )}_{ kd}\cdot \tau _{kd} \end{aligned}$$
(8)

for any \(\lambda ^{(\mu )}_{ kd}\) with

$$\begin{aligned} \frac{(\lambda ^{(\mu )}_{ kd})^2}{1+\lambda ^{(\mu )}_{ kd}/3 \cdot \zeta _k^{max}\varDelta _d/\tau _{kd}} \ge 2\ln (2/\delta ) \ . \end{aligned}$$

Proof

For each \(n\in [N]\), define the real random variable

$$\begin{aligned} \tilde{M}_{kdn} :=(Z_{nk}-p_{nk})\zeta _{nk}\left( x_n-\mu _k\right) _d. \end{aligned}$$

Since \({{\,\mathrm{E}\,}}[Z_{nk}]=p_{nk}\), we get that \({{\,\mathrm{E}\,}}\left[ \tilde{M}_{kdn}\right] =0\) and

$$\begin{aligned} \text {Var}(\tilde{M}_{kdn})=p_{nk}(1-p_{nk})\zeta _{nk}^{2}(x_n-\mu _k)_d^2\ . \end{aligned}$$

Furthermore, since each \(\mu _k\) is a convex combination of \(x_1,\ldots ,x_N\), we obtain

$$\begin{aligned} |\tilde{M}_{kdn}|\le |Z_{nk}-p_{nk}|\cdot \left| \zeta _{nk} \right| \cdot \left| \left( x_{n}-\mu _k\right) _d \right| \le \varDelta _d\cdot \zeta _k^{max}. \end{aligned}$$

Note that by definition of \(\mu _k\), we have \(\sum _{n=1}^N p_{nk}\zeta _{nk}\left( x_n-\mu _k\right) _d=0\). Thus, for the random variable

$$\begin{aligned} \tilde{M}_{kd}= \sum _{n=1}^N \tilde{M}_{kdn}= \sum _{n=1}^N Z_{nk}\zeta _{nk}\left( x_n-\mu _k\right) _d\ , \end{aligned}$$

we get \({{\,\mathrm{E}\,}}[\tilde{M}_{kd}]=0\) and

$$\begin{aligned} \text {Var}(\tilde{M}_{kd}) =\sum _{n=1}^N \text {Var}(\tilde{M}_{kdn}) =\sum _{n=1}^N p_{nk}(1-p_{nk})\zeta _{nk}^{2}(x_n-\mu _k)_d^2=\tau ^2_{kd}\ . \end{aligned}$$

Applying Corollary 5 yields

$$\begin{aligned} \text {Pr}\left( \left| \sum _{n=1}^NZ_{nk}\zeta _{nk}\left( x_n-\mu _k\right) _d \right| \ge \lambda ^{(\mu )}_{ kd}\cdot \tau _{kd}\right) \le \delta \ . \end{aligned}$$

\(\square \)
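
The concentration of Theorem 9 is easy to probe numerically. The following sketch (an illustrative setup with \(\zeta _{nk}=1\), as in the Gaussian case, and \(\mu _k\) chosen so that \(\sum _{n} p_{nk}\zeta _{nk}(x_n-\mu _k)_d=0\)) estimates how often the sum in Eq. (8) exceeds \(\lambda ^{(\mu )}_{kd}\cdot \tau _{kd}\):

```python
import numpy as np

rng = np.random.default_rng(2)
N, delta, trials = 5_000, 0.05, 2_000
x = rng.normal(size=N)                       # one coordinate of the data
p = rng.uniform(0.05, 0.95, size=N)          # responsibilities p_nk
zeta = np.ones(N)                            # zeta_nk = 1 (Gaussian case)
mu = np.sum(p * zeta * x) / np.sum(p * zeta) # so sum_n p*zeta*(x - mu) = 0
tau = np.sqrt(np.sum(p * (1 - p) * zeta**2 * (x - mu) ** 2))

lam = np.sqrt(4 * np.log(2 / delta))
Delta = np.max(np.abs(x - mu))               # stands in for Delta_d
assert lam**2 / (1 + lam / 3 * Delta / tau) >= 2 * np.log(2 / delta)

Z = rng.random((trials, N)) < p              # Z_nk ~ Bernoulli(p_nk)
S = (Z * zeta * (x - mu)).sum(axis=1)        # the sum bounded in Eq. (8)
print((np.abs(S) >= lam * tau).mean(), "<=", delta)
```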

Lemma 10

We have

$$\begin{aligned} S_k = \frac{\sum _{n=1}^NZ_{nk}\zeta _{nk}y_{nk}}{\sum _{n=1}^NZ_{nk}} - \frac{\sum _{n=1}^N Z_{nk}\zeta _{nk}}{\sum _{n=1}^N Z_{nk}} \nu _k \ , \end{aligned}$$
(9)

where \(y_{nk} = (x_n-\mu _k)(x_n-\mu _k)^T\) and \(\nu _k= (\mu _k-M_k)(\mu _k-M_k)^T\).

Proof

Observe that

$$\begin{aligned}&\sum _{n=1}^NZ_{nk}\zeta _{nk}y_{nk} \\&\quad = \sum _{n=1}^NZ_{nk}\zeta _{nk} (x_n-\mu _k)(x_n-\mu _k)^T \\&\quad = \sum _{n=1}^NZ_{nk}\zeta _{nk} (x_n-M_k+M_k-\mu _k)(x_n-M_k+M_k-\mu _k)^T \\&\quad = \sum _{n=1}^NZ_{nk}\zeta _{nk}(x_n-M_k)(x_n-M_k)^T - \left( \sum _{n=1}^NZ_{nk}\zeta _{nk}(x_n-M_k)\right) (\mu _k-M_k)^T \\&\qquad - (\mu _k-M_k) \left( \sum _{n=1}^NZ_{nk}\zeta _{nk}(x_n-M_k)\right) ^T + \left( \sum _{n=1}^N Z_{nk}\zeta _{nk}\right) \nu _k \\&\quad = \sum _{n=1}^NZ_{nk}\zeta _{nk}(x_n-M_k)(x_n-M_k)^T + \left( \sum _{n=1}^N Z_{nk}\zeta _{nk}\right) \nu _k\\&\quad = \left( \sum _{n=1}^NZ_{nk}\right) S_k + \left( \sum _{n=1}^N Z_{nk}\zeta _{nk}\right) \nu _k\ , \end{aligned}$$

where the second to last equality is due to the fact that

$$\begin{aligned} \sum _{n=1}^NZ_{nk}\zeta _{nk}(x_n-M_k)&= \left( \sum _{n=1}^NZ_{nk}\zeta _{nk} x_n\right) - \left( \sum _{n=1}^NZ_{nk}\zeta _{nk}\right) M_k = \mathbf {0}. \end{aligned}$$

This yields the claim. \(\square \)
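
Since Eq. (9) is a purely algebraic identity, it can be verified exactly on random data. The sketch below assumes, as in the proof, that \(M_k = \sum _{n} Z_{nk}\zeta _{nk}x_n/\sum _{n} Z_{nk}\zeta _{nk}\) and \(S_k = \sum _{n} Z_{nk}\zeta _{nk}(x_n-M_k)(x_n-M_k)^T/\sum _{n} Z_{nk}\):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 200, 3
x = rng.normal(size=(N, D))
Z = rng.integers(0, 2, size=N)               # sampled hard assignments Z_nk
zeta = rng.uniform(0.5, 2.0, size=N)         # zeta-responsibilities
mu = rng.normal(size=D)                      # an arbitrary mu_k

w = Z * zeta
M = w @ x / w.sum()                                 # M_k
S = (w[:, None] * (x - M)).T @ (x - M) / Z.sum()    # S_k
nu = np.outer(mu - M, mu - M)                       # nu_k
y = np.einsum('ni,nj->nij', x - mu, x - mu)         # y_nk

rhs = np.einsum('n,nij->ij', w, y) / Z.sum() - w.sum() / Z.sum() * nu
assert np.allclose(S, rhs)                   # Eq. (9) holds exactly
```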

Theorem 11

With probability at least \(1-\delta \), we have

$$\begin{aligned} \left| \sum _{n=1}^N Z_{nk}\left( \zeta _{nk}y_{nk}-\varSigma _k\right) _{ij} \right| \le \lambda ^{(\varSigma )}_{ kij}\cdot \rho _{kij} \end{aligned}$$
(10)

for any \(\lambda ^{(\varSigma )}_{ kij}\) with

$$\begin{aligned} \frac{(\lambda ^{(\varSigma )}_{ kij})^2}{1+\lambda ^{(\varSigma )}_{ kij}/3 \cdot \zeta _k^{max}\varDelta _i\varDelta _j/\rho _{kij}} \ge 2\ln (2/\delta )\ , \end{aligned}$$

where \(y_{nk} = (x_n - \mu _k)(x_n-\mu _k)^T\).

Proof

For each \(n\in [N]\), define the real random variable

$$\begin{aligned} \tilde{S}_{kijn} :=(Z_{nk}-p_{nk})\left( \zeta _{nk} y_{nk}-\varSigma _k \right) _{ij}\ . \end{aligned}$$

Using the definitions, we obtain \(|\tilde{S}_{kijn}|\le \zeta _k^{max} \varDelta _i\varDelta _j\), \({{\,\mathrm{E}\,}}[\tilde{S}_{kijn}]=0\), and \(\text {Var}(\tilde{S}_{kijn})=p_{nk}(1-p_{nk})\left( \zeta _{nk} y_{nk}-\varSigma _k\right) _{ij}^2\). Then, for the random variable

$$\begin{aligned} \tilde{S}_{kij} :=\sum _{n=1}^N \tilde{S}_{kijn} = \sum _{n=1}^N Z_{nk}\left( \zeta _{nk} y_{nk}-\varSigma _k\right) _{ij}\ , \end{aligned}$$

we get \({{\,\mathrm{E}\,}}[\tilde{S}_{kij}]=0\) and \(\text {Var}(\tilde{S}_{kij}) = \rho ^2_{kij}\). Hence, we can apply Corollary 5 to obtain

$$\begin{aligned} \text {Pr}\left( \left| \sum _{n=1}^N Z_{nk}\left( \zeta _{nk} y_{nk}-\varSigma _k\right) _{ij} \right| \ge \lambda ^{(\varSigma )}_{ kij}\cdot \rho _{kij} \right) \le \delta \ . \end{aligned}$$

\(\square \)

Finally, we can combine the theorems using the union bound to obtain bounds on the proximity of all update equations.

Proof

(Theorem 1) We combine Theorems 6, 8, 9, and 11 by taking the union bound. Hence, with probability at least \(1-K\cdot (2+ D + D^2)\cdot \delta \), Eqs. (6) through (10) hold for all \(d,i,j\in [D]\) and \(k\in [K]\).

Fix some \(d,i,j\in [D]\) and \(k\in [K]\). To prove Eq. (2), just observe that due to Eq. (7) and Eq. (8) we have

$$\begin{aligned} \left| \left( M_k-\mu _k\right) _d \right| =\frac{\left| \sum _{n=1}^NZ_{nk}\zeta _{nk} \left( x_n-\mu _k\right) _d \right| }{\left| U_k \right| } \le \frac{\lambda ^{(\mu )}_{kd}}{\sqrt{u_k}-a\sqrt{\zeta ^{max}_k}}\cdot \frac{\tau _{kd}}{\sqrt{u_k}} \ . \end{aligned}$$
(11)

Recall that \(\nu _k= (\mu _k-M_k)(\mu _k-M_k)^T\) and \(y_{nk} = (x_n-\mu _k)(x_n-\mu _k)^T\). By Lemma 10 and by the triangle inequality, we can conclude

$$\begin{aligned}&\left| \left( S_k-\varSigma _k\right) _{ij} \right| \\&\quad =\left| \left( \frac{\sum _{n=1}^NZ_{nk}\zeta _{nk}y_{nk}}{\sum _{n=1}^NZ_{nk}} -\frac{\sum _{n=1}^N Z_{nk}\zeta _{nk}}{\sum _{n=1}^N Z_{nk}} \nu _k -\varSigma _k \right) _{ij} \right| \\&\quad \le \frac{ \left| \left( \sum _{n=1}^N Z_{nk}\left( \zeta _{nk}y_{nk}-\varSigma _k\right) \right) _{ij} \right| }{R_k } + \frac{U_k}{R_k } \cdot \left| \left( \nu _k\right) _{ij} \right| \ . \end{aligned}$$

Due to Eq. (11), we have

$$\begin{aligned} \left| \left( \nu _k\right) _{ij} \right| = \left| \left( \mu _k-M_k\right) _i \right| \cdot \left| \left( \mu _k-M_k\right) _j \right| \le \frac{\lambda ^{(\mu )}_{ki}\lambda ^{(\mu )}_{kj}}{(\sqrt{u_k}-a\sqrt{\zeta ^{max}_k})^2}\frac{\tau _{ki}\tau _{kj}}{u_k} \ . \end{aligned}$$

Moreover, due to Eqs. (6) and (7), we know

$$\begin{aligned} \frac{U_k}{R_k}&\le \frac{\sqrt{u_k}+a\sqrt{\zeta ^{max}_k}}{\sqrt{r_k}-a}\cdot \frac{\sqrt{u_k}}{\sqrt{r_k}} \ . \end{aligned}$$

By combining all these inequalities with Eqs. (6) and (10), we obtain

$$\begin{aligned}&\left| \left( S_k-\varSigma _k\right) _{ij} \right| \\&\quad \le \frac{\lambda ^{(\varSigma )}_{ kij}}{\sqrt{r_k} - a}\cdot \frac{\rho _{kij}}{\sqrt{r_k}} + \frac{\sqrt{u_k}+a\sqrt{\zeta ^{max}_k}}{\sqrt{r_k}-a} \cdot \frac{\sqrt{u_k}}{\sqrt{r_k}} \cdot \frac{\lambda ^{(\mu )}_{ki}\lambda ^{(\mu )}_{kj}}{(\sqrt{u_k}-a\sqrt{\zeta ^{max}_k})^2}\frac{\tau _{ki}\tau _{kj}}{u_k}\\&\quad = \frac{\lambda ^{(\varSigma )}_{ kij}}{\sqrt{r_k} - a}\cdot \frac{\rho _{kij}}{\sqrt{r_k}} + \frac{\sqrt{u_k}+a\sqrt{\zeta ^{max}_k}}{(\sqrt{u_k}-a\sqrt{\zeta ^{max}_k})^2} \cdot \frac{\lambda ^{(\mu )}_{ki}\lambda ^{(\mu )}_{kj}}{\sqrt{r_k}-a}\cdot \frac{\tau _{ki}\tau _{kj}}{\sqrt{r_k u_k}}\ . \end{aligned}$$

These inequalities hold if each \(\lambda \in \{\lambda ^{(\mu )}_{ki},\lambda ^{(\mu )}_{kj},\lambda ^{(\varSigma )}_{kij}\}\) fulfills

$$\begin{aligned} \frac{\lambda ^2}{1+\lambda /3 \cdot C/\sqrt{\text {Var}(Y)}} \ge 2\ln (2/\delta ) \ , \end{aligned}$$

for \(C\in \{\zeta _k^{max}\varDelta _i,\zeta _k^{max}\varDelta _j,\zeta _k^{max}\varDelta _i\varDelta _j\}\) and \(\text {Var}(Y)\in \{\tau _{ki}^2,\tau _{kj}^2,\rho _{kij}^2\}\), respectively. Assume that

$$\begin{aligned} \text {Var}(Y) \ge \frac{4}{9}\cdot C^2\cdot \ln (2/\delta ) \ . \end{aligned}$$
(12)

Observe that \(\tau _{ki}^2\) and \(\rho _{kij}^2\) grow as the number of points increases. Recall that the EM algorithm should be applied under the assumption that some process generates the data according to a mixture from the family of distributions that the EM algorithm estimates. In that case, \(\zeta _k^{max}\varDelta _i\) and \(\zeta _k^{max}\varDelta _i\varDelta _j\) do not strictly increase with the number of points: these quantities depend on the relative position of the points, not on their number. While they might grow as additional points are generated, we expect them to be asymptotically constant. Hence, we expect that (12) holds for all sufficiently large data sets. We set \(\lambda = \sqrt{4\ln (2/\delta )}\) and obtain

$$\begin{aligned} \frac{4\ln (2/\delta )}{1+\sqrt{4\ln (2/\delta )}/3 \cdot C/\sqrt{\text {Var}(Y)}} \ge \frac{4\ln (2/\delta )}{1+\sqrt{4\ln (2/\delta )}/3 \cdot 3/\sqrt{4\ln (2/\delta )}} = 2\ln (2/\delta ) \ . \end{aligned}$$

This yields the claim. \(\square \)

About this article

Cite this article

Blömer, J., Brauer, S., Bujna, K. et al. How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models. Adv Data Anal Classif 14, 147–173 (2020). https://doi.org/10.1007/s11634-019-00366-7

