
An effective strategy for initializing the EM algorithm in finite mixture models


Abstract

Finite mixture models represent one of the most popular tools for modeling heterogeneous data. The traditional approach to parameter estimation is based on maximizing the likelihood function, but direct optimization is often troublesome due to the complex likelihood structure. The expectation–maximization (EM) algorithm proves to be an effective remedy that alleviates this issue. However, the solution obtained by this procedure is entirely driven by the choice of starting parameter values, which highlights the importance of an effective initialization strategy. Despite the efforts undertaken in this area, no uniform winner has been found, and practitioners tend to ignore the issue, often obtaining misleading or erroneous results. In this paper, we propose a simple yet effective tool for initializing the expectation–maximization algorithm in the mixture modeling setting. The idea is based on model averaging and proves to be efficient in detecting correct solutions even in cases where competitors perform poorly. The utility of the proposed methodology is shown through a comprehensive simulation study and an application to a well-known classification dataset, with good results.


References

  • Azzalini A, Dalla Valle A (1996) The multivariate skew-normal distribution. Biometrika 83:715–726
  • Baudry J-P, Raftery A, Celeux G, Lo K, Gottardo R (2010) Combining mixture components for clustering. J Comput Graph Stat 19:332–353
  • Biernacki C (2004) Initializing EM using the properties of its trajectories in Gaussian mixtures. Stat Comput 14:267–279
  • Biernacki C, Celeux G, Govaert G (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 41:561–575
  • Bouveyron C, Brunet C (2013) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52–78
  • Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Austr J Zool 22:417–425
  • Celebi ME, Kingravi HA, Vela PA (2012) A comparative study of efficient initialization methods for the \(k\)-means clustering algorithm. Comput Res Reposit. arXiv:1209.1960
  • Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat 2:73–82
  • Chen WC, Maitra R (2015) EMCluster: EM algorithm for model-based clustering of finite mixture Gaussian distribution. R package. http://cran.r-project.org/package=EMCluster
  • Dias J, Wedel M (2004) An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Stat Comput 14:323–332
  • Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780
  • Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 20:270–281
  • Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588
  • Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631
  • Fraley C, Raftery AE (2006) MCLUST version 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504, Department of Statistics, University of Washington, Seattle
  • Hennig C (2010) Methods for merging Gaussian mixture components. Adv Data Anal Classif 4(1):3–34
  • Hershey JR, Olsen PA (2007) Approximating the Kullback–Leibler divergence between Gaussian mixture models. In: IEEE international conference on acoustics, speech and signal processing (ICASSP '07), pp IV-317–IV-320
  • Hoeting JA, Madigan DM, Raftery AE, Volinsky CT (1999) Bayesian model averaging: a tutorial. Stat Sci 14:382–417 (with discussion)
  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
  • Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised, and semi-supervised classification mixmod library. J Stat Softw 67(6). doi:10.18637/jss.v067.i06
  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp 1:281–297
  • Madigan D, Raftery AE (1994) Model selection and accounting for model uncertainty in graphical models using Occam's window. J Am Stat Assoc 89:1535–1546
  • Maitra R (2009) Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinf 6:144–157
  • Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19:354–376
  • McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York
  • Melnykov V (2013) Challenges in model-based clustering. WIREs Comput Stat 5:135–148
  • Melnykov V (2016) Merging mixture components for clustering through pairwise overlap. J Comput Graph Stat 25:66–90
  • Melnykov V, Chen W-C, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25
  • Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395
  • Melnykov V, Melnykov I, Michael S (2015a) Semi-supervised model-based clustering with positive and negative constraints. Adv Data Anal Classif, pp 1–23
  • Melnykov V, Michael S, Melnykov I (2015b) Recent developments in model-based clustering with applications. In: Celebi ME (ed) Partitional clustering algorithms, vol 1. Springer, Berlin, pp 1–39
  • Prates M, Lachos V, Cabral C (2013) mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions. J Stat Softw 54(12):1–20
  • Sneath P (1957) The application of computers to taxonomy. J Gen Microbiol 17:201–226
  • Sorensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter 5:1–34
  • Stahl D, Sallis H (2012) Model-based cluster analysis. WIREs Comput Stat 4:341–358
  • Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244


Author information


Corresponding author

Correspondence to Semhar Michael.

Appendix

1.1 Assessment of the choice of weights

Fig. 10

Results of the simulation study comparing the weights used in the emaEM algorithm. Gaussian mixtures are simulated from different (\(K, p\)) combinations. The y-axis shows the mean and standard deviation of the adjusted Rand index calculated from 100 replications of each case. The solid line corresponds to equal weights and the dashed line to BIC-based weights

As shown in Sect. 2, weights must be assigned to the different models when aggregating. The weights can either be based on BIC, or all models can be assigned equal weights. In this section, we conduct a small simulation study to see which choice performs better. We considered Gaussian mixtures with \(K = 3,15,30\) components and \(p = 2,6\) dimensions. All mixtures were simulated using the MixSim package. From each mixture, we simulated 100 datasets of size \(n = 50 \times K\) and ran the emaEM initialization algorithm on each dataset with both weighting schemes (BIC-based and equal). In all experiments, the short EM is run 100 times. Although BIC-based weights have shown good performance in a variety of settings (Hoeting et al. 1999), in this case assigning equal weights to all models gives better output. This occurs because, when BIC-based weights are used, almost all the weight is assigned to the model with the lowest BIC and all other models receive nearly zero weight, so the procedure essentially reduces to the emEM algorithm. Assigning equal weights, on the other hand, gathers information from all models, so each pair of points contributes information from every model. Figure 10 shows results for each mixture combination considered; we report the mean and standard deviation of \(\mathcal A\mathcal R\) for equal weights as well as for BIC-based weights. As can be seen, better performance is observed with equal weights. Hence, equal weights were used for all simulation experiments and real data analyses.
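To make the comparison concrete, the following base-R sketch contrasts the two schemes. It is only an illustration: the helper names and the BIC values are ours, and we assume the BIC-based weights take the usual Bayesian model averaging form \(w_m \propto \exp\{-\tfrac{1}{2}(\mathrm{BIC}_m - \min_m \mathrm{BIC}_m)\}\) (Hoeting et al. 1999); the actual weights used by emaEM are defined in Sect. 2.

```r
# Illustrative sketch (not the emaEM implementation): equal vs. BIC-based weights.

# Assumed BMA-style weights from BIC values of candidate models (lower BIC is better):
# w_m proportional to exp(-0.5 * (BIC_m - min BIC)).
bic_weights <- function(bic) {
  w <- exp(-0.5 * (bic - min(bic)))
  w / sum(w)
}

# Equal weights give every candidate model the same influence.
equal_weights <- function(M) rep(1 / M, M)

# Made-up BIC values for four fitted candidate models.
bic <- c(1520.4, 1541.3, 1548.9, 1602.7)

round(bic_weights(bic), 4)   # essentially all weight on the lowest-BIC model
equal_weights(length(bic))   # 0.25 for each model
```

With BIC differences of this magnitude, the BIC-based weights are numerically indistinguishable from selecting the single best model, which is why they reduce the procedure to emEM, whereas equal weights let every candidate solution contribute to the aggregation.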

1.2 Performance assessment for unequal cluster sizes

Fig. 11

Results of the simulation study for unequal cluster sizes. The x-axis represents the competing initialization methods and the y-axis the mean adjusted Rand index achieved by each method. Each sub-plot corresponds to a different (\(K,p,\check{\omega },\tau _{(1)}\)) combination

In this section, we assess the performance of the emaEM initialization method when groups have unequal sizes. This interesting point was raised by one of the anonymous reviewers: whether or not the new initialization methodology (emaEM) is affected by unequal cluster sizes. To check this, we simulated data from Gaussian mixtures in which the smallest cluster has a fixed size, specified through the PiLow option of the function MixSim in the MixSim package. As in the simulation studies presented earlier, the Gaussian mixtures varied in the number of components (\(K \in \{3,15,30\}\)), the number of dimensions (\(p\in \{2,6\}\)), and the maximum overlap between components (\(\check{\omega }\in \{0, 0.05, 0.1\}\)). The smallest cluster is specified to contain a certain proportion of the total observations; to be consistent with our notation, we use \(\tau _{(1)}\) to denote the mixing proportion of the smallest cluster. In our simulations, we used different values of \(\tau _{(1)}\) to see whether there is a clear trend in performance as the size of the smallest cluster increases. The values considered are \(\tau _{(1)} \in \{0.1,0.15, 0.2\}\) for mixtures with \(K = 3\), \(\tau _{(1)} \in \{0.02,0.03, 0.04\}\) for mixtures with \(K = 15\), and \(\tau _{(1)} \in \{0.01,0.015, 0.02\}\) for mixtures with \(K = 30\). Note that clusters of approximately equal size would have \(\tau _{(k)}\simeq 1/K\) for a given value of K, i.e., \(\tau _{(k)} \simeq 0.33, 0.067, 0.033\) for \(K = 3,15,30\), respectively. We simulated 100 mixtures from each combination of (\(K, p, \check{\omega }, \tau _{(1)}\)) and one dataset of size \(n=100 \times K\) from each mixture. The EM algorithm was then run with each of the competing initialization methods M-EM, RndEM, emEM, SEM, and emaEM, and adjusted Rand index values were calculated for each method on all 100 datasets in each combination. Mean adjusted Rand index values for each combination (\(K,p,\check{\omega },\tau _{(1)}\)) are given in Fig. 11, and mean ranks (\({\bar{\mathcal R}}\)) based on the adjusted Rand index are presented in Fig. 12. Overall, the results are consistent with those found previously with approximately equal clusters. It can be noted that emaEM performed slightly worse than the other competing methods when \(K = 3\) and \(\tau _{(1)}\) is small, as shown in the first row of Figs. 11 and 12. However, when the number of clusters increases, emaEM is superior to the other methods. To make emaEM superior under all scenarios, the algorithm can be modified slightly to select, based on likelihood values, the better of the aggregated result and the best of the random restarts. This guarantees that a solution at least as good as the emEM solution is found.
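For concreteness, the R sketch below shows how one replicate of this design can be generated with the MixSim package, fixing the mixing proportion of the smallest cluster through PiLow. The parameter values are illustrative only, and the final clustering step uses a generic k-means call as a stand-in for the EM runs with the competing initializations.

```r
library(MixSim)
set.seed(2016)

K <- 3; p <- 2           # number of components and dimensions
omega_max <- 0.05        # maximum pairwise overlap between components
tau1 <- 0.1              # mixing proportion of the smallest cluster

# Generate mixture parameters with the smallest mixing proportion fixed by PiLow.
Q <- MixSim(MaxOmega = omega_max, K = K, p = p, PiLow = tau1)

# Simulate one dataset of size n = 100 * K from this mixture.
A <- simdataset(n = 100 * K, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)

# The EM algorithm with each competing initialization would be run on A$X here;
# as a placeholder, evaluate a k-means partition against the true labels
# using the adjusted Rand index.
est <- kmeans(A$X, centers = K, nstart = 10)$cluster
RandIndex(est, A$id)$AR
```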

Fig. 12

Results of the simulation study for unequal cluster sizes. The x-axis represents the competing initialization methods and the y-axis the mean rank achieved by each method. Each sub-plot is described as in Fig. 11


Cite this article

Michael, S., Melnykov, V. An effective strategy for initializing the EM algorithm in finite mixture models. Adv Data Anal Classif 10, 563–583 (2016). https://doi.org/10.1007/s11634-016-0264-8

