Abstract
Finite mixture models are among the most popular tools for modeling heterogeneous data. Parameters are traditionally estimated by maximizing the likelihood function, but direct optimization is often troublesome due to the complex likelihood structure. The expectation–maximization (EM) algorithm is an effective remedy, although the solution it produces is entirely driven by the choice of starting parameter values, which makes an effective initialization strategy essential. Despite the efforts undertaken in this area, no uniform winner has emerged, and practitioners tend to ignore the issue, often obtaining misleading or erroneous results. In this paper, we propose a simple yet effective tool for initializing the expectation–maximization algorithm in the mixture modeling setting. The idea is based on model averaging and proves efficient in detecting correct solutions even in cases where competitors perform poorly. The utility of the proposed methodology is demonstrated through a comprehensive simulation study and an application to a well-known classification dataset, with good results.
References
Azzalini A, Valle DA (1996) The multivariate skew-normal distribution. Biometrika 83:715–726
Baudry J-P, Raftery A, Celeux G, Lo K, Gottardo R (2010) Combining mixture components for clustering. J Comput Graph Stat 19:332–353
Biernacki C (2004) Initializing EM using the properties of its trajectories in Gaussian mixtures. Stat Comput 14:267–279
Biernacki C, Celeux G, Govaert G (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 41:561–575
Bouveyron C, Brunet C (2013) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52–78
Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Aust J Zool 22:417–425
Celebi ME, Kingravi HA, Vela PA (2012) A comparative study of efficient initialization methods for the \(k\)-means clustering algorithm. Comput Res Reposit. arXiv:1209.1960
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat 2:73–82
Chen WC, Maitra R (2015) EMCluster: EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution, R Package. http://cran.r-project.org/package=EMCluster
Dias J, Wedel M (2004) An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Stat Comput 14:323–332
Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780
Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 20:270–281
Fraley C, Raftery AE (1998) How many clusters? Which cluster method? Answers via model-based cluster analysis. Comput J 41:578–588
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631
Fraley C, Raftery AE (2006) MCLUST version 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504. University of Washington, Department of Statistics, Seattle
Hennig C (2010) Methods for merging Gaussian mixture components. Adv Data Anal Class 4(1):3–34
Hershey JR, Olsen PA (2007) Approximating the Kullback–Leibler divergence between Gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07, pp IV-317–IV-320
Hoeting JA, Madigan DM, Raftery AE, Volinsky CT (1999) Bayesian model averaging: a tutorial. Stat Sci 14:382–417 (with discussion)
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data. Wiley, New York
Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised, and semi-supervised classification mixmod library. J Stat Softw 67(6). doi:10.18637/jss.v067.i06
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp 1:281–297
Madigan D, Raftery AE (1994) Model selection and accounting for model uncertainty in graphical models using Occam's window. J Am Stat Assoc 89:1535–1546
Maitra R (2009) Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinf 6:144–157
Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19:354–376
McLachlan G, Peel D (2000) Finite Mixture Models. Wiley, New York
Melnykov V (2013) Challenges in model-based clustering. WIREs Comput Stat 5:135–148
Melnykov V (2016) Merging mixture components for clustering through pairwise overlap. J Comput Graph Stat 25:66–90
Melnykov V, Chen W-C, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25
Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395
Melnykov V, Melnykov I, Michael S (2015a) Semi-supervised model-based clustering with positive and negative constraints. In: Advances in data analysis and classification, pp 1–23
Melnykov V, Michael S, Melnykov I (2015b) Recent developments in model-based clustering with applications. In: Celebi ME (ed) Partitional clustering algorithms, vol 1. Springer, Berlin, pp 1–39
Prates M, Lachos V, Cabral C (2013) Mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions. J Stat Softw 54(12):1–20
Sneath P (1957) The application of computers to taxonomy. J Gen Microbiol 17:201–226
Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter 5:1–34
Stahl D, Sallis H (2012) Model-based cluster analysis. Wiley Interdiscipl Rev Comput Stat 4:341–358
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244
Appendix
1.1 Assessment of the choice of weights
As shown in Sect. 2, weights must be assigned to the individual models being aggregated. The weights can either be based on BIC or set equal across all models. In this section, we conduct a small simulation study to determine which choice performs better. We considered Gaussian mixtures with \(K = 3, 15, 30\) components and \(p = 2, 6\) dimensions. All mixtures were simulated using the MixSim package. From each mixture we simulated 100 datasets of size \(n = 50 \times K\) and ran the emaEM initialization algorithm on each dataset with both weighting schemes (BIC-based and equal). In all experiments, short EM was run 100 times. Although BIC-based weights have shown good performance in a variety of settings (Hoeting et al. 1999), in this case assigning equal weights to all models gives better output. The reason is that with BIC-based weights almost all the weight is assigned to the model with the lowest BIC, while the remaining models receive nearly zero weight, so the procedure essentially reduces to the emEM algorithm. Assigning equal weights, on the other hand, gathers information from all models, so each pair of points receives information from every model. Figure 10 shows the results for each mixture combination considered. We report the mean and standard deviation of \(\mathcal A\mathcal R\) for equal weights as well as for BIC-based weights. As can be seen, better performance is observed with equal weights. Hence, equal weights were used for all experiments and real data analyses.
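The collapse of BIC-based weights onto a single model can be illustrated numerically. In the standard model-averaging construction, model \(m\) receives weight proportional to \(\exp(-\text{BIC}_m/2)\); the sketch below (in Python rather than the R tooling used in the paper, with hypothetical BIC values) shows how even a modest spread of BIC values concentrates nearly all weight on the best model:

```python
import math

def bic_weights(bic_values):
    """Model-averaging weights proportional to exp(-BIC_m / 2),
    normalized to sum to one. Subtracting the minimum BIC first
    keeps the exponentials numerically stable."""
    best = min(bic_values)
    raw = [math.exp(-(b - best) / 2.0) for b in bic_values]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical BIC values from five short-EM solutions. A spread of
# only 15 units already pushes most of the weight onto the best model,
# so the averaging degenerates toward a single-model (emEM-like) choice.
bics = [1000.0, 1004.0, 1007.0, 1010.0, 1015.0]
weights = bic_weights(bics)
print([round(w, 4) for w in weights])
```

With equal weights, by contrast, every model contributes \(1/M\) regardless of its BIC, which is what lets the aggregation pool pairwise information across all solutions.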
1.2 Performance assessment for unequal cluster sizes
In this section, we assess the performance of the emaEM initialization method when groups have unequal sizes. This interesting point was raised by one of the anonymous reviewers: whether or not the new initialization methodology (emaEM) is affected by unequal cluster sizes. To check this, we simulated data from Gaussian mixtures in which the smallest cluster has a fixed size, specified through the PiLow option of the function MixSim in the MixSim package. As in the previous simulation studies, the Gaussian mixtures varied in the number of components (\(K \in \{3,15,30\}\)), dimensions (\(p\in \{2,6\}\)), and maximum overlap between components (\(\check{\omega }\in \{0, 0.05, 0.1\}\)). The smallest cluster is specified to contain a given proportion of the total observations; to be consistent with our notation, we use \(\tau _{(1)}\) to denote the mixing proportion of the smallest cluster. We used different values of \(\tau _{(1)}\) to see whether a clear trend in performance emerges as the size of the smallest cluster grows: \(\tau _{(1)} \in \{0.1, 0.15, 0.2\}\) for mixtures with \(K = 3\), \(\tau _{(1)} \in \{0.02, 0.03, 0.04\}\) for \(K = 15\), and \(\tau _{(1)} \in \{0.01, 0.015, 0.02\}\) for \(K = 30\). Note that clusters of similar sizes would have \(\tau _{(k)}\simeq 1/K\), i.e., \(\tau _{(k)} \simeq 0.33, 0.067, 0.033\) for \(K = 3, 15, 30\), respectively. We simulated 100 mixtures for each combination \((K, p, \check{\omega }, \tau _{(1)})\) and one dataset of size \(n=100 \times K\) from each mixture. The EM algorithm was run with each of the competing initializations M-EM, RndEM, emEM, SEM, and emaEM, and adjusted Rand index values were calculated for all 100 datasets in each combination.
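The adjusted Rand index of Hubert and Arabie (1985) used in this assessment can be computed directly from the contingency table of the two partitions. The following is a minimal self-contained sketch (the function name and the degenerate-case convention are ours, not from the paper):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index (Hubert and Arabie 1985) between two
    partitions of the same n observations, computed from the
    contingency table of joint label counts."""
    n = len(labels_true)
    cell = Counter(zip(labels_true, labels_pred))  # contingency cells n_ij
    a = Counter(labels_true)                       # row sums a_i
    b = Counter(labels_pred)                       # column sums b_j
    sum_ij = sum(comb(c, 2) for c in cell.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:   # degenerate partitions; 1.0 by convention
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Label names do not matter: a relabeled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index equals 1 for identical partitions and has expected value 0 under random labeling, which is what makes it a natural yardstick for comparing the recovered clustering with the true one.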
Mean adjusted Rand index values for each combination \((K,p,\check{\omega },\tau _{(1)})\) are given in Fig. 11, and mean ranks (\({\bar{\mathcal R}}\)) based on the adjusted Rand index are presented in Fig. 12. Overall, the results are consistent with those found previously for approximately equal clusters. It can be noted that emaEM performed slightly worse than the competing methods when \(K = 3\) and \(\tau _{(1)}\) is small, as shown in the first row of Figs. 11 and 12. However, when the number of clusters increases, emaEM is superior to the other methods. To make emaEM superior under all scenarios, the algorithm can be modified slightly to pick, based on likelihood values, the better of the aggregated result and the best of the random restarts. This guarantees that a solution at least as good as the emEM solution is found.
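The safeguard just described amounts to a one-line comparison of final log-likelihoods. A minimal sketch (the function and the numeric values are hypothetical; the log-likelihoods are assumed to come from full EM runs started from each candidate initialization):

```python
def safeguarded_choice(loglik_aggregated, logliks_restarts):
    """Return which initialization to keep: the aggregated (emaEM)
    solution, or the best of the short-EM random restarts, judged by
    the final log-likelihood of the full EM run each one seeds."""
    best_restart = max(logliks_restarts)
    return "emaEM" if loglik_aggregated >= best_restart else "emEM"

# The aggregated solution is kept unless some restart reaches a higher
# log-likelihood, so the outcome is never worse than plain emEM.
print(safeguarded_choice(-1520.3, [-1540.9, -1525.1, -1533.4]))
```

Because the emEM solution is itself one of the candidates, this modification can only match or improve on emEM, which is exactly the guarantee stated above.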
Michael, S., Melnykov, V. An effective strategy for initializing the EM algorithm in finite mixture models. Adv Data Anal Classif 10, 563–583 (2016). https://doi.org/10.1007/s11634-016-0264-8