Abstract
Finite mixture models are among the most popular tools for modeling heterogeneous data. Parameters are traditionally estimated by maximizing the likelihood function, but direct optimization is often troublesome due to the complex likelihood structure. The expectation–maximization (EM) algorithm is an effective remedy, although the solution it produces is entirely driven by the choice of starting parameter values, which makes an effective initialization strategy essential. Despite the efforts undertaken in this area, no uniform winner has emerged, and practitioners tend to ignore the issue, often obtaining misleading or erroneous results. In this paper, we propose a simple yet effective tool for initializing the expectation–maximization algorithm in the mixture modeling setting. The idea is based on model averaging and proves efficient in detecting correct solutions even in cases where competitors perform poorly. The utility of the proposed methodology is demonstrated through a comprehensive simulation study and an application to a well-known classification dataset, with good results.
References
Azzalini A, Valle DA (1996) The multivariate skew-normal distribution. Biometrika 83:715–726
Baudry J-P, Raftery A, Celeux G, Lo K, Gottardo R (2010) Combining mixture components for clustering. J Comput Graph Stat 19:332–353
Biernacki C (2004) Initializing EM using the properties of its trajectories in Gaussian mixtures. Stat Comput 14:267–279
Biernacki C, Celeux G, Govaert G (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 41:561–575
Bouveyron C, Brunet C (2013) Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal 71:52–78
Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Aust J Zool 22:417–425
Celebi ME, Kingravi HA, Vela PA (2012) A comparative study of efficient initialization methods for the \(k\)-means clustering algorithm. Comput Res Reposit. arXiv:1209.1960
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat 2:73–82
Chen WC, Maitra R (2015) EMCluster: EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution, R Package. http://cran.r-project.org/package=EMCluster
Dias J, Wedel M (2004) An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Stat Comput 14:323–332
Forgy E (1965) Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21:768–780
Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Comput 20:270–281
Fraley C, Raftery AE (1998) How many clusters? Which cluster method? Answers via model-based cluster analysis. Comput J 41:578–588
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631
Fraley C, Raftery AE (2006) MCLUST version 3 for R: normal mixture modeling and model-based clustering. Tech. Rep. 504. University of Washington, Department of Statistics, Seattle
Hennig C (2010) Methods for merging Gaussian mixture components. Adv Data Anal Class 4(1):3–34
Hershey JR, Olsen PA (2007) Approximating the Kullback–Leibler divergence between Gaussian mixture models. In: IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07, pp IV-317–IV-320
Hoeting JA, Madigan DM, Raftery AE, Volinsky CT (1999) Bayesian model averaging: a tutorial. Stat Sci 14:382–417 (with discussion)
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data. Wiley, New York
Lebret R, Iovleff S, Langrognet F, Biernacki C, Celeux G, Govaert G (2015) Rmixmod: the R package of the model-based unsupervised, supervised, and semi-supervised classification mixmod library. J Stat Softw 67(6). doi:10.18637/jss.v067.i06
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp 1:281–297
Madigan D, Raftery AE (1994) Model selection and accounting for model uncertainty in graphical models using Occam's window. J Am Stat Assoc 89:1535–1546
Maitra R (2009) Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinf 6:144–157
Maitra R, Melnykov V (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat 19:354–376
McLachlan G, Peel D (2000) Finite Mixture Models. Wiley, New York
Melnykov V (2013) Challenges in model-based clustering. WIREs Comput Stat 5:135–148
Melnykov V (2016) Merging mixture components for clustering through pairwise overlap. J Comput Graph Stat 25:66–90
Melnykov V, Chen W-C, Maitra R (2012) MixSim: R package for simulating datasets with pre-specified clustering complexity. J Stat Softw 51:1–25
Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56:1381–1395
Melnykov V, Melnykov I, Michael S (2015a) Semi-supervised model-based clustering with positive and negative constraints. In: Advances in data analysis and classification, pp 1–23
Melnykov V, Michael S, Melnykov I (2015b) Recent developments in model-based clustering with applications. In: Celebi ME (ed) Partitional clustering algorithms, vol 1. Springer, Berlin, pp 1–39
Prates M, Lachos V, Cabral C (2013) Mixsmsn: fitting finite mixture of scale mixture of skew-normal distributions. J Stat Softw 54(12):1–20
Sneath P (1957) The application of computers to taxonomy. J Gen Microbiol 17:201–226
Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter 5:1–34
Stahl D, Sallis H (2012) Model-based cluster analysis. Wiley Interdiscipl Rev Comput Stat 4:341–358
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244
Appendix
1.1 Assessment of the choice of weights
As shown in Sect. 2, weights must be assigned to the individual models being aggregated. The weights can either be based on BIC or set equal across all models. In this section, we conduct a small simulation study to determine which choice performs better. We considered Gaussian mixtures with \(K = 3, 15, 30\) components and \(p = 2, 6\) dimensions. All mixtures were simulated using the MixSim package. From each mixture we simulated 100 datasets of size \(n = 50 \times K\) and ran the emaEM initialization algorithm on each dataset with both weighting schemes (BIC-based and equal). In all experiments, short EM was run 100 times. Although BIC-based weights have shown good performance in a variety of settings (Hoeting et al. 1999), in this case assigning equal weights to all models gives better output. The reason is that with BIC-based weights almost all the weight is assigned to the model with the lowest BIC, while the remaining models receive nearly zero weight, so the procedure essentially reduces to the emEM algorithm. Assigning equal weights, on the other hand, gathers information from all models, so each pair of points receives information from every model. Figure 10 shows the results for each mixture combination considered. We report the mean and standard deviation of \(\mathcal A\mathcal R\) for equal weights as well as for BIC-based weights. As can be seen, better performance is observed with equal weights. Hence, equal weights were used for all experiments and real data analyses.
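The collapse of BIC-based weights onto a single model can be illustrated numerically. In the standard model-averaging construction, model \(m\) receives weight proportional to \(\exp(-\text{BIC}_m/2)\); the sketch below (in Python rather than the R tooling used in the paper, with hypothetical BIC values) shows how even a modest spread of BIC values concentrates nearly all weight on the best model:

```python
import math

def bic_weights(bic_values):
    """Model-averaging weights proportional to exp(-BIC_m / 2),
    normalized to sum to one. Subtracting the minimum BIC first
    keeps the exponentials numerically stable."""
    best = min(bic_values)
    raw = [math.exp(-(b - best) / 2.0) for b in bic_values]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical BIC values from five short-EM solutions. A spread of
# only 15 units already pushes most of the weight onto the best model,
# so the averaging degenerates toward a single-model (emEM-like) choice.
bics = [1000.0, 1004.0, 1007.0, 1010.0, 1015.0]
weights = bic_weights(bics)
print([round(w, 4) for w in weights])
```

With equal weights, by contrast, every model contributes \(1/M\) regardless of its BIC, which is what lets the aggregation pool pairwise information across all solutions.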
1.2 Performance assessment for unequal cluster sizes
In this section, we assess the performance of the emaEM initialization method when groups have unequal sizes. This interesting point was raised by one of the anonymous reviewers: whether or not the new initialization methodology (emaEM) is affected by unequal cluster sizes. To check this, we simulated data from Gaussian mixtures in which the smallest cluster has a fixed size, specified through the PiLow option of the function MixSim in the MixSim package. As in the previous simulation studies, the Gaussian mixtures varied in the number of components (\(K \in \{3,15,30\}\)), dimensions (\(p\in \{2,6\}\)), and maximum overlap between components (\(\check{\omega }\in \{0, 0.05, 0.1\}\)). The smallest cluster is specified to contain a given proportion of the total observations; to be consistent with our notation, we use \(\tau _{(1)}\) to denote the mixing proportion of the smallest cluster. We used different values of \(\tau _{(1)}\) to see whether a clear trend in performance emerges as the size of the smallest cluster grows: \(\tau _{(1)} \in \{0.1, 0.15, 0.2\}\) for mixtures with \(K = 3\), \(\tau _{(1)} \in \{0.02, 0.03, 0.04\}\) for \(K = 15\), and \(\tau _{(1)} \in \{0.01, 0.015, 0.02\}\) for \(K = 30\). Note that clusters of similar sizes would have \(\tau _{(k)}\simeq 1/K\), i.e., \(\tau _{(k)} \simeq 0.33, 0.067, 0.033\) for \(K = 3, 15, 30\), respectively. We simulated 100 mixtures for each combination \((K, p, \check{\omega }, \tau _{(1)})\) and one dataset of size \(n=100 \times K\) from each mixture. The EM algorithm was run with each of the competing initializations M-EM, RndEM, emEM, SEM, and emaEM, and adjusted Rand index values were calculated for all 100 datasets in each combination.
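The adjusted Rand index of Hubert and Arabie (1985) used in this assessment can be computed directly from the contingency table of the two partitions. The following is a minimal self-contained sketch (the function name and the degenerate-case convention are ours, not from the paper):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index (Hubert and Arabie 1985) between two
    partitions of the same n observations, computed from the
    contingency table of joint label counts."""
    n = len(labels_true)
    cell = Counter(zip(labels_true, labels_pred))  # contingency cells n_ij
    a = Counter(labels_true)                       # row sums a_i
    b = Counter(labels_pred)                       # column sums b_j
    sum_ij = sum(comb(c, 2) for c in cell.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:   # degenerate partitions; 1.0 by convention
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Label names do not matter: a relabeled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index equals 1 for identical partitions and has expected value 0 under random labeling, which is what makes it a natural yardstick for comparing the recovered clustering with the true one.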
Mean adjusted Rand index values for each combination \((K,p,\check{\omega },\tau _{(1)})\) are given in Fig. 11, and mean ranks (\({\bar{\mathcal R}}\)) based on the adjusted Rand index are presented in Fig. 12. Overall, the results are consistent with those found previously for approximately equal clusters. It can be noted that emaEM performed slightly worse than the competing methods when \(K = 3\) and \(\tau _{(1)}\) is small, as shown in the first row of Figs. 11 and 12. However, when the number of clusters increases, emaEM is superior to the other methods. To make emaEM superior under all scenarios, the algorithm can be modified slightly to pick, based on likelihood values, the better of the aggregated result and the best of the random restarts. This guarantees that a solution at least as good as the emEM solution is found.
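The safeguard just described amounts to a one-line comparison of final log-likelihoods. A minimal sketch (the function and the numeric values are hypothetical; the log-likelihoods are assumed to come from full EM runs started from each candidate initialization):

```python
def safeguarded_choice(loglik_aggregated, logliks_restarts):
    """Return which initialization to keep: the aggregated (emaEM)
    solution, or the best of the short-EM random restarts, judged by
    the final log-likelihood of the full EM run each one seeds."""
    best_restart = max(logliks_restarts)
    return "emaEM" if loglik_aggregated >= best_restart else "emEM"

# The aggregated solution is kept unless some restart reaches a higher
# log-likelihood, so the outcome is never worse than plain emEM.
print(safeguarded_choice(-1520.3, [-1540.9, -1525.1, -1533.4]))
```

Because the emEM solution is itself one of the candidates, this modification can only match or improve on emEM, which is exactly the guarantee stated above.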
Michael, S., Melnykov, V. An effective strategy for initializing the EM algorithm in finite mixture models. Adv Data Anal Classif 10, 563–583 (2016). https://doi.org/10.1007/s11634-016-0264-8