Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions
Introduction
The use of finite mixture model fitting as a systematic and flexible approach to the clustering of multivariate data has become common practice in statistics [1], as well as in a large number of pattern recognition and analysis applications [2]. Many authors use the expectation–maximization (EM) algorithm [3], [4] to maximize the model likelihood, with most work in the field focusing on mixtures of multivariate gaussians. However, it is widely acknowledged that this framework suffers from three significant difficulties, which are shared by most other traditional algorithms for the fuzzy partitioning of data. First, EM is a local maximum seeker; since the mixture likelihood is riddled with local maxima, its result is highly sensitive to initialization. Second, determining the optimal number of components in the mixture remains difficult, typically requiring multiple runs of the algorithm with different numbers of components. Third, the use of gaussians, often a poor model of variability in real data sets, leads to a lack of robustness with respect to outliers.
In this paper we introduce a new method for robust model-based clustering, with convergence properties that significantly surpass those of the ordinary EM algorithm. Our strategy combines two recent approaches to achieving improved convergence within the EM framework, deterministic annealing [5] and competitive agglomeration [6], by running a deterministic annealing EM algorithm in agglomeration mode. Robustness is enhanced by moving from mixture components that are multivariate gaussians to multivariate t-distributions with their wider tails, a standard approach in probabilistic modeling [7], [8]. As a direct result of using t-distributions, an additional set of weights that automatically down-weights atypical data points is introduced, thus achieving the desired robustness.
The rest of the paper is organized as follows: first, we review related work in the field. In Section 2 we present the mixture of multivariate t-distributions model, review the derivation of the EM algorithm for this model, and develop two versions of the DAEM algorithm for it. In Section 3 we present significant problems encountered by these algorithms when applied as prescribed by previous studies of DAEM, and introduce a simple yet powerful modification of the annealing schedule, termed DAGEM, which overcomes these problems. In Section 4 we test the algorithms on two additional benchmark problems. We conclude in Section 5 with a general discussion of several aspects of the new algorithms, as well as related issues left open.
A number of recent studies have attempted to resolve one or more of the problems with the gaussian mixture EM framework listed above.
Deterministic annealing, split and merge operations and competitive agglomeration are methods that explicitly attempt to overcome the initialization sensitivity problem. All three methods essentially add an extra schedule on top of the expectation and maximization iterations of the EM algorithm.
Deterministic annealing EM (DAEM) was introduced by Ueda and Nakano [5], and a modified version (termed REM 2) was independently developed by Sahani [9]. Applications of deterministic annealing are reviewed in [10]. The standard EM algorithm [3], [4] can be viewed [11] as alternating maximization of the function
$$F(\tilde{P},\theta)=E_{\tilde{P}}\left[\log p(\mathbf{x},\mathbf{z}\mid\theta)\right]+H(\tilde{P}),$$
first with respect to the distribution $\tilde{P}$ of the missing data z (E step), and then with respect to the model's parameters θ (M step). Similarly, DAEM is based on alternating maximization of a more general 'negative free energy'
$$F_{\beta}(\tilde{P},\theta)=E_{\tilde{P}}\left[\log p(\mathbf{x},\mathbf{z}\mid\theta)\right]+\frac{1}{\beta}H(\tilde{P}),$$
where the parameter β has the intuitive interpretation of an inverse temperature; for β=1 the DAEM iterations are equivalent to those of EM. β undergoes an annealing schedule, changing from an extremely small value ('infinite temperature') to 1, and at each intermediate value the modified E and M steps are repeated to convergence. At lower values of β the maximization problem is increasingly convex (in the limit β→0 only the concave entropy term remains), and thus less riddled with local maxima. The motivation is therefore an attempt to track the global maximum through progressively less convex problems, an idea related to homotopy continuation methods of optimization [12]. When applied to mixture models, the algorithm undergoes a series of phase transitions with decreasing 'temperature', in which the model size is increased (starting from a single component at low β). Sahani [9] notes that the DAEM algorithm undergoes problematic jumps in the free energy when a mixture component is split; to correct this, he introduces a slightly modified 'negative free energy'. In general, the DAEM framework is not guaranteed to track the global maximum; in particular, for mixture models the global maximum does not move smoothly as the temperature varies, which is a condition for guaranteed tracking. Simulation experiments in [5], [9] have shown, however, that DAEM and REM 2 for mixtures of gaussians have significantly better convergence properties than the standard EM.
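To make the tempered iteration concrete, the following sketch shows the β-weighted E-step for a gaussian mixture, the only step that differs from ordinary EM at a fixed β. The function names and the schedule in the closing comment are illustrative assumptions, not code from Refs. [5] or [9].

```python
# Minimal sketch of the DAEM (tempered) E-step for a gaussian mixture.
import numpy as np
from scipy.stats import multivariate_normal

def daem_e_step(X, pis, mus, covs, beta):
    """Responsibilities proportional to (pi_j * f_j(x_i))**beta."""
    n, g = X.shape[0], len(pis)
    log_r = np.empty((n, g))
    for j in range(g):
        log_r[:, j] = np.log(pis[j]) + multivariate_normal.logpdf(
            X, mean=mus[j], cov=covs[j])
    log_r *= beta                               # raise joint densities to power beta
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

# Schedule: at each beta in e.g. 0.05, 0.1, ..., 1.0, iterate this E-step
# with the usual M-step to convergence; beta = 1 recovers standard EM.
```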
Ueda et al. [13] note that EM and DAEM for mixtures have difficulty avoiding local maxima associated with situations where separated regions of space contain too few or too many components. To overcome this problem they suggest allowing some mixture components to merge while others split (the total number of components is conserved), a process they call split and merge EM (SMEM). Numerical results on the convergence capabilities of this algorithm are very encouraging. Competitive agglomeration starts with a highly over-specified number of components and adds a penalty term to the log-likelihood that increasingly favors agglomeration of all the data points into fewer, larger components. Marginally small components are discarded, and following the agglomeration phase the weight of the penalty term is progressively reduced. In different publications by Frigui et al. [6], [14], [15], penalty terms such as $\alpha\sum_{j=1}^{g}\pi_j^2$ are used, with α a parameter that undergoes an annealing schedule (an increase from zero to a finite value, followed by a gradual decrease back to zero). A related Bayesian approach, with a penalty term involving N_C, the number of parameters per mixture component, was introduced in Ref. [16]. In competitive agglomeration the update of the mixing proportions πj during the M step is modified, whereas in DAEM-type algorithms it is the E step that is modified.
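For illustration only, the discarding step at the heart of agglomeration might look as follows; the threshold and the surrounding α schedule described in the comments are our assumptions, not the exact procedure of Refs. [6], [14], [15].

```python
# Illustrative sketch: drop components whose mixing proportion has become
# marginal, then renormalize the survivors.
import numpy as np

def prune_components(pis, mus, covs, min_pi=0.01):
    keep = pis > min_pi                    # survivors of the competition
    pis, mus, covs = pis[keep], mus[keep], covs[keep]
    return pis / pis.sum(), mus, covs      # renormalize mixing proportions

# Outer loop (sketch): anneal alpha up from zero, apply the penalized
# pi-update plus prune_components at each M step, then anneal alpha down.
```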
In addition to these methods, other authors have introduced adaptations of methods of global optimization to the EM framework, e.g. genetic optimization [17].
Determination of the optimal number of components in a mixture model traditionally depends on finding the ML model for each relevant number of components, and then picking the one with maximal penalized log-likelihood according to a selected penalty criterion, such as AIC, BIC, MDL or other criteria [18], [19].
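A minimal sketch of this classical procedure, using the BIC penalty (AIC and MDL differ only in the penalty term); `fit_mixture` is a hypothetical helper that runs EM to convergence for a given number of components and returns the maximized log-likelihood and the number of free parameters.

```python
# Classical model-size selection by penalized log-likelihood (BIC).
import numpy as np

def select_model_size(X, g_max, fit_mixture):
    n = X.shape[0]
    best_g, best_bic = None, np.inf
    for g in range(1, g_max + 1):
        loglik, n_params = fit_mixture(X, g)         # run EM to convergence
        bic = -2.0 * loglik + n_params * np.log(n)   # smaller is better
        if bic < best_bic:
            best_g, best_bic = g, bic
    return best_g
```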
Several authors have attempted to make the process of moving between different model sizes more efficient. Competitive agglomeration automatically goes through a monotonically decreasing number of mixture components; however, it does not include a natural stopping criterion related to penalized log-likelihood, relying instead on a predetermined annealing schedule. In contrast, the agglomerative EM-based algorithm of Banfield and Raftery [20] and Fraley [21], which also goes through a monotonically decreasing number of components (selecting at each model size the optimal components to be merged), allows for a penalized log-likelihood comparison between different model sizes. An agglomerative EM algorithm with a Bayesian penalty term was proposed in Ref. [19], and other Bayesian approaches based on trimming unimportant components were described in [16], [22].
DAEM and REM 2 go through a monotonically increasing number of components, changing during phase transitions. To determine which size is optimal, one would have to anneal models of all sizes, each to β=1. With REM 2, Sahani [9] uses a monotonicity argument to efficiently purge many of the possibilities while annealing, a process he terms ‘Cascading Model Selection’.
One can roughly separate the attempts at enhancing the robustness of the EM gaussian mixture decomposition algorithm into parametric and other methods. A review of many robust clustering methods appears in Ref. [23].
Parametric methods involve choosing a parametric model to replace the mixture of gaussians, generally one in which one or more of the mixture's components can 'explain' outliers. Perhaps the most popular approach is adding an extra component with uniform density [9], [20], diffusely covering the entire measurement space; the mixing proportion of the uniform component becomes an additional mixture parameter. A second approach is to choose mixture components with wider tails than the gaussian distribution. A standard choice in statistics is the multivariate t distribution [7], [24]. An EM algorithm for mixtures of t-distributions was introduced recently in Ref. [8]. This algorithm introduces an additional set of weights, estimated during the E step, which essentially perform a soft rejection of outliers. A significant advantage of using the t distribution is the ability to tune the model's robustness to a particular application, or even a particular data set, through the degrees of freedom parameter.
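As a sketch of this down-weighting mechanism (following the published form of the t-mixture E-step weights, u_i = (ν + p)/(ν + δ_i), with δ_i the squared Mahalanobis distance; variable names are ours):

```python
# Soft outlier rejection in the t-mixture E-step: distant points get u << 1
# and contribute little to the M-step parameter updates.
import numpy as np

def t_weights(X, mu, cov, nu):
    p = X.shape[1]
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return (nu + p) / (nu + delta)
```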
Semi-parametric approaches, such as the use of various M-estimators, borrow from Huber's work on robust statistics [25]. In particular, Huber's ψ-function [26], [27] corresponds to a hybrid of a gaussian distribution with laplacian tails. Another approach, introduced recently, uses least trimmed squares estimators [15].
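For reference (a standard fact, not restated from Refs. [26], [27]), Huber's ψ-function with tuning constant k, and the ρ-function whose derivative it is, are
$$\psi_k(x)=\begin{cases}x, & |x|\le k,\\ k\,\operatorname{sign}(x), & |x|>k,\end{cases}\qquad \rho_k(x)=\begin{cases}x^2/2, & |x|\le k,\\ k|x|-k^2/2, & |x|>k,\end{cases}$$
so the implied density, proportional to $e^{-\rho_k(x)}$, has a gaussian core and laplacian (exponential) tails, which is the hybrid referred to above.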
The multivariate t mixture model
The components in our mixture are multivariate t-distributions, each parameterized by a mean μj, a covariance matrix Σj and a 'degrees of freedom' (DOF) parameter ν (the same for all components). Effectively, ν parameterizes the 'robustness' of the distribution, that is, how wide its tails are. The case ν→∞ corresponds to a gaussian distribution, and for ν=1 we obtain the wide-tailed multivariate Cauchy distribution (the covariance is infinite for ν⩽2). For p-dimensional data vectors x, the component density is
$$f(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma},\nu)=\frac{\Gamma\!\left(\frac{\nu+p}{2}\right)|\boldsymbol{\Sigma}|^{-1/2}}{\Gamma\!\left(\frac{\nu}{2}\right)(\nu\pi)^{p/2}}\left[1+\frac{(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{\nu}\right]^{-(\nu+p)/2}.$$
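A direct transcription of this density into a numerically stable log-density evaluator might look as follows (a sketch, not the paper's code).

```python
# Log-density of the multivariate t distribution, evaluated row-wise on X.
import numpy as np
from scipy.special import gammaln

def mvt_logpdf(X, mu, cov, nu):
    p = X.shape[1]
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return (gammaln(0.5 * (nu + p)) - gammaln(0.5 * nu)
            - 0.5 * (p * np.log(nu * np.pi) + logdet)
            - 0.5 * (nu + p) * np.log1p(delta / nu))
```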
Classical approach
In previous publications involving DAEM, the annealing parameter β was varied from an extremely small value (high 'temperature'), where the mixture collapses to a single component, to β=1. To evaluate the performance of our algorithms under such an annealing schedule, we performed the following experiments: mixtures of four t-distributed two-dimensional components with unit covariance were generated, with centers picked from a uniform distribution in the [−5,5]×[−5,5] square. A fifth, …
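For concreteness, the synthetic setup just described can be sketched as follows; the seed, per-component sample size and the value of ν are illustrative assumptions, and the fifth (contaminating) component, cut off in the excerpt above, is omitted.

```python
# Four 2-D t-distributed components with unit covariance, centers uniform
# in [-5, 5] x [-5, 5].
import numpy as np

rng = np.random.default_rng(0)
g, p, n_per, nu = 4, 2, 250, 3.0
centers = rng.uniform(-5, 5, size=(g, p))

def sample_t(center, n):
    # t-variate with identity scale: gaussian draw scaled by sqrt(nu / chi2_nu)
    scale = np.sqrt(nu / rng.chisquare(nu, size=(n, 1)))
    return center + rng.standard_normal((n, p)) * scale

X = np.vstack([sample_t(c, n_per) for c in centers])
```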
Unequal covariance, uniform noise, 2-D
Fig. 2 illustrates "snapshots" of our second algorithm at several different 'temperatures', for a contaminated mixture of gaussians with unequal covariances. Uniform noise was added in the ±20 rectangle, at a 20% level. The mixture was initialized using a 15-component FCM. Of the mixture components, the two on the left are extremely small, containing only 3% of the points each. During the cooling phase the estimated ν, as well as the estimated covariance of the small components, increase …
Discussion
In this paper we develop robust and initialization-insensitive algorithms by creating deterministic annealing versions of the EM algorithm for mixtures of multivariate t-distributions. Creating a DAEM framework for this model proved not to be a straightforward extension of DAEM for mixtures of gaussians. As the problem contains two distinct sets of augmented data (binary memberships zij and typicality weights uij), two different algorithms are possible, one where both are treated equally …
Acknowledgements
I wish to thank Professors R.A. Normann, M. Figueiredo, R.D. Nowak and S.S. Nagarajan for valuable input in preparing this manuscript. The work was supported by a State of Utah Center of Excellence contract #95-3365.
About the Author—SHY SHOHAM received his B.Sc. in Physics in 1993 from Tel Aviv University, Israel. Since 1996 he has been pursuing a Ph.D. in Bioengineering at the University of Utah. His work at the Center for Neural Interfaces involves developing robust and efficient clustering and estimation algorithms for an implantable Brain–Computer interface based on multi-electrode arrays. His general interests include Applied Neurophysiology, Brain Imaging, Computational Neuroscience and Statistical Signal Processing.
References (33)
- N. Ueda, R. Nakano, Deterministic annealing EM algorithm, Neural Networks (1998).
- H. Frigui, R. Krishnapuram, Clustering by competitive agglomeration, Pattern Recognition (1997).
- C. Liu, ML estimation of the multivariate t distribution and the EM algorithm, J. Multivar. Anal. (1997).
- G.J. McLachlan, D. Peel, Finite Mixture Models, Wiley (2000).
- A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. (2000).
- A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. B (1977).
- G.J. McLachlan, T. Krishnan, The EM Algorithm and Extensions, Wiley (1997).
- S. Medasani, R. Krishnapuram, Determination of the number of components in Gaussian mixtures using agglomerative...
- K.L. Lange, R.J.A. Little, J.M.G. Taylor, Robust statistical modeling using the t distribution, J. Amer. Stat. Assoc. (1989).
- D. Peel, G.J. McLachlan, Robust mixture modelling using the t distribution, Stat. Comput. (2000).
- K. Rose, Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, Proc. IEEE (1998).
- N. Ueda, R. Nakano, Z. Ghahramani, G.E. Hinton, SMEM algorithm for mixture models, Neural Comput. (2000).