Robust clustering by deterministic agglomeration EM of mixtures of multivariate t-distributions
Introduction
The use of finite mixture model fitting as a systematic and flexible approach to the clustering of multivariate data has become common practice in statistics [1], as well as in a large number of pattern recognition and analysis applications [2]. Many authors use the expectation–maximization (EM) algorithm [3], [4] to maximize the model likelihood, with most work in the field focusing on mixtures of multivariate gaussians. However, it is widely acknowledged that this framework suffers from three significant difficulties, which are shared by most other traditional algorithms for the fuzzy partitioning of data. First, EM is a local maximum seeker; since the mixture likelihood is riddled with local maxima, its result is highly sensitive to initialization. Second, determining the optimal number of components in the mixture remains difficult, typically requiring multiple runs of the algorithm with different numbers of components. Third, the use of gaussians, often a poor model of variability in real data sets, leads to a lack of robustness with respect to outliers.
In this paper we introduce a new method for robust model-based clustering, with convergence properties that significantly surpass those of the ordinary EM algorithm. Our strategy combines two recent approaches to achieving improved convergence within the EM framework, deterministic annealing [5] and competitive agglomeration [6], by running a deterministic annealing EM algorithm in agglomeration mode. Robustness is enhanced by moving from mixture components that are multivariate gaussians to multivariate t-distributions with their wider tails, a standard approach in probabilistic modeling [7], [8]. As a direct result of using t-distributions, an additional set of weights that automatically down-weights atypical data points is introduced, thus achieving the desired robustness.
The rest of the paper is organized as follows: first, we review related work in the field. In Section 2 we present the mixture of multivariate t-distributions model, review the derivation of the EM algorithm for this model, and develop two versions of the DAEM algorithm for it. In Section 3 we present significant problems encountered by these algorithms when applied as prescribed by previous studies of DAEM, and introduce a simple yet powerful modification of the annealing schedule, termed DAGEM, which overcomes these problems. In Section 4 we test the algorithms on two additional benchmark problems. We conclude in Section 5 with a general discussion of several aspects of the new algorithms, as well as related issues left open.
A number of recent studies have attempted to resolve one or more of the problems with the gaussian mixture EM framework listed above.
Deterministic annealing, split and merge operations and competitive agglomeration are methods that explicitly attempt to overcome the initialization sensitivity problem. All three methods essentially add an extra schedule on top of the expectation and maximization iterations of the EM algorithm.
Deterministic annealing EM (DAEM) was introduced by Ueda and Nakano [5], and a modified version (termed REM 2) was independently developed by Sahani [9]. Applications of deterministic annealing are reviewed in [10]. The standard EM algorithm [3], [4] can be viewed [11] as alternating maximization of the function
$$F(\tilde{P},\theta)=E_{\tilde{P}}\left[\log p(\mathbf{x},\mathbf{z}\mid\theta)\right]+H(\tilde{P}),$$
first with respect to the distribution $\tilde{P}$ of the missing data z (E step), and then with respect to the model's parameters θ (M step). Similarly, DAEM is based on alternating maximization of a more general 'negative free energy'
$$F_{\beta}(\tilde{P},\theta)=E_{\tilde{P}}\left[\log p(\mathbf{x},\mathbf{z}\mid\theta)\right]+\frac{1}{\beta}H(\tilde{P}),$$
where the parameter β has the intuitive interpretation of an inverse temperature; for β=1 the DAEM iterations are equivalent to those of EM. β undergoes an annealing schedule, changing from an extremely small value ('infinite temperature') to 1, and at each intermediate value the modified E and M steps are repeated to convergence. At lower values of β the maximization problem is increasingly convex (in the limit β→0 only the concave entropy term remains), and thus less riddled with local maxima. The motivation is therefore an attempt to track the global maximum through progressively less convex problems, an idea related to homotopy continuation methods of optimization [12]. When applied to mixture models, the algorithm undergoes a series of phase transitions with decreasing 'temperature', in which the model size is increased (starting from a single component at low β). Sahani [9] notes that the DAEM algorithm undergoes problematic jumps in the free energy when a mixture component is split; to correct this, he introduces a slightly modified 'negative free energy'. In general, the DAEM framework is not guaranteed to track the global maximum; in particular, for mixture models the global maximum does not move smoothly as the temperature varies, which is a condition for guaranteed tracking. Simulation experiments in [5], [9] have shown, however, that DAEM and REM 2 for mixtures of gaussians have significantly better convergence properties than the standard EM.
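To make the tempered iteration concrete, the following sketch shows the β-weighted E-step for a gaussian mixture, the only step that differs from ordinary EM at a fixed β. The function names and the schedule in the closing comment are illustrative assumptions, not code from Refs. [5] or [9].

```python
# Minimal sketch of the DAEM (tempered) E-step for a gaussian mixture.
import numpy as np
from scipy.stats import multivariate_normal

def daem_e_step(X, pis, mus, covs, beta):
    """Responsibilities proportional to (pi_j * f_j(x_i))**beta."""
    n, g = X.shape[0], len(pis)
    log_r = np.empty((n, g))
    for j in range(g):
        log_r[:, j] = np.log(pis[j]) + multivariate_normal.logpdf(
            X, mean=mus[j], cov=covs[j])
    log_r *= beta                               # raise joint densities to power beta
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

# Schedule: at each beta in e.g. 0.05, 0.1, ..., 1.0, iterate this E-step
# with the usual M-step to convergence; beta = 1 recovers standard EM.
```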
Ueda et al. [13] note that EM and DAEM for mixtures have difficulty avoiding local maxima associated with situations where separated regions of space contain too few or too many components. To overcome this problem they suggest allowing some mixture components to merge while others split (the total number of components is conserved), a process they call split and merge EM (SMEM). Numerical results on the convergence capabilities of this algorithm are very encouraging. Competitive agglomeration starts with a highly over-specified number of components and adds a penalty term to the log-likelihood that increasingly favors agglomeration of all the data points into fewer, larger components. Marginally small components are discarded, and following the agglomeration phase the weight of the penalty term is progressively reduced. In different publications by Frigui et al. [6], [14], [15], penalty terms such as $\alpha\sum_{j=1}^{g}\pi_j^2$ are used, with α a parameter that undergoes an annealing schedule (an increase from zero to a finite value, followed by a gradual decrease back to zero). A related Bayesian approach, with a penalty term involving N_C, the number of parameters per mixture component, was introduced in Ref. [16]. In competitive agglomeration the update of the mixing proportions πj during the M step is modified, whereas in DAEM-type algorithms it is the E step that is modified.
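For illustration only, the discarding step at the heart of agglomeration might look as follows; the threshold and the surrounding α schedule described in the comments are our assumptions, not the exact procedure of Refs. [6], [14], [15].

```python
# Illustrative sketch: drop components whose mixing proportion has become
# marginal, then renormalize the survivors.
import numpy as np

def prune_components(pis, mus, covs, min_pi=0.01):
    keep = pis > min_pi                    # survivors of the competition
    pis, mus, covs = pis[keep], mus[keep], covs[keep]
    return pis / pis.sum(), mus, covs      # renormalize mixing proportions

# Outer loop (sketch): anneal alpha up from zero, apply the penalized
# pi-update plus prune_components at each M step, then anneal alpha down.
```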
In addition to these methods, other authors have introduced adaptations of methods of global optimization to the EM framework, e.g. genetic optimization [17].
Determination of the optimal number of components in a mixture model traditionally depends on finding the ML model for each relevant number of components, and then picking the one with maximal penalized log-likelihood according to a selected penalty criterion, such as AIC, BIC, MDL or other criteria [18], [19].
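A minimal sketch of this classical procedure, using the BIC penalty (AIC and MDL differ only in the penalty term); `fit_mixture` is a hypothetical helper that runs EM to convergence for a given number of components and returns the maximized log-likelihood and the number of free parameters.

```python
# Classical model-size selection by penalized log-likelihood (BIC).
import numpy as np

def select_model_size(X, g_max, fit_mixture):
    n = X.shape[0]
    best_g, best_bic = None, np.inf
    for g in range(1, g_max + 1):
        loglik, n_params = fit_mixture(X, g)         # run EM to convergence
        bic = -2.0 * loglik + n_params * np.log(n)   # smaller is better
        if bic < best_bic:
            best_g, best_bic = g, bic
    return best_g
```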
Several authors have attempted to make the process of moving between different model sizes more efficient. Competitive agglomeration automatically goes through a monotonically decreasing number of mixture components; however, it does not include a natural stopping criterion related to penalized log-likelihood, relying instead on a predetermined annealing schedule. In contrast, the agglomerative EM-based algorithm of Banfield and Raftery [20] and Fraley [21], which also goes through a monotonically decreasing number of components (selecting at each model size the optimal components to be merged), allows for a penalized log-likelihood comparison between different model sizes. An agglomerative EM algorithm with a Bayesian penalty term was proposed in Ref. [19], and other Bayesian approaches based on trimming unimportant components were described in [16], [22].
DAEM and REM 2 go through a monotonically increasing number of components, changing during phase transitions. To determine which size is optimal, one would have to anneal models of all sizes, each to β=1. With REM 2, Sahani [9] uses a monotonicity argument to efficiently purge many of the possibilities while annealing, a process he terms ‘Cascading Model Selection’.
One can roughly separate the attempts at enhancing the robustness of the EM gaussian mixture decomposition algorithm into parametric and other methods. A review of many robust clustering methods appears in Ref. [23].
Parametric methods involve choosing a parametric model to replace the mixture of gaussians, generally one in which one or more of the mixture's components can 'explain' outliers. Perhaps the most popular approach is adding an extra component with uniform density [9], [20], diffusely covering the entire measurement space; the mixing proportion of the uniform component becomes an additional mixture parameter. A second approach is to choose mixture components with wider tails than the gaussian distribution. A standard choice in statistics is the multivariate t distribution [7], [24]. An EM algorithm for mixtures of t-distributions was introduced recently in Ref. [8]. This algorithm introduces an additional set of weights, estimated during the E step, which essentially perform a soft rejection of outliers. A significant advantage of using the t distribution is the ability to tune the model's robustness to a particular application, or even a particular data set, through the degrees of freedom parameter.
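As a sketch of this down-weighting mechanism (following the published form of the t-mixture E-step weights, u_i = (ν + p)/(ν + δ_i), with δ_i the squared Mahalanobis distance; variable names are ours):

```python
# Soft outlier rejection in the t-mixture E-step: distant points get u << 1
# and contribute little to the M-step parameter updates.
import numpy as np

def t_weights(X, mu, cov, nu):
    p = X.shape[1]
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return (nu + p) / (nu + delta)
```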
Semi-parametric approaches, such as the use of various M-estimators, borrow from Huber's work on robust statistics [25]. In particular, Huber's ψ-function [26], [27] corresponds to a hybrid of a gaussian distribution with laplacian tails. Another approach, introduced recently, uses least trimmed squares estimators [15].
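For reference (a standard fact, not restated from Refs. [26], [27]), Huber's ψ-function with tuning constant k, and the ρ-function whose derivative it is, are
$$\psi_k(x)=\begin{cases}x, & |x|\le k,\\ k\,\operatorname{sign}(x), & |x|>k,\end{cases}\qquad \rho_k(x)=\begin{cases}x^2/2, & |x|\le k,\\ k|x|-k^2/2, & |x|>k,\end{cases}$$
so the implied density, proportional to $e^{-\rho_k(x)}$, has a gaussian core and laplacian (exponential) tails, which is the hybrid referred to above.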
The multivariate t mixture model
The components in our mixture are multivariate t-distributions, each parameterized by a mean μj, a covariance matrix Σj and a 'degrees of freedom' (DOF) parameter ν (the same for all components). Effectively, ν parameterizes the 'robustness' of the distribution, that is, how wide its tails are. The case ν→∞ corresponds to a gaussian distribution, and for ν=1 we obtain the wide-tailed multivariate Cauchy distribution (the covariance is infinite for ν⩽2). For p-dimensional data vectors x, the component density is
$$f(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma},\nu)=\frac{\Gamma\!\left(\frac{\nu+p}{2}\right)|\boldsymbol{\Sigma}|^{-1/2}}{\Gamma\!\left(\frac{\nu}{2}\right)(\nu\pi)^{p/2}}\left[1+\frac{(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{\nu}\right]^{-(\nu+p)/2}.$$
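A direct transcription of this density into a numerically stable log-density evaluator might look as follows (a sketch, not the paper's code).

```python
# Log-density of the multivariate t distribution, evaluated row-wise on X.
import numpy as np
from scipy.special import gammaln

def mvt_logpdf(X, mu, cov, nu):
    p = X.shape[1]
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return (gammaln(0.5 * (nu + p)) - gammaln(0.5 * nu)
            - 0.5 * (p * np.log(nu * np.pi) + logdet)
            - 0.5 * (nu + p) * np.log1p(delta / nu))
```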
Classical approach
In previous publications involving DAEM, the annealing parameter β was varied from an extremely small value (high 'temperature'), where the mixture collapses to a single component, to β=1. To evaluate the performance of our algorithms under such an annealing schedule, we performed the following experiments: mixtures of four t-distributed two-dimensional components with unit covariance were generated, with centers picked from a uniform distribution in the [−5,5]×[−5,5] square. A fifth, …
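For concreteness, the synthetic setup just described can be sketched as follows; the seed, per-component sample size and the value of ν are illustrative assumptions, and the fifth (contaminating) component, cut off in the excerpt above, is omitted.

```python
# Four 2-D t-distributed components with unit covariance, centers uniform
# in [-5, 5] x [-5, 5].
import numpy as np

rng = np.random.default_rng(0)
g, p, n_per, nu = 4, 2, 250, 3.0
centers = rng.uniform(-5, 5, size=(g, p))

def sample_t(center, n):
    # t-variate with identity scale: gaussian draw scaled by sqrt(nu / chi2_nu)
    scale = np.sqrt(nu / rng.chisquare(nu, size=(n, 1)))
    return center + rng.standard_normal((n, p)) * scale

X = np.vstack([sample_t(c, n_per) for c in centers])
```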
Unequal covariance, uniform noise, 2-D
Fig. 2 illustrates "snapshots" of our second algorithm at several different 'temperatures', for a contaminated mixture of gaussians with unequal covariances. Uniform noise was added in the ±20 rectangle, at a 20% level. The mixture was initialized using a 15-component FCM. Of the mixture components, the two on the left are extremely small, containing only 3% of the points each. During the cooling phase the estimated ν, as well as the estimated covariance of the small components, increase …
Discussion
In this paper we develop robust and initialization-insensitive algorithms by creating deterministic annealing versions of the EM algorithm for mixtures of multivariate t-distributions. Creating a DAEM framework for this model proved not to be a straightforward extension of DAEM for mixtures of gaussians. As the problem contains two distinct sets of augmented data (binary memberships zij and typicality weights uij), two different algorithms are possible, one where both are treated equally …
Acknowledgements
I wish to thank Professors R.A. Normann, M. Figueiredo, R.D. Nowak and S.S. Nagarajan for valuable input in preparing this manuscript. The work was supported by a State of Utah Center of Excellence contract #95-3365.
About the Author—SHY SHOHAM received his B.Sc. in Physics in 1993 from Tel Aviv University, Israel. Since 1996 he has been pursuing a Ph.D. in Bioengineering at the University of Utah. His work at the Center for Neural Interfaces involves developing robust and efficient clustering and estimation algorithms for an implantable Brain–Computer interface based on multi-electrode arrays. His general interests include Applied Neurophysiology, Brain Imaging, Computational Neuroscience and Statistical Signal Processing.
References (33)
- N. Ueda, R. Nakano, Deterministic annealing EM algorithm, Neural Networks (1998).
- H. Frigui, R. Krishnapuram, Clustering by competitive agglomeration, Pattern Recognition (1997).
- C. Liu, ML estimation of the multivariate t distribution and the EM algorithm, J. Multivar. Anal. (1997).
- G.J. McLachlan, D. Peel, Finite Mixture Models, Wiley (2000).
- A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. (2000).
- A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. B (1977).
- G.J. McLachlan, T. Krishnan, The EM Algorithm and Extensions, Wiley (1997).
- S. Medasani, R. Krishnapuram, Determination of the number of components in Gaussian mixtures using agglomerative...
- K.L. Lange, R.J.A. Little, J.M.G. Taylor, Robust statistical modeling using the t distribution, J. Amer. Stat. Assoc. (1989).
- D. Peel, G.J. McLachlan, Robust mixture modelling using the t distribution, Stat. Comput. (2000).
- K. Rose, Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, Proc. IEEE (1998).
- N. Ueda, R. Nakano, Z. Ghahramani, G.E. Hinton, SMEM algorithm for mixture models, Neural Comput. (2000).