Elsevier

Neurocomputing

Volume 87, 15 June 2012, Pages 19-25

Refining Gaussian mixture model based on enhanced manifold learning

https://doi.org/10.1016/j.neucom.2012.01.029

Abstract

Gaussian mixture model (GMM) has been widely used for data analysis in various domains including text documents, face images and genes. GMM can be viewed as a simple linear superposition of Gaussian components, each of which represents a data cluster. Recent models, namely Laplacian regularized GMM (LapGMM) and locally consistent GMM (LCGMM), have been proposed to preserve the local manifold structure of the data when modeling Gaussian mixtures, and show performance superior to the original GMM. However, these two models ignore the global manifold structure, taking no account of widely separated points. In this paper, we introduce the refined Gaussian mixture model (RGMM), which explicitly places separated points far apart from each other and brings nearby points closer together according to the probability distributions of the Gaussians, in the hope of fully exploiting the discriminating power of manifold learning for estimating Gaussian mixtures. We use the EM algorithm to optimize the maximum likelihood function of RGMM. Experimental results on three real-world data sets demonstrate the effectiveness of RGMM in data clustering.

Introduction

The Gaussian mixture model (GMM [1]) is one of the most popular methods for data analysis [2], [3], [4]. Compared to a single Gaussian, GMM uses a mixture of Gaussians, which characterizes real data better by taking a linear combination of Gaussian densities. Each Gaussian density, called a component of the mixture, has its own mean and covariance. GMM is also commonly used to cluster data. We can introduce discrete latent variables, which define the cluster assignments of data points to specific components of the mixture, into the Gaussian distributions. Clustering can thus be achieved by estimating the parameters associated with the latent variables of the Gaussian mixture. To find maximum likelihood estimates in Gaussian mixtures with latent variables, a general method, the expectation–maximization (EM) algorithm, can be used. There is a close similarity between the k-means algorithm and the EM algorithm for Gaussian mixtures [1]: k-means performs clustering in a hard manner, assigning each data point to exactly one cluster, whereas the EM algorithm makes soft assignments based on the posterior probabilities.
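Concretely, the mixture density and the posterior responsibilities that drive the soft assignment take the standard textbook form (the notation below is ours, not copied from the paper):

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \gamma_k(\mathbf{x}) \equiv p(z = k \mid \mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},$$

where $\pi_k$ are the mixing weights and $z$ is the latent component indicator. The hard k-means assignment corresponds to the limit in which each $\gamma_k(\mathbf{x})$ collapses to 0 or 1.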

A limitation of the original GMM is that it only considers cases where the data is drawn from a Euclidean space. However, recent studies [5], [6] have shown that naturally occurring data, such as texts and images, cannot possibly "fill up" the ambient Euclidean space; rather, it tends to concentrate around, or lie close to, a lower-dimensional submanifold whose structure plays a fundamental role in developing various kinds of algorithms, including dimensionality reduction, clustering, supervised learning and semi-supervised learning algorithms [7], [8], [9], [10], [11], [12].

By incorporating this informative manifold structure, Laplacian regularized GMM (LapGMM [13]) and locally consistent GMM (LCGMM [14]) have been proposed to improve the original Gaussian mixture model. Based on the intuition that neighboring data points on the manifold are likely to be sampled from the same Gaussian component, the two models incorporate a regularization term built from the graph Laplacian [15] into the maximum likelihood function of the original GMM. In this way, the probability distributions over Gaussian components can vary smoothly along the geodesics of the data manifold, and the two models have therefore shown superior clustering results compared to the original GMM.
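Schematically, both models maximize a regularized log-likelihood of the form

$$\mathcal{L}_{\mathrm{reg}}(\Theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \;-\; \frac{\lambda}{2} \sum_{i,j} W_{ij}\, d(P_i, P_j),$$

where $W$ is the weight matrix of a nearest-neighbor graph, $P_i$ denotes the distribution $p(z \mid \mathbf{x}_i)$ over components for point $\mathbf{x}_i$, and $d(\cdot,\cdot)$ is a measure of discrepancy between such distributions (e.g., a squared difference of posteriors or a symmetrized KL divergence; the two papers differ in their exact choice). This compact form is our paraphrase rather than either paper's verbatim notation. Large weights $W_{ij}$ on neighboring pairs force $P_i$ and $P_j$ to agree, which is exactly the local smoothness along the manifold described above.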

However, these two models only enhance the similarity among the distributions over Gaussian components for data points within a neighborhood, while placing no constraint on points that are widely separated. In other words, the manifold learning in these two models considers only the locality information of the data geometry; the global structure is not well preserved.

In this paper, we present a novel Gaussian mixture model that aims to fully exploit the discriminating power of manifold learning, referred to as the refined Gaussian mixture model (RGMM). In RGMM, we refine the manifold learning to explicitly enhance the separability of the probability distributions associated with widely separated points, in addition to increasing the proximity of those associated with nearby points, in the hope of preserving the global manifold structure as well as maintaining local consistency. A similar idea of placing symmetric constraints on nearby and far-apart points on the manifold has been successfully applied to dimensionality reduction [16], [17] as well as other applications [18], with enhanced learning performance reported. We formulate RGMM by incorporating a regularization term into the standard GMM and use the expectation–maximization (EM) algorithm to solve the optimization problem. We provide experimental results on three real-world data sets, including text corpora, face images and clinical data, and demonstrate the effectiveness of RGMM.
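Although the full formulation is deferred to Section 3, the intent can be sketched as adding a repulsive counterpart to the attractive penalty used by LapGMM and LCGMM. Writing $\mathcal{E}_{\mathrm{near}}$ for the set of neighboring pairs and $\mathcal{E}_{\mathrm{far}}$ for a set of widely separated pairs, a schematic objective consistent with this description (our sketch, not the paper's exact equation) is

$$\mathcal{L}_{\mathrm{RGMM}}(\Theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \;-\; \lambda \Big( \sum_{(i,j) \in \mathcal{E}_{\mathrm{near}}} W_{ij}\, d(P_i, P_j) \;-\; \gamma \sum_{(i,j) \in \mathcal{E}_{\mathrm{far}}} d(P_i, P_j) \Big).$$

Maximizing this drives $d(P_i, P_j)$ down for nearby pairs and up for separated ones, realizing the symmetric local-plus-global constraint described above.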

The rest of this paper is organized as follows. We provide the background and review the related work in Section 2. We then formulate RGMM and give an EM procedure for optimizing its maximum likelihood problem in Section 3. The empirical evidence is presented in Section 4. Finally, concluding remarks are made in Section 5.


Background and notations

Let us begin by reviewing the formulation of Gaussian mixture models and providing a brief description of the related work.

Symmetrically discriminative GMM

In this section, we give a full exposition of our proposed model, the refined Gaussian mixture model (RGMM). We then use the expectation–maximization (EM) algorithm to solve the regularized maximum likelihood problem of RGMM.
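For orientation, the following minimal Python sketch implements the alternating structure of EM for a plain, unregularized GMM; it is an illustration only, and RGMM's actual M-step must additionally account for the manifold regularizer, which couples the responsibilities of different points and is derived in this section.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    # Minimal EM for a standard GMM (illustration only; no manifold regularizer).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = X[rng.choice(n, size=K, replace=False)]  # initialize means from the data
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = p(z_i = k | x_i)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
            for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft counts
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma, gamma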

Experiments

In this section, we conduct extensive experiments to evaluate our proposed RGMM on synthetic data and three real-world data sets covering texts, face images and clinical data. We compare the performance of RGMM against the following algorithms (an illustrative evaluation sketch follows the list):

  • Locally consistent Gaussian mixture model (LCGMM [14]).

  • Laplacian regularized Gaussian mixture model (LapGMM [13]).

  • The original Gaussian mixture model (GMM [1]).

  • Principal component analysis (PCA).
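As a rough illustration of how such a comparison is typically scored, the sketch below fits two of the baselines with scikit-learn and evaluates the resulting partitions with normalized mutual information (NMI); the digits data, the PCA dimensionality, and the metric are stand-in choices of ours, not the paper's actual corpora or protocol.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score
from sklearn.mixture import GaussianMixture

X, y = load_digits(return_X_y=True)                        # stand-in data set
X = PCA(n_components=20, random_state=0).fit_transform(X)  # PCA step, as in the baseline
K = len(np.unique(y))

gmm = GaussianMixture(n_components=K, random_state=0).fit_predict(X)  # soft model, hard labels
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)

print("GMM     NMI: %.3f" % normalized_mutual_info_score(y, gmm))
print("k-means NMI: %.3f" % normalized_mutual_info_score(y, km))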

Conclusions and future work

In this paper, we have motivated a novel Gaussian mixture model based on refined manifold learning. The proposed model encourages keeping the distributions over Gaussians associated with widely separated pairs relatively dissimilar, in addition to keeping those associated with points within a neighborhood relatively similar. In this respect, it is a refined manifold-learning-based GMM with local and global consistency. We name it the refined Gaussian mixture model (RGMM). We solve the resulting regularized maximum likelihood problem with the EM algorithm.

Acknowledgment

This work is supported by the “11th Five-Year Plan” National Scientific and Technological Support Key Project—Regional Health Data Center and Resource Sharing System (No. 2008BAH27B04).


References (22)

  • X. He, P. Niyogi, Locality preserving projections, in: Advances in Neural Information Processing Systems, vol. 16,...

JianFeng Shen received the MD degree in Medicine from Zhejiang University, China, in 2003. He is working in the Information Center of the Health Bureau of Zhejiang Province. His research interests include neurosurgery and health information.

Jiajun Bu received the BS and PhD degrees in Computer Science from Zhejiang University, China, in 1995 and 2000, respectively. He is a professor in the College of Computer Science, Zhejiang University. His research interests include embedded systems, data mining, information retrieval, and mobile databases.

Bin Ju received the MD degree in Computer Science from Zhejiang University, China, in 2005. He is currently a candidate for a PhD degree in Computer Science at Zhejiang University. His research interests include machine learning, data mining, and health information.

Tao Jiang received the MD degree in Medicine from Zhejiang University, China, in 2011. His research interests include health information, Java development, and data mining.

Hao Wu received the BS and MS degrees in Computer Science from Zhejiang University. His research interests include machine learning, data mining, and information retrieval.

Lanjuan Li is a member of the CAE (Chinese Academy of Engineering) and Professor/Chief Physician of Infectious Diseases. Her research interests include infectious diseases and health information.
