Refining Gaussian mixture model based on enhanced manifold learning
Introduction
The Gaussian mixture model (GMM [1]) is one of the most popular methods for data analysis [2], [3], [4]. Compared to a single Gaussian, a GMM characterizes real data better by taking a linear combination of Gaussian densities. Each Gaussian density is called a component of the mixture and has its own mean and covariance. GMMs are also commonly used to cluster data: we can introduce discrete latent variables, which define the assignments of data points to specific components of the mixture, into the Gaussian distributions. Clustering can thus be achieved by estimating the parameters associated with the latent variables of the Gaussian mixture. To find maximum likelihood estimates in Gaussian mixtures with latent variables, a general method, the expectation–maximization (EM) algorithm, can be used. It has been shown that there is a close similarity between the k-means algorithm and the EM algorithm for Gaussian mixtures [1]: k-means performs hard clustering, assigning each data point to exactly one cluster, whereas the EM algorithm makes soft assignments based on posterior probabilities.
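The soft-versus-hard distinction can be made concrete with a minimal EM loop for a two-component one-dimensional mixture. This is an illustrative sketch on synthetic data (the toy setup and variable names are ours, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated clusters.
X = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])

# Initial parameters for K = 2 components: means, variances, mixing weights.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gaussian(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: posterior responsibilities (soft assignments).
    dens = pi * gaussian(X[:, None], mu, var)          # shape (N, K)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from responsibility-weighted data.
    Nk = resp.sum(axis=0)
    mu = (resp * X[:, None]).sum(axis=0) / Nk
    var = (resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(X)

hard = resp.argmax(axis=1)  # k-means-style hard assignment from the soft one
```

The E-step computes posterior responsibilities over components for each point; collapsing them with `argmax` recovers the hard, k-means-style partition mentioned above.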
A limitation of the original GMM is that it only considers cases where the data are drawn from a Euclidean space. However, recent studies [5], [6] have shown that naturally occurring data, such as texts and images, cannot "fill up" the ambient Euclidean space; rather, the data probably concentrate around or lie close to a lower-dimensional submanifold whose structure plays a fundamental role in developing various kinds of algorithms, including dimensionality reduction, clustering, supervised learning and semi-supervised learning algorithms [7], [8], [9], [10], [11], [12].
By incorporating the informative manifold structure, Laplacian regularized GMM (LapGMM [13]) and locally consistent GMM (LCGMM [14]) have been proposed to improve the original Gaussian mixture model. Based on the intuition that the neighboring data points on the manifold are likely to be sampled by the same Gaussian component, the two models incorporate a regularized term using the graph Laplacian [15] into the maximum likelihood function of the original GMM. In this way, the probability distributions over Gaussian components of data can vary smoothly along the geodesics of data manifold, and therefore the two models have shown superior results for data clustering compared to original GMM.
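The graph-Laplacian penalty used by such regularizers can be sketched numerically. The snippet below (an illustrative sketch with assumed binary kNN weights, not the exact construction of [13] or [14]) verifies the standard identity tr(PᵀLP) = ½ Σᵢⱼ Wᵢⱼ‖pᵢ − pⱼ‖², where row i of P holds point i's distribution over mixture components:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, k = 30, 3, 4
X = rng.normal(size=(N, 2))                    # toy data

# Symmetric binary k-nearest-neighbor affinity matrix W.
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)                   # exclude self-links
W = np.zeros((N, N))
for i in range(N):
    W[i, np.argsort(d2[i])[:k]] = 1.0
W = np.maximum(W, W.T)

L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian

# Row i of P: point i's posterior distribution over K components.
P = rng.dirichlet(np.ones(K), size=N)

# The smoothness penalty, in two equivalent forms:
pairwise = 0.5 * sum(W[i, j] * ((P[i] - P[j]) ** 2).sum()
                     for i in range(N) for j in range(N))
trace_form = np.trace(P.T @ L @ P)
```

The penalty is small exactly when neighboring points carry similar distributions over components, which is the smoothness-along-the-manifold intuition behind LapGMM and LCGMM.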
However, these two models only enhance the similarity among the distributions over Gaussian components of data points within a neighborhood, and place no constraint on points that are widely separated. In other words, the manifold learning in these two models considers only the local geometry of the data, while the global structure is not well preserved.
In this paper, we present a novel Gaussian mixture model, referred to as the refined Gaussian mixture model (RGMM), which aims to fully exploit the discriminating power of manifold learning. In RGMM, we refine the manifold learning step: in addition to increasing the proximity of the probability distributions of nearby points, we explicitly enhance the separability of the distributions associated with widely separated points, in the hope of preserving the global manifold structure while maintaining local consistency. The similar idea of placing symmetric constraints on nearby and far-apart points on the manifold has been successfully applied to dimensionality reduction [16], [17] as well as other applications [18], with enhanced learning performance reported. We formulate RGMM by incorporating a regularization term into the standard GMM and use the expectation–maximization (EM) algorithm to solve the resulting optimization problem. We provide experimental results on three kinds of real-world data, including text corpora, face images and clinical data, and demonstrate the effectiveness of RGMM.
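The symmetric idea can be sketched as penalizing disagreement between neighboring points while rewarding disagreement between far-apart pairs. The graph construction and the trade-off weight below are illustrative assumptions, not the exact RGMM regularizer (which is formulated in Section 3):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, k = 40, 3, 5
X = rng.normal(size=(N, 2))               # toy data

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)              # exclude self from neighbor search
neg = np.where(np.isinf(d2), -np.inf, d2)  # exclude self from far pairs

# W_near links each point to its k nearest neighbors,
# W_far to its k farthest points (an illustrative choice).
W_near = np.zeros((N, N))
W_far = np.zeros((N, N))
for i in range(N):
    W_near[i, np.argsort(d2[i])[:k]] = 1.0
    W_far[i, np.argsort(neg[i])[-k:]] = 1.0
W_near = np.maximum(W_near, W_near.T)
W_far = np.maximum(W_far, W_far.T)

def penalty(W, P):
    # 0.5 * sum_ij W_ij * ||p_i - p_j||^2: small when linked rows of P agree
    return 0.5 * sum(W[i, j] * ((P[i] - P[j]) ** 2).sum()
                     for i in range(N) for j in range(N))

P = rng.dirichlet(np.ones(K), size=N)     # posteriors over K components
gamma = 0.1                               # trade-off weight (assumed)
# Small when neighbors agree AND far-apart pairs disagree:
R = penalty(W_near, P) - gamma * penalty(W_far, P)
```

Minimizing a penalty of this shape pushes neighboring points toward the same component while keeping distant pairs apart, which is the local-plus-global consistency RGMM targets.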
The rest of this paper is organized as follows. We provide the background and overview the related work in Section 2. We then formulate RGMM and give an EM procedure for optimizing its regularized maximum likelihood problem in Section 3. The empirical evidence is presented in Section 4. Finally, some concluding remarks are made in Section 5.
Background and notations
Let us begin by reviewing the formulation of Gaussian mixture models and giving a brief description of the related work.
Symmetrically discriminative GMM
In this section, we give a full exposition of our proposed model, named the refined Gaussian mixture model (RGMM). We then use the expectation–maximization (EM) algorithm to solve the regularized maximum likelihood problem of RGMM.
Experiments
In this section, we conduct extensive experiments to evaluate our proposed RGMM on synthetic data and three real-world data sets including texts, face images and clinical data. We compare the performance of RGMM against the following algorithms:
Locally consistent Gaussian mixture model (LCGMM [14]).
Laplacian regularized Gaussian mixture model (LapGMM [13]).
The original Gaussian mixture model (GMM [1]).
Principal component analysis (PCA).
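As a toy stand-in for such a comparison (using scikit-learn's stock GMM and k-means on synthetic blobs rather than the paper's benchmarks or RGMM itself), one can score clusterings against ground-truth labels with the adjusted Rand index:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Toy stand-in for the paper's benchmarks (texts, faces, clinical data).
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels.
ari_gmm = adjusted_rand_score(y, gmm.predict(X))
ari_km = adjusted_rand_score(y, km.labels_)
```

On well-separated blobs both baselines score near 1.0; the paper's experiments are designed for manifold-structured data where such Euclidean baselines degrade.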
Conclusions and future work
In this paper, we have proposed a novel Gaussian mixture model based on refined manifold learning. The proposed model encourages keeping the distributions over Gaussians associated with widely separated pairs relatively dissimilar, in addition to keeping those associated with nearby points relatively similar. In this respect, it is a refined manifold-learning-based GMM with both local and global consistency. We name it the refined Gaussian mixture model (RGMM). We solve the regularized maximum likelihood problem of RGMM with an expectation–maximization (EM) algorithm.
Acknowledgment
This work is supported by the “11th Five-Year Plan” National Scientific and Technological Support Key Project—Regional Health Data Center and Resource Sharing System (No. 2008BAH27B04).
JianFeng Shen received the MD degree in Medicine from Zhejiang University, China, in 2003. He is working in Information Center of Health Bureau of Zhejiang Province. His research interests include neurosurgery and health information.
References (22)
- et al., "Image thresholding using a novel estimation method in generalized Gaussian distribution mixture modeling," Neurocomputing, 2008.
- "An iterative algorithm for entropy regularized likelihood learning on Gaussian mixture with automatic model selection," Neurocomputing, 2006.
- et al., "A gradient BYY harmony learning rule on Gaussian mixture with automated model selection," Neurocomputing, 2004.
- et al., "Stable local dimensionality reduction approaches," Pattern Recognition, 2009.
- et al., "Locality sensitive semi-supervised feature selection," Neurocomputing, 2008.
- et al., "Constrained Laplacian Eigenmap for dimensionality reduction," Neurocomputing, 2010.
- et al., "Two-step framework for highly nonlinear data unfolding," Neurocomputing, 2010.
- Pattern Recognition and Machine Learning, 2006.
- et al., "Nonlinear dimensionality reduction by locally linear embedding," Science, 2000.
- M. Belkin, P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," 2001, pp. ...
Jiajun Bu received the BS and PhD degrees in Computer Science from Zhejiang University, China, in 1995 and 2000, respectively. He is a professor in the College of Computer Science, Zhejiang University. His research interests include embedded systems, data mining, information retrieval, and mobile databases.
Bin Ju received the MD degree in Computer Science from Zhejiang University, China, in 2005. He is currently a candidate for a PhD degree in Computer Science at Zhejiang University. His research interests include machine learning, data mining, and health information.
Tao Jiang received the MD degree in Medicine from Zhejiang University, China, in 2011. His research interests include health information, Java development, and data mining.
Hao Wu received the BS and MS degrees in Computer Science from Zhejiang University. His research interests include machine learning, data mining, and information retrieval.
Lanjuan Li is a member of CAE (Chinese Academy of Engineering) and Professor/Chief Physician of Infectious Diseases. Her research interests include infectious disease and health information.