Highlighting data clusters by graph embedding
Introduction
Clustering refers to the task of grouping a set of objects such that objects in the same group are more similar to each other than to those in different groups. As a common technique for statistical data analysis, it has been studied and used in many fields [24], [15]. Different types of data call for different approaches to reveal the underlying cluster relationships. In this work, we are particularly interested in the cluster structures of high-dimensional data, which have attracted much recent research attention. High-dimensional data, with anywhere from tens to thousands or even millions of features, are often encountered in applications involving videos, images, texts and complex networks. Because such data are hard to reason about and impossible to visualize directly, high dimensionality poses significant challenges to modern data processing research [18].
As a treatment for this “curse of dimensionality”, dimensionality reduction techniques have been investigated over the past decades, particularly in the areas of statistics, neural networks, machine learning and computing sciences. These techniques try to reduce the number of random variables under consideration, under the assumption that high-dimensional data have an intrinsic dimension that is significantly lower than the number of features they appear to have. Common techniques include principal component analysis and metric multi-dimensional scaling [16], [9]. These two classical linear methods project the data from a high-dimensional space into a low-dimensional subspace by either maximizing the projected variance or best preserving the pairwise squared distances among the data.
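The variance-maximizing projection mentioned above can be illustrated with a minimal NumPy sketch of principal component analysis. The function name `pca_project` and the toy data are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal directions."""
    Xc = X - X.mean(axis=0)  # center each feature
    # The right singular vectors of the centered data are the principal axes.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:k].T  # n x k low-dimensional coordinates
    # Fraction of the total variance retained by the k leading directions.
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return Y, explained

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points lying close to a 2-D plane in a 10-D space.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
Y, explained = pca_project(X, 2)
```

Because the toy data have an intrinsic dimension of two, the two leading directions retain almost all of the variance, which is the situation the intrinsic-dimension assumption describes.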
Besides linear methods, much research has been devoted to nonlinear techniques. Well-known methods include the self-organizing map, the generative topographic mapping and related models [17], [3]. These methods can be regarded as a type of neural network trained by unsupervised learning to produce a low-dimensional representation of the input space of the samples, and they have been successfully applied to many challenging tasks [4].
More recent work on nonlinear techniques focuses on graph embedding methods. These embedding methods build upon but go beyond the classical linear solutions. They assume that the data come from a low-dimensional manifold embedded in a high-dimensional space, which is more general than the subspace assumption made by linear methods. The methods often start by building a sparse connectivity graph that describes the local relationships between each data point and its neighbors. The graph serves as an approximation to the underlying data manifold. With the graph, a compact representation of the data in a low-dimensional space can be obtained in different ways [8], [2], [11], [33], [28], [29], [30], [31]. These methods differ in which signatures of the underlying manifold they preserve, such as the geodesic distances between inputs and the local combination angles. These distinct features make the embedding algorithms applicable in different domains.
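The graph-construction step that these methods share can be sketched as follows. This is a minimal illustration that assumes Euclidean distances and a symmetric k-nearest-neighbor rule; the function name `knn_graph` and the toy data are hypothetical, and individual embedding methods may build their graphs differently.

```python
import numpy as np

def knn_graph(X, k):
    """Return a symmetric 0/1 adjacency matrix of a k-nearest-neighbor graph."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances.
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbor
    idx = np.argsort(D, axis=1)[:, :k]  # indices of the k closest points per row
    W = np.zeros((n, n), dtype=int)
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1
    # Keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T)

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated blobs of 20 points each in the plane.
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
W = knn_graph(X, k=3)
```

With well-separated groups and a small k, every edge stays inside a group, so the sparse graph already reflects the local structure that the embedding step then works with.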
In this paper, we develop a novel embedding method for the analysis of high-dimensional data and graphs. Compared with existing approaches, the proposed method seeks a low-dimensional depiction of graphs with the objective of highlighting inherent cluster structures by moving intra-cluster points together and pushing inter-cluster points apart. In empirical evaluation, we found that the method often produces cluster separations that are far more evident than those of other methods.
From a computational point of view, the objective function of the proposed model can be naturally relaxed and solved by a semi-definite program with a linear constraint, which provides an effective and efficient solution. Such a formulation is also flexible in incorporating prior knowledge which can often be expressed as linear equality or inequality constraints.
The paper is organized as follows. We first briefly review the necessary background on semi-definite programming, which plays an important role in the success of the proposed model. Then we describe the modularity embedding model in detail. Finally, we report our empirical evaluation of the method, with promising results, and conclude the work.
Section snippets
Background on semi-definite programming
Semi-definite programming (SDP) is a relatively new field which is of growing interest, and dramatic advances have been made recently [25], [32]. SDP deals with convex optimization problems over symmetric positive semi-definite matrix variables with linear objective function and linear constraints. It may be regarded as an extension of, but much more general than, linear programming.
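For concreteness, a common textbook way to write a semi-definite program in primal form is given below. The symbols $C$, $A_k$, $b_k$ and $X$ are notation assumed here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\min_{X} \quad & \operatorname{tr}(C X) \\
\text{s.t.} \quad & \operatorname{tr}(A_k X) = b_k, \qquad k = 1, \dots, p, \\
& X \succeq 0 ,
\end{aligned}
```

Here $C$ and $A_1, \dots, A_p$ are given real symmetric matrices, $b_1, \dots, b_p$ are given scalars, and $X \succeq 0$ means that the symmetric matrix variable $X$ is positive semi-definite. When $C$ and all $A_k$ are diagonal, only the diagonal of $X$ matters and the problem reduces to a linear program over nonnegative variables, which is the sense in which SDP extends linear programming.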
Denote by $\mathcal{S}^n$ the space of all real symmetric $n \times n$ matrices, equipped with the inner product
Model
Consider an undirected graph $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of edges connecting pairs of vertices in $V$. Let $w_{ij}$ be an element of the adjacency matrix $W$ of the graph, which gives the number of edges between vertices $v_i$ and $v_j$. We further denote by $m_i = \sum_j w_{ij}$ the degree of $v_i$ and by $m = \frac{1}{2} \sum_i m_i$ the total number of edges.
Our proposed work is based on the notion of “modularity” in the study of complex networks [26]. Assume that the degree $m_i$ associated with each vertex $v_i$ is
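As a point of reference, the modularity of a given partition can be computed as in the sketch below. It follows the widely used definition from the complex-networks literature, $Q = \frac{1}{2m} \sum_{ij} \left( w_{ij} - \frac{m_i m_j}{2m} \right) \delta(c_i, c_j)$, where $c_i$ is the cluster label of vertex $v_i$. The excerpt above is truncated before the paper states its own formulation, so this is an assumption about the underlying quantity rather than the paper's exact model; the function name and the example graph are hypothetical.

```python
import numpy as np

def modularity(W, labels):
    """Modularity Q of a partition of an undirected graph with adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    deg = W.sum(axis=1)  # vertex degrees m_i
    two_m = deg.sum()    # twice the total number of edges
    # Observed edge counts minus the counts expected under random wiring
    # that preserves the degrees.
    B = W - np.outer(deg, deg) / two_m
    same = labels[:, None] == labels[None, :]  # True where c_i == c_j
    return (B * same).sum() / two_m

# Hypothetical example: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1

q_split = modularity(W, [0, 0, 0, 1, 1, 1])  # one cluster per triangle
q_single = modularity(W, [0, 0, 0, 0, 0, 0])  # everything in one cluster
```

Splitting the graph into its two triangles gives a positive score, while placing all vertices in one cluster gives zero, which matches the intuition that modularity rewards partitions with more intra-cluster edges than expected by chance.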
Evaluation
We present visualization and evaluation results on a variety of real-world and synthetic datasets to explain the distinct features of the proposed model and show the improvement over other embedding methods in exploiting cluster structures.
Conclusion
Significant achievements have been witnessed in the study of graph embedding methods in recent years. These methods are based on rather different geometric intuitions and have different properties and different application domains. In this paper, we provide a novel graph embedding method. Compared with existing ones, our method focuses on exploiting and highlighting the cluster structures inherent in graphs. The method reports improved results in empirical evaluation and adds a useful tool to
Acknowledgments
The work is supported by The Science and Technology Development Fund (Project No. 044/2010/A and 006/2014/A), Macao SAR, China.
References (36)
- Revealing network communities with a nonlinear programming method, Inf. Sci. (2013)
- K. Bache, M. Lichman, UCI Machine Learning Repository,...
- M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural...
- et al., GTM: the generative topographic mapping, Neural Comput. (1998)
- Neural Networks for Pattern Recognition (1995)
- CSDP, a C library for semidefinite programming, Optim. Methods Softw. (1999)
- et al., On modularity clustering, IEEE Trans. Knowl. Data Eng. (2008)
- et al., A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Math. Program. (2003)
- Spectral Graph Theory (1997)
- et al., Multidimensional Scaling (2000)
- A direct formulation for sparse PCA using semidefinite programming, SIAM Rev.
- Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, Proc. Natl. Acad. Sci.
- Community structure in social and biological networks, Proc. Natl. Acad. Sci.
- Semidefinite programming in combinatorial optimization, Math. Program.
- Data clustering: a review, ACM Comput. Surv.
- Principal Component Analysis
- Self-Organizing Maps
Wenye Li is a faculty member at Macao Polytechnic Institute. His research interests are in convex optimization, probabilistic inference and machine learning techniques, with applications in data processing and social network analysis. He was educated at Shandong University, Chinese Academy of Sciences and The Chinese University of Hong Kong, all in computer science. Before taking up his current position at Macao Polytechnic Institute in 2009, he was a postdoctoral researcher at Alberta Ingenuity Center of Machine Learning.