Elsevier

Neurocomputing

Volume 165, 1 October 2015, Pages 75-80

Highlighting data clusters by graph embedding

https://doi.org/10.1016/j.neucom.2014.07.085

Abstract

We propose a novel method, modularity embedding, to embed high-dimensional data or graphs in a low-dimensional space. Central to our work is a model that quantifies the relationship between two data points by their pairwise modular value. A larger value indicates a higher chance that they should be placed near each other, and vice versa. The model's objective function has a simple formulation: minimize the sum of squared distances between data points, weighted by the pairwise modular values. It relaxes naturally to a semi-definite program that learns a low-rank kernel matrix under a single linear constraint, which can be solved efficiently by modern mathematical optimization solvers. Compared with traditional graph embedding algorithms, the proposed method is shown to highlight cluster structures inherent in high-dimensional data and graphs, providing a promising tool for data analysis applications.

Introduction

Clustering refers to the task of grouping a set of objects such that objects in the same group are more similar to each other than to those in different groups. As a common technique for statistical data analysis, it has been studied and used in many fields [24], [15]. Different approaches are needed to reveal the underlying cluster relationships in different types of data. In this work, we are particularly interested in the cluster structures of high-dimensional data, which have attracted much recent research attention. High-dimensional data, with anywhere from tens to thousands or even millions of features, are often encountered in a variety of applications such as videos, images, texts and complex networks. Hard to reason about and impossible to visualize directly, high-dimensional spaces pose significant difficulties and challenges for modern data processing research [18].

As a treatment for this “curse of dimensionality”, dimensionality reduction techniques have been investigated over the past decades, particularly in statistics, neural networks, machine learning and computer science. These techniques try to reduce the number of random variables under consideration, under the assumption that high-dimensional data have an intrinsic dimension significantly lower than the number of features they appear to have. Common techniques include principal component analysis and metric multi-dimensional scaling [16], [9]. These two classical linear methods project the data from a high-dimensional space into a low-dimensional subspace by either maximizing the projected variance or best preserving the pairwise squared distances among the data.
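As a concrete illustration of the two classical linear methods, a minimal numpy sketch (our own toy example, not from the paper) is given below; a well-known fact it demonstrates is that on Euclidean data, classical metric MDS recovers the PCA scores up to a sign per component:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy high-dimensional data (hypothetical)

# PCA: project onto the top eigenvectors of the covariance matrix,
# maximizing the projected variance.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)
vals, vecs = np.linalg.eigh(cov)          # eigh returns ascending order
pca_emb = Xc @ vecs[:, ::-1][:, :2]       # scores on the top-2 components

# Classical metric MDS: double-center the squared-distance matrix and
# take the leading eigenvectors, best preserving pairwise squared distances.
n = len(X)
D2 = np.square(np.linalg.norm(Xc[:, None] - Xc[None, :], axis=-1))
J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
B = -0.5 * J @ D2 @ J                     # Gram matrix Xc @ Xc.T
w, V = np.linalg.eigh(B)
mds_emb = V[:, ::-1][:, :2] * np.sqrt(w[::-1][:2])
```

Because the double-centered matrix B equals the Gram matrix of the centered data, the two embeddings coincide up to the arbitrary sign of each eigenvector.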

Besides linear methods, much research has been devoted to nonlinear techniques. Well-known methods include the self-organizing map, the generative topographic map and related models [17], [3]. These methods can be regarded as a type of neural network trained with unsupervised learning to produce a low-dimensional representation of the input space of the samples, and they have been successfully applied to many challenging tasks [4].

More recent work on nonlinear techniques focuses on graph embedding methods. These methods build upon but go beyond the classical linear solutions. They assume that the data lie on a low-dimensional manifold embedded in a high-dimensional space, which is more general than the subspace assumption of linear methods. The methods typically start by building a sparse connectivity graph describing the local relationship between each data point and its neighbors. The graph serves as an approximation to the underlying data manifold. From the graph, a compact representation of the data in a low-dimensional space can be obtained in different ways [8], [2], [11], [33], [28], [29], [30], [31]. These methods differ in which signatures of the underlying manifold they preserve, such as the geodesic distances between inputs or the local combination angles. These distinct features make the embedding algorithms applicable in different domains.
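The first step shared by these methods, building a sparse connectivity graph from each point's neighbors, can be sketched as follows (a minimal numpy version of the standard k-nearest-neighbor construction, with our own names and binary weights assumed):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor adjacency matrix with binary weights."""
    n = len(X)
    # Pairwise Euclidean distances between all points.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # skip position 0: the point itself
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)              # symmetrize: edge if either end selects it
```

Symmetrizing by the elementwise maximum keeps the graph undirected, so each point ends up with at least k neighbors.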

In this paper, we develop a novel embedding method for the analysis of high-dimensional data and graphs. Compared with existing approaches, the proposed method seeks a low-dimensional depiction of graphs with the objective of highlighting inherent cluster structures: it pulls intra-cluster points together and pushes inter-cluster points apart. In empirical evaluation, we found that the method often produces far more evident separation of clusters than other methods.

From a computational point of view, the objective function of the proposed model can be naturally relaxed into a semi-definite program with a single linear constraint, which provides an effective and efficient solution. Such a formulation is also flexible in incorporating prior knowledge, which can often be expressed as linear equality or inequality constraints.
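The algebra behind such a relaxation can be sketched as follows. This is a minimal numpy illustration under our own assumptions about the objective (it uses Newman's modularity matrix as the pairwise weights and solves the resulting rank-one case in closed form), not the paper's exact program or a call to an SDP solver:

```python
import numpy as np

# With the kernel K = Y Y^T, squared distances become
#   ||y_i - y_j||^2 = K_ii + K_jj - 2 K_ij.
# If B is the modularity matrix, its rows sum to zero, so the weighted sum
#   sum_ij B_ij ||y_i - y_j||^2 = -2 tr(B K),
# a linear objective in K.  With K PSD and the single scale constraint
# tr(K) = n, the minimizer of -2 tr(B K) concentrates on the top
# eigenvector of B, which we compute directly here.

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(12, 12))
W = np.triu(A, 1) + np.triu(A, 1).T        # toy undirected graph
deg = W.sum(axis=1)
m = deg.sum() / 2
B = W - np.outer(deg, deg) / (2 * m)       # modularity matrix (rows sum to 0)

vals, vecs = np.linalg.eigh(B)
v = vecs[:, -1]                            # top eigenvector of B
K = len(B) * np.outer(v, v)                # optimal rank-one kernel, tr(K) = n
```

A real solver would be needed once prior knowledge adds further linear constraints; the closed form above only covers the single-constraint case.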

The paper is organized as follows: we briefly review the necessary background on semi-definite programming, which plays an important role in the success of the proposed model. Then we illustrate the modularity embedding model in detail. Finally we report our empirical evaluation of the method with promising results and conclude the work.

Section snippets

Background on semi-definite programming

Semi-definite programming (SDP) is a relatively young field of growing interest, in which dramatic advances have been made recently [25], [32]. SDP deals with convex optimization problems over symmetric positive semi-definite matrix variables with a linear objective function and linear constraints. It may be regarded as an extension of, but is much more general than, linear programming.

Denote by S^n the space of all n×n real symmetric matrices, equipped with the inner product ⟨X, Y⟩ = tr(XᵀY) = Σ_{i,j=1}^n x_ij y_ij.
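This identity, the trace form of the inner product agreeing with the elementwise sum, is easy to verify numerically on a pair of random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4)); X = X + X.T   # random symmetric matrices
Y = rng.normal(size=(4, 4)); Y = Y + Y.T

inner = np.trace(X.T @ Y)                  # <X, Y> = tr(X^T Y)
elementwise = (X * Y).sum()                # = sum_ij x_ij y_ij
```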

Model

Consider an undirected graph G=(V,E), where V={v_1,…,v_n} is a set of vertices and E is a set of edges connecting pairs of vertices in V. Let w_ij be an element of the adjacency matrix W of the graph, giving the number of edges between vertices v_i and v_j. We further denote by m_i = Σ_j w_ij the degree of v_i and by m = (1/2) Σ_i m_i the total number of edges.

Our proposed work is based on the notion of “modularity” in the study of complex networks [26]. Assume that the degree m_i associated with each vertex v_i is
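Although the snippet above is cut off, the notion it refers to is Newman's modularity [26]: the pairwise modular value w_ij − m_i m_j/(2m) compares the observed edge weight with the weight expected under a random degree-preserving model. A minimal sketch of the resulting quality score for a partition (our own toy graph and function names):

```python
import numpy as np

def modularity(W, labels):
    """Newman's modularity Q of a vertex partition of an undirected graph.

    Q = (1/2m) * sum_ij [w_ij - m_i m_j / (2m)] * delta(c_i, c_j)
    """
    deg = W.sum(axis=1)
    m = deg.sum() / 2.0
    B = W - np.outer(deg, deg) / (2.0 * m)   # pairwise modular values
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :] # delta(c_i, c_j)
    return (B * same).sum() / (2.0 * m)

# Toy graph: two disconnected triangles, the natural 2-cluster partition.
W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[a, b] = W[b, a] = 1.0
```

For the two-triangle graph, the natural partition [0,0,0,1,1,1] attains Q = 0.5, while lumping all vertices into one community gives Q = 0 because each row of the modularity matrix sums to zero.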

Evaluation

We present visualization and evaluation results on a variety of real-world and synthetic datasets to explain the distinct features of the proposed model and show the improvement over other embedding methods in exploiting cluster structures.

Conclusion

Significant achievements have been witnessed in the study of graph embedding methods in recent years. These methods are based on rather different geometric intuitions and have different properties and different application domains. In this paper, we provide a novel graph embedding method. Compared with existing ones, our method focuses on exploiting and highlighting the cluster structures inherent in graphs. The method reports improved results in empirical evaluation and adds a useful tool to

Acknowledgments

The work is supported by The Science and Technology Development Fund (Project Nos. 044/2010/A and 006/2014/A), Macao SAR, China.

Wenye Li is a faculty member at Macao Polytechnic Institute. His research interests are in convex optimization, probabilistic inference and machine learning techniques, with applications in data processing and social network analysis. He received his education at Shandong University, the Chinese Academy of Sciences and The Chinese University of Hong Kong, all in computer science. Before taking his current position at Macao Polytechnic Institute in 2009, he was a postdoctoral researcher at the Alberta Ingenuity Center of Machine Learning.

References (36)

  • W. Li, Revealing network communities with a nonlinear programming method, Inf. Sci. (2013)
  • K. Bache, M. Lichman, UCI Machine Learning Repository, ...
  • M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in Neural ...
  • C. Bishop et al., GTM: the generative topographic mapping, Neural Comput. (1998)
  • C.M. Bishop, Neural Networks for Pattern Recognition (1995)
  • B. Borchers, CSDP, a C library for semidefinite programming, Optim. Methods Softw. (1999)
  • U. Brandes et al., On modularity clustering, IEEE Trans. Knowl. Data Eng. (2008)
  • S. Burer et al., A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Math. Program. (2003)
  • F. Chung, Spectral Graph Theory (1997)
  • T. Cox et al., Multidimensional Scaling (2000)
  • A. d'Aspremont et al., A direct formulation for sparse PCA using semidefinite programming, SIAM Rev. (2007)
  • D.L. Donoho et al., Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, Proc. Natl. Acad. Sci. (2003)
  • M. Girvan et al., Community structure in social and biological networks, Proc. Natl. Acad. Sci. (2002)
  • M.X. Goemans, Semidefinite programming in combinatorial optimization, Math. Program. (1997)
  • H. Hu, Y. van Gennip, B. Hunter, M.A. Porter, A.L. Bertozzi, Multislice modularity optimization in community detection ...
  • A. Jain et al., Data clustering: a review, ACM Comput. Surv. (1999)
  • I. Jolliffe, Principal Component Analysis (2002)
  • T. Kohonen, Self-Organizing Maps (2000)