
Signal Processing

Volume 189, December 2021, 108301

Fast spectral clustering method based on graph similarity matrix completion

https://doi.org/10.1016/j.sigpro.2021.108301

Highlights

  • The calculation of the graph similarity matrix in spectral clustering is computationally expensive for large high-dimensional data sets.

  • The similarity matrix can be obtained using matrix completion method to improve the computational efficiency.

  • The Schatten capped p norm used in this paper unifies the rank function, the nuclear norm, and the Schatten p norm.

  • The split Bregman algorithm based on the Schatten capped p norm and randomized singular value decomposition is developed to accelerate the convergence of matrix completion.

Abstract

Spectral clustering (SC) is a widely used technique for unsupervised classification of graph signals. However, SC can be computationally intensive due to the need to calculate the graph similarity matrix on large high-dimensional data sets. This paper proposes an efficient SC method that rapidly obtains the similarity matrix using a matrix completion algorithm. First, a portion of the elements in the similarity matrix is selected by a blue noise sampling mask, and their similarity values are calculated directly from the original data set. After that, a split Bregman algorithm based on the Schatten capped p norm is developed to rapidly retrieve the remaining matrix elements. Finally, spectral clustering is performed on the completed similarity matrix. A set of simulations on different data sets is used to assess the performance of the proposed method. It is shown that, for a sufficiently large sampling rate, the proposed method accurately recovers the completed similarity matrix and attains good clustering results while improving computational efficiency.

Introduction

With the advances of data science and information technology, the structure of data acquired from real-world applications has become more diverse and complex. Traditional signal processing methods cannot directly represent and process data sets with complex network and topological structures, such as community networks [1], transportation networks [2], radar systems [3], [4], [5], and other data systems with latent non-Euclidean structures [6]. In recent years, the emergence and development of graph signal processing (GSP) has provided an efficient tool to describe and analyze such data on irregular grids. GSP mainly focuses on the representation, transformation, and processing of graph signals [7], [8]. A graph signal is defined as a set of data samples supported by a graph G = {V, E}, where V and E represent the sets of vertices and edges, respectively. The vertices can be associated with traditional one-dimensional (1D) temporal signals, two-dimensional (2D) images, or more complex data. The edges describe the relationships between vertices, which provide additional degrees of freedom to depict the topology among the vertices; thus, graph signals can represent complex data with non-Euclidean structures. In addition, each vertex corresponds to a graph signal sample. Fig. 1 shows an example of a graph signal with five vertices V = {v_i | i = 1, ..., 5} and eight edges E = {w_{i,j} | v_i and v_j are connected}. The graph signal values are denoted by a data set f = {f(i) | i = 1, ..., 5}, where f(i) is the value of the signal on the i-th vertex. In principle, f(i) can represent signals on a sensor network, intensities of an image field, features on a social network, or any other measurable phenomena on a network.
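For concreteness, a graph signal of this kind can be stored as a symmetric weight matrix together with a vector of vertex samples. The sketch below uses illustrative weights (not the actual values of Fig. 1), keeping only the structure of a 5-vertex, 8-edge undirected graph:

```python
import numpy as np

# Hypothetical 5-vertex graph in the spirit of Fig. 1, stored as a
# symmetric weight matrix: entry (i, j) holds the edge weight w_{i,j},
# and 0 means vertices i and j are not connected.
W = np.array([
    [0.0, 0.5, 0.2, 0.0, 0.1],
    [0.5, 0.0, 0.4, 0.3, 0.2],
    [0.2, 0.4, 0.0, 0.6, 0.0],
    [0.0, 0.3, 0.6, 0.0, 0.7],
    [0.1, 0.2, 0.0, 0.7, 0.0],
])

# The graph signal assigns one data sample f(i) to each vertex.
f = np.array([1.2, -0.3, 0.8, 2.1, 0.0])

# Undirected graph -> symmetric weights; each nonzero in the upper
# triangle is one edge.
num_edges = int(np.count_nonzero(np.triu(W)))
print(num_edges)  # -> 8
```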

Spectral clustering (SC) is a widely used classification technique that reveals the internal relations of graph signal data, and it has found wide application in the GSP field [9], [10]. The principle behind SC is to divide the entire graph into separate subgraphs by enhancing the intra-subgraph correlation while suppressing the inter-subgraph correlation [11]. First, the similarity between every pair of vertices is used to establish the similarity matrix and the Laplacian matrix. Then, SC clusters the data set into several groups based on the eigenvectors of the Laplacian matrix corresponding to its smallest eigenvalues. Shi et al. proposed the normalized-cut (Ncut) objective function to measure both the total dissimilarity between different subgraphs and the total similarity within the subgraphs [12]. Wang et al. proposed an SC method based on a similarity and dissimilarity criterion to improve the clustering performance [13]. In addition, Huang et al. proposed scalable spectral clustering and ultra-scalable ensemble clustering to improve scalability and robustness [14]. Cai et al. proposed a landmark-based spectral clustering method for large-scale clustering problems [15]. Ding et al. incorporated pairwise constraints into the objective function of graph cuts and derived a semi-supervised approximate SC method based on the hidden Markov random field [16]. For complex manifold data structures, the k-affinity propagation clustering algorithm was proposed based on a manifold similarity measure, which can automatically determine the appropriate number of clusters [17]. In addition, a dynamic incremental sampling method for Nyström spectral clustering was proposed, which uses different probability distributions to select more representative sampling points and thus better reflect the distribution of the data set [18].
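The basic SC recipe described here (similarity matrix → normalized Laplacian → eigenvectors of the smallest eigenvalues → k-means on the spectral embedding) can be sketched in a few lines. This is a minimal illustration of the standard pipeline, not the accelerated method proposed in this paper; the simple farthest-point k-means initialization is our own choice for determinism:

```python
import numpy as np

def spectral_clustering(W, k):
    """Minimal SC sketch: similarity matrix W -> k cluster labels."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)       # ascending eigenvalues
    U = eigvecs[:, :k]                              # k smallest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # Deterministic farthest-point seeding, then a few Lloyd iterations.
    idx = [0]
    for _ in range(1, k):
        d2 = np.min(((U[:, None] - U[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d2)))
    centers = U[idx]
    for _ in range(50):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Two obvious communities: vertices {0,1,2} and {3,4,5} are strongly
# connected internally and only weakly across.
W = np.full((6, 6), 0.01)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 0.0)
labels = spectral_clustering(W, 2)
print(labels)
```

On this toy similarity matrix the two blocks are recovered as the two clusters.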

SC methods have many advantages: they are simple to implement and have global convergence properties [11]. However, computing the similarity matrix on high-dimensional data sets is time-consuming, which significantly reduces the computational efficiency of SC methods. The complexity arises because the similarity matrix is composed of the correlations between numerous pairs of high-dimensional data samples associated with different graph vertices, and the cost of calculating these correlation coefficients grows rapidly with the data dimensionality and the number of vertices.

To circumvent this limitation, this paper proposes a fast SC method based on matrix completion. The proposed method is based on the following observations. In a similarity matrix, the elements in one row or one column represent the cross correlation of data samples to all other samples. The correlation is often assessed by the normalized inner product between each pair of data samples. Thus, the rows or columns in the similarity matrix are likely highly correlated, and thus the similarity matrix can be inherently modeled as a low-rank matrix. Based on this assumption, we can first calculate a portion of the similarity matrix elements directly from the original data sets, and then quickly retrieve the others by low-rank matrix completion methods.
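The low-rank intuition behind this observation can be checked numerically. In the synthetic example below (a toy stand-in, not one of the paper's data sets), the samples live in a low-dimensional latent subspace, so the normalized-inner-product similarity matrix has numerical rank equal to the latent dimension, far below its size:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 samples drawn from a 3-dimensional latent subspace, embedded in
# 50 dimensions: their pairwise similarities are highly redundant.
latent = rng.standard_normal((200, 3))
basis = rng.standard_normal((3, 50))
X = latent @ basis

# Similarity = normalized inner product (cosine) between every pair of rows.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T                                   # 200 x 200 similarity matrix

# The numerical rank is bounded by the latent dimension, far below 200.
svals = np.linalg.svd(S, compute_uv=False)
numerical_rank = int((svals > 1e-8 * svals[0]).sum())
print(numerical_rank)  # -> 3
```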

Low-rank matrix completion recovers all the elements of a matrix from an incomplete observation of its entries. Matrix completion was first formulated as a rank minimization problem [19], which is NP-hard and difficult to solve. To overcome this limitation, the rank minimization problem was relaxed into a convex optimization problem by replacing the rank function with the nuclear norm [20]. Subsequently, several algorithms were proposed to solve this problem, such as the singular value thresholding (SVT) algorithm [21], the fixed point and Bregman iterative algorithm [22], and the split Bregman algorithm based on the nuclear norm [23]. More recently, nonconvex surrogate functions that approximate the rank more tightly were proposed for the matrix completion problem, such as the truncated nuclear norm [24], the capped norm [25], and the Schatten p norm [26]. Li et al. proposed the Schatten capped p norm (SCp norm) [27], which approximates the rank function by combining the capped norm and the Schatten p norm, and used the alternating direction method of multipliers (ADMM) to solve the SCp norm minimization problem.

To reduce the time complexity while retaining the matrix completion accuracy, this paper develops a new matrix completion method based on the SCp norm in conjunction with the split Bregman algorithm. First, a blue noise sampling mask is used to select a small set of elements from the similarity matrix, whose values are calculated from the original graph data set. After that, the unknown elements are rapidly retrieved by the SCp norm minimization method. Different from the method in [27], this paper uses the split Bregman algorithm instead of the ADMM algorithm to minimize the SCp norm, since the split Bregman algorithm is simpler to implement and thus effectively reduces the computing time. In addition, the singular values used in the proposed method are calculated by randomized singular value decomposition (SVD) to further accelerate the computation. Finally, the completed similarity matrix is used to conduct the SC tasks. The method is then applied in two applications, namely 1D financial time series data and 2D lithography layout images. The proposed method is shown to effectively improve the computational efficiency and obtain accurate clustering results under challenging subsampling rates.
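The randomized SVD step mentioned above can be sketched as follows. This is a generic range-projection randomized SVD in the Halko–Martinsson–Tropp style; the exact variant and parameters used in the paper may differ:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, rng_seed=0):
    """Randomized SVD sketch: approximate top-k singular factors of A.

    Projects A onto a random estimate of its range, then takes an exact
    SVD of the small projected matrix -- much cheaper than a full SVD
    when k << min(A.shape).
    """
    rng = np.random.default_rng(rng_seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal range basis
    B = Q.T @ A                                       # small (k+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Accuracy check on an exactly rank-5 matrix: the random projection
# captures the whole range, so the reconstruction is essentially exact.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 150))
U, s, Vt = randomized_svd(A, 5)
err = np.linalg.norm(U @ np.diag(s) @ Vt - A) / np.linalg.norm(A)
print(err < 1e-8)  # -> True
```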

The remainder of this paper is organized as follows. The basic concepts and fundamentals of the SC algorithm are briefly described in Section II. The proposed method for the completion of the similarity matrix is introduced in Section III. The simulations and analysis are presented in Section IV. Section V provides the conclusions.

Section snippets

Fundamentals of spectral clustering

Consider an undirected weighted graph G = {V, E} with N vertices. If the vertices v_i and v_j are connected by an edge, the edge weight between them is set to w_{i,j} = w_{j,i} > 0; otherwise w_{i,j} = 0. The edge weight can be formulated as the similarity between the data samples associated with different vertices, such as the absolute value of the Pearson correlation coefficient. In this case, the matrix W* = [w*_{i,j}] ∈ R^{N×N}, i, j = 1, ..., N, is called the similarity matrix, with all diagonal elements equal to 0. It is
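The edge-weight definition above, with the absolute Pearson correlation as the similarity and a zero diagonal, can be implemented directly (a minimal illustration on random data):

```python
import numpy as np

def similarity_matrix(X):
    """Similarity matrix with w_{i,j} = |Pearson correlation| between the
    data samples on vertices i and j, and all diagonal elements set to 0."""
    C = np.corrcoef(X)           # rows of X are the per-vertex data samples
    W = np.abs(C)
    np.fill_diagonal(W, 0.0)     # no self-similarity on the diagonal
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 100))   # 6 vertices, 100-dimensional samples
W = similarity_matrix(X)
print(np.allclose(W, W.T))  # -> True (symmetric, as for an undirected graph)
```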

Initialization of similarity matrix

In the initialization phase, the locations of a sparse set of non-diagonal elements of the similarity matrix are first selected using the blue noise sampling method. Then, the true values of these selected elements are calculated in advance from the original graph data set; they serve as the known entries in matrix completion. Blue noise sampling is used because it has been proved to provide better sampling uniformity than random sampling [32], [33]. The blue noise is regarded as the
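A simplified stand-in for this initialization step is sketched below. It builds a symmetric mask over the off-diagonal entries using plain uniform random sampling; the paper's blue noise mask spreads the samples more evenly, but the role of the mask (selecting which similarity values to compute directly) is the same:

```python
import numpy as np

def offdiagonal_sampling_mask(N, rate, rng_seed=0):
    """Symmetric sampling mask over the non-diagonal entries of an N x N
    similarity matrix. Uniform random sampling is used here as a simple
    stand-in for the blue noise mask described in the text."""
    rng = np.random.default_rng(rng_seed)
    mask = np.zeros((N, N), dtype=bool)
    iu, ju = np.triu_indices(N, k=1)        # strictly upper-triangular pairs
    chosen = rng.random(iu.size) < rate     # sample each pair with prob. rate
    mask[iu[chosen], ju[chosen]] = True
    mask |= mask.T                          # symmetry: w_{i,j} = w_{j,i}
    return mask

mask = offdiagonal_sampling_mask(100, 0.3)
# Diagonal is never sampled; the mask is symmetric by construction.
print(mask.diagonal().any(), np.array_equal(mask, mask.T))  # -> False True
```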

Simulation and analysis

This section presents simulations to verify the proposed fast SC method on two different data sets, i.e., the 2D lithography layout data and the 1D financial time series data. The details of the simulation settings, results, and analysis are presented as follows. In the future, we may introduce relevant ensemble clustering algorithms to improve the robustness of the clustering results [43], [44].

Conclusion

This paper proposed a matrix completion method to efficiently calculate the graph similarity matrix, based on which a fast SC approach was developed. The paper exploited the low-rank property of the graph similarity matrix and developed a new matrix completion method to reconstruct the entire similarity matrix from a small set of true element values. First, some elements of the similarity matrix were selected by blue noise sampling. Then, the low-rank matrix completion problem on the

CRediT authorship contribution statement

Xu Ma: Methodology, Investigation, Conceptualization, Writing – review & editing, Project administration. Shengen Zhang: Software, Data curation, Methodology, Investigation, Writing – original draft, Validation. Karelia Pena-Pena: Conceptualization, Writing – review & editing, Methodology. Gonzalo R. Arce: Conceptualization, Supervision, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partially supported by Fundamental Research Funds for the Central Universities (2020CX02002).

References (53)

  • S. Ding et al.

    A semi-supervised approximate spectral clustering algorithm based on HMRF model

    Inf. Sci.

    (2018)
  • N. Tremblay et al.

    Graph wavelets for multiscale community mining

    IEEE Trans. Signal Process.

    (2014)
  • D.M. Mohan et al.

    Wavelets on graphs with application to transportation networks

    17th International Conference on Intelligent Transportation Systems (ITSC)

    (2014)
  • J. Zheng et al.

    Efficient data transmission strategy for IIoTs with arbitrary geometrical array

    IEEE Trans. Ind. Informat.

    (2021)
  • J. Zheng et al.

    Accurate detection and localization of UAV swarms-enabled MEC system

    IEEE Trans. Ind. Informat.

    (2021)
  • J. Zheng et al.

    Parameterized centroid frequency-chirp rate distribution for LFM signal analysis and mechanisms of constant delay introduction

    IEEE Trans. Signal Process.

    (2017)
  • M. Newman

    Network: An introduction

    (2010)
  • L. Stanković et al.

    Vertex-Frequency Analysis of Graph Signals

    (2019)
  • D.I. Shuman et al.

    The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains

    IEEE Signal Process. Mag.

    (2013)
  • N. Tremblay et al.

    Compressive spectral clustering

    Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML)

    (2016)
  • N. Tremblay et al.

    Accelerated spectral clustering using graph filtering of random signals

    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2016)
  • U. von Luxburg

    A tutorial on spectral clustering

    Stat. Comput.

    (2007)
  • J. Shi et al.

    Normalized cuts and image segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • B. Wang et al.

    Spectral clustering based on similarity and dissimilarity criterion

    Pattern Anal. Applic.

    (2017)
  • D. Huang et al.

    Ultra-scalable spectral clustering and ensemble clustering

    IEEE Tran. Knowl. Data Eng.

    (2020)
  • D. Cai et al.

    Large scale spectral clustering via landmark-based sparse representation

    IEEE Trans. Cybern.

    (2015)
  • H. Jia et al.

    A k-AP clustering algorithm based on manifold similarity measure

    International Conference on Intelligent Information Processing (IFIPAICT)

    (2018)
  • H. Jia et al.

    A Nyström spectral clustering algorithm based on probability incremental sampling

    Soft Comput.

    (2017)
  • E.J. Candès et al.

    Exact matrix completion via convex optimization

    Found. Comput. Math.

    (2009)
  • M. Fazel

    Matrix rank minimization with applications

    (2002)
  • J. Cai et al.

    A singular value thresholding algorithm for matrix completion

    SIAM J. Optim.

    (2010)
  • S. Ma et al.

    Fixed point and Bregman iterative methods for matrix rank minimization

    Math. Program.

    (2011)
  • A. Gogna et al.

    Matrix recovery using split Bregman

    2014 22nd International Conference on Pattern Recognition (ICPR)

    (2014)
  • Y. Hu et al.

    Fast and accurate matrix completion via truncated nuclear norm regularization

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • Q. Sun et al.

    Robust principal component analysis via capped norms

    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD)

    (2013)
  • F. Nie et al.

    Low-rank matrix recovery via efficient schatten p-norm minimization

    Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI)

    (2012)