
Signal Processing

Volume 189, December 2021, 108301

Fast spectral clustering method based on graph similarity matrix completion

https://doi.org/10.1016/j.sigpro.2021.108301

Highlights

  • The calculation of the graph similarity matrix in spectral clustering is computationally expensive for large high-dimensional data sets.

  • The similarity matrix can be obtained using matrix completion method to improve the computational efficiency.

  • The Schatten capped p norm used in this paper unifies the rank function, the nuclear norm, and the Schatten p norm.

  • The split Bregman algorithm based on the Schatten capped p norm and randomized singular value decomposition is developed to accelerate the convergence of matrix completion.

Abstract

Spectral clustering (SC) is a widely used technique for unsupervised classification of graph signals. However, SC can be computationally intensive due to the need to calculate the graph similarity matrix on large high-dimensional data sets. This paper proposes an efficient SC method that rapidly obtains the similarity matrix using a matrix completion algorithm. First, a portion of the elements in the similarity matrix is selected by a blue noise sampling mask, and their similarity values are calculated directly from the original data set. After that, a split Bregman algorithm based on the Schatten capped p norm is developed to rapidly retrieve the remaining matrix elements. Finally, spectral clustering is performed on the completed similarity matrix. A set of simulations on different data sets is used to assess the performance of the proposed method. It is shown that, for a sufficiently large sampling rate, the proposed method accurately recovers the completed similarity matrix and attains good clustering results while improving computational efficiency.

Introduction

With the advances of data science and information technology, the structure of data acquired from real-world applications has become more diverse and complex. Traditional signal processing methods cannot directly represent and process data sets with complex network and topological structures, such as community networks [1], transportation networks [2], radar systems [3], [4], [5], and other data systems with latent non-Euclidean structures [6]. In recent years, the emergence and development of graph signal processing (GSP) has provided an efficient tool to describe and analyze such data on irregular grids. GSP mainly focuses on the representation, transformation, and processing of graph signals [7], [8]. A graph signal is defined as a set of data samples supported by a graph G = {V, E}, where V and E represent the sets of vertices and edges, respectively. The vertices can be associated with traditional one-dimensional (1D) temporal signals, two-dimensional (2D) images, or more complex data. The edges describe the relationships between vertices, which provide additional degrees of freedom to depict the topology among the vertices; thus, graph signals can represent complex data with non-Euclidean structures. In addition, each vertex corresponds to a graph signal sample. Fig. 1 shows an example of a graph signal with five vertices V = {v_i | i = 1, ..., 5} and eight edges E = {w_{i,j} | v_i and v_j are connected}. The graph signal values are denoted by a data set f = {f(i) | i = 1, ..., 5}, where f(i) is the value of the signal on the i-th vertex. In principle, f(i) can represent signals on a sensor network, intensities of an image field, features on a social network, or any other measurable phenomena on a network.
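For concreteness, a graph signal of this kind can be stored as a symmetric weight matrix together with a vector of vertex samples. The sketch below uses illustrative weights (not the actual values of Fig. 1), keeping only the structure of a 5-vertex, 8-edge undirected graph:

```python
import numpy as np

# Hypothetical 5-vertex graph in the spirit of Fig. 1, stored as a
# symmetric weight matrix: entry (i, j) holds the edge weight w_{i,j},
# and 0 means vertices i and j are not connected.
W = np.array([
    [0.0, 0.5, 0.2, 0.0, 0.1],
    [0.5, 0.0, 0.4, 0.3, 0.2],
    [0.2, 0.4, 0.0, 0.6, 0.0],
    [0.0, 0.3, 0.6, 0.0, 0.7],
    [0.1, 0.2, 0.0, 0.7, 0.0],
])

# The graph signal assigns one data sample f(i) to each vertex.
f = np.array([1.2, -0.3, 0.8, 2.1, 0.0])

# Undirected graph -> symmetric weights; each nonzero in the upper
# triangle is one edge.
num_edges = int(np.count_nonzero(np.triu(W)))
print(num_edges)  # -> 8
```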

Spectral clustering (SC) is a widely used classification technique that reveals the internal relations of graph signal data, and it has found wide application in the GSP field [9], [10]. The principle behind SC is to divide the entire graph into separate subgraphs by enhancing the intra-subgraph correlation while suppressing the inter-subgraph correlation [11]. First, the similarity between every pair of vertices is used to establish the similarity matrix and the Laplacian matrix. Then, SC clusters the data set into several groups based on the eigenvectors of the Laplacian matrix corresponding to its smallest eigenvalues. Shi et al. proposed the normalized-cut (Ncut) objective function to measure both the total dissimilarity between different subgraphs and the total similarity within the subgraphs [12]. Wang et al. proposed an SC method based on a similarity and dissimilarity criterion to improve the clustering performance [13]. In addition, Huang et al. proposed scalable spectral clustering and ultra-scalable ensemble clustering to improve scalability and robustness [14]. Cai et al. proposed a landmark-based spectral clustering method for large-scale clustering problems [15]. Ding et al. incorporated pairwise constraints into the objective function of graph cuts and derived a semi-supervised approximate SC method based on the hidden Markov random field [16]. For complex manifold data structures, the k-affinity propagation clustering algorithm was proposed based on a manifold similarity measure, which can automatically determine the appropriate number of clusters [17]. In addition, a dynamic incremental sampling method for Nyström spectral clustering was proposed, which uses different probability distributions to select more representative sampling points and thus better reflect the distribution of the data set [18].
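The basic SC recipe described here (similarity matrix → normalized Laplacian → eigenvectors of the smallest eigenvalues → k-means on the spectral embedding) can be sketched in a few lines. This is a minimal illustration of the standard pipeline, not the accelerated method proposed in this paper; the simple farthest-point k-means initialization is our own choice for determinism:

```python
import numpy as np

def spectral_clustering(W, k):
    """Minimal SC sketch: similarity matrix W -> k cluster labels."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)       # ascending eigenvalues
    U = eigvecs[:, :k]                              # k smallest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # Deterministic farthest-point seeding, then a few Lloyd iterations.
    idx = [0]
    for _ in range(1, k):
        d2 = np.min(((U[:, None] - U[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d2)))
    centers = U[idx]
    for _ in range(50):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Two obvious communities: vertices {0,1,2} and {3,4,5} are strongly
# connected internally and only weakly across.
W = np.full((6, 6), 0.01)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 0.0)
labels = spectral_clustering(W, 2)
print(labels)
```

On this toy similarity matrix the two blocks are recovered as the two clusters.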

SC methods have many advantages: they are simple to implement and have global convergence properties [11]. However, computing the similarity matrix on high-dimensional data sets is time-consuming, which significantly reduces the computational efficiency of SC methods. The complexity arises because the similarity matrix is composed of the correlations between numerous pairs of high-dimensional data samples associated with different graph vertices, and the cost of calculating these correlation coefficients grows rapidly with the data dimensionality and the number of vertices.

To circumvent this limitation, this paper proposes a fast SC method based on matrix completion. The proposed method is based on the following observations. In a similarity matrix, the elements in one row or one column represent the cross correlation of data samples to all other samples. The correlation is often assessed by the normalized inner product between each pair of data samples. Thus, the rows or columns in the similarity matrix are likely highly correlated, and thus the similarity matrix can be inherently modeled as a low-rank matrix. Based on this assumption, we can first calculate a portion of the similarity matrix elements directly from the original data sets, and then quickly retrieve the others by low-rank matrix completion methods.
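The low-rank intuition behind this observation can be checked numerically. In the synthetic example below (a toy stand-in, not one of the paper's data sets), the samples live in a low-dimensional latent subspace, so the normalized-inner-product similarity matrix has numerical rank equal to the latent dimension, far below its size:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 samples drawn from a 3-dimensional latent subspace, embedded in
# 50 dimensions: their pairwise similarities are highly redundant.
latent = rng.standard_normal((200, 3))
basis = rng.standard_normal((3, 50))
X = latent @ basis

# Similarity = normalized inner product (cosine) between every pair of rows.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T                                   # 200 x 200 similarity matrix

# The numerical rank is bounded by the latent dimension, far below 200.
svals = np.linalg.svd(S, compute_uv=False)
numerical_rank = int((svals > 1e-8 * svals[0]).sum())
print(numerical_rank)  # -> 3
```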

Low-rank matrix completion recovers all the elements of a matrix from an incomplete observation of its entries. Matrix completion was first formulated as a rank minimization problem [19], which is NP-hard and difficult to solve. To overcome this limitation, the rank minimization problem was relaxed into a convex optimization problem by replacing the rank function with the nuclear norm [20]. Subsequently, several algorithms were proposed to solve this problem, such as the singular value thresholding (SVT) algorithm [21], the fixed point and Bregman iterative algorithm [22], and the split Bregman algorithm based on the nuclear norm [23]. More recently, nonconvex surrogate functions that approximate the rank more tightly were proposed for the matrix completion problem, such as the truncated nuclear norm [24], the capped norm [25], and the Schatten p norm [26]. Li et al. proposed the Schatten capped p norm (SCp norm) [27], which approximates the rank function by combining the capped norm and the Schatten p norm, and used the alternating direction method of multipliers (ADMM) to solve the SCp norm minimization problem.

To reduce the time complexity while retaining the matrix completion accuracy, this paper develops a new matrix completion method based on the SCp norm in conjunction with the split Bregman algorithm. First, a blue noise sampling mask is used to select a small set of elements from the similarity matrix, whose values are calculated from the original graph data set. After that, the unknown elements are rapidly retrieved by the SCp norm minimization method. Different from the method in [27], this paper uses the split Bregman algorithm instead of the ADMM algorithm to minimize the SCp norm, since the split Bregman algorithm is simpler to implement and thus effectively reduces the computing time. In addition, the singular values used in the proposed method are calculated by randomized singular value decomposition (SVD) to further accelerate the computation. Finally, the completed similarity matrix is used to conduct the SC tasks. The method is then applied in two applications, namely 1D financial time series data and 2D lithography layout images. The proposed method is shown to effectively improve the computational efficiency and obtain accurate clustering results under challenging subsampling rates.
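The randomized SVD step mentioned above can be sketched as follows. This is a generic range-projection randomized SVD in the Halko–Martinsson–Tropp style; the exact variant and parameters used in the paper may differ:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, rng_seed=0):
    """Randomized SVD sketch: approximate top-k singular factors of A.

    Projects A onto a random estimate of its range, then takes an exact
    SVD of the small projected matrix -- much cheaper than a full SVD
    when k << min(A.shape).
    """
    rng = np.random.default_rng(rng_seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal range basis
    B = Q.T @ A                                       # small (k+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Accuracy check on an exactly rank-5 matrix: the random projection
# captures the whole range, so the reconstruction is essentially exact.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 150))
U, s, Vt = randomized_svd(A, 5)
err = np.linalg.norm(U @ np.diag(s) @ Vt - A) / np.linalg.norm(A)
print(err < 1e-8)  # -> True
```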

The remainder of this paper is organized as follows. The basic concepts and fundamentals of the SC algorithm are briefly described in Section II. The proposed method for the completion of the similarity matrix is introduced in Section III. The simulations and analysis are presented in Section IV. Section V provides the conclusions.

Section snippets

Fundamentals of spectral clustering

Consider an undirected weighted graph G = {V, E} with N vertices. If the vertices v_i and v_j are connected by an edge, the edge weight between them is set to w_{i,j} = w_{j,i} > 0; otherwise w_{i,j} = 0. The edge weight can be formulated as the similarity between the data samples associated with different vertices, such as the absolute value of the Pearson correlation coefficient. In this case, the matrix W* = [w*_{i,j}] ∈ R^{N×N}, i, j = 1, ..., N, is called the similarity matrix, with all diagonal elements equal to 0. It is
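The edge-weight definition above, with the absolute Pearson correlation as the similarity and a zero diagonal, can be implemented directly (a minimal illustration on random data):

```python
import numpy as np

def similarity_matrix(X):
    """Similarity matrix with w_{i,j} = |Pearson correlation| between the
    data samples on vertices i and j, and all diagonal elements set to 0."""
    C = np.corrcoef(X)           # rows of X are the per-vertex data samples
    W = np.abs(C)
    np.fill_diagonal(W, 0.0)     # no self-similarity on the diagonal
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 100))   # 6 vertices, 100-dimensional samples
W = similarity_matrix(X)
print(np.allclose(W, W.T))  # -> True (symmetric, as for an undirected graph)
```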

Initialization of similarity matrix

In the initialization phase, the locations of a sparse set of non-diagonal elements of the similarity matrix are first selected using the blue noise sampling method. Then, the true values of these selected elements are calculated in advance from the original graph data set; they serve as the known entries in matrix completion. Blue noise sampling is used because it has been proved to provide better sampling uniformity than random sampling [32], [33]. The blue noise is regarded as the
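A simplified stand-in for this initialization step is sketched below. It builds a symmetric mask over the off-diagonal entries using plain uniform random sampling; the paper's blue noise mask spreads the samples more evenly, but the role of the mask (selecting which similarity values to compute directly) is the same:

```python
import numpy as np

def offdiagonal_sampling_mask(N, rate, rng_seed=0):
    """Symmetric sampling mask over the non-diagonal entries of an N x N
    similarity matrix. Uniform random sampling is used here as a simple
    stand-in for the blue noise mask described in the text."""
    rng = np.random.default_rng(rng_seed)
    mask = np.zeros((N, N), dtype=bool)
    iu, ju = np.triu_indices(N, k=1)        # strictly upper-triangular pairs
    chosen = rng.random(iu.size) < rate     # sample each pair with prob. rate
    mask[iu[chosen], ju[chosen]] = True
    mask |= mask.T                          # symmetry: w_{i,j} = w_{j,i}
    return mask

mask = offdiagonal_sampling_mask(100, 0.3)
# Diagonal is never sampled; the mask is symmetric by construction.
print(mask.diagonal().any(), np.array_equal(mask, mask.T))  # -> False True
```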

Simulation and analysis

This section presents simulations to verify the proposed fast SC method on two different data sets, i.e., the 2D lithography layout data and the 1D financial time series data. The details of the simulation settings, results, and analysis are presented as follows. In the future, we may introduce relevant ensemble clustering algorithms to improve the robustness of the clustering results [43], [44].

Conclusion

This paper proposed a matrix completion method to efficiently calculate the graph similarity matrix, based on which a fast SC approach was developed. The paper exploited the low-rank property of the graph similarity matrix and developed a new matrix completion method to reconstruct the entire similarity matrix from a small set of true element values. First, some elements of the similarity matrix were selected by blue noise sampling. Then, the low-rank matrix completion problem on the

CRediT authorship contribution statement

Xu Ma: Methodology, Investigation, Conceptualization, Writing – review & editing, Project administration. Shengen Zhang: Software, Data curation, Methodology, Investigation, Writing – original draft, Validation. Karelia Pena-Pena: Conceptualization, Writing – review & editing, Methodology. Gonzalo R. Arce: Conceptualization, Supervision, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partially supported by Fundamental Research Funds for the Central Universities (2020CX02002).

References (53)

  • S. Ding et al.

    A semi-supervised approximate spectral clustering algorithm based on HMRF model

    Inf. Sci.

    (2018)
  • N. Tremblay et al.

    Graph wavelets for multiscale community mining

    IEEE Trans. Signal Process.

    (2014)
  • D.M. Mohan et al.

    Wavelets on graphs with application to transportation networks

    17th International Conference on Intelligent Transportation Systems (ITSC)

    (2014)
  • J. Zheng et al.

    Efficient data transmission strategy for IIoTs with arbitrary geometrical array

    IEEE Trans. Ind. Informat.

    (2021)
  • J. Zheng et al.

    Accurate detection and localization of UAV swarms-enabled MEC system

    IEEE Trans. Ind. Informat.

    (2021)
  • J. Zheng et al.

    Parameterized centroid frequency-chirp rate distribution for LFM signal analysis and mechanisms of constant delay introduction

    IEEE Trans. Signal Process.

    (2017)
  • M. Newman

    Network: An introduction

    (2010)
  • L. Stanković et al.

    Vertex-Frequency Analysis of Graph Signals

    (2019)
  • D.I. Shuman et al.

    The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains

    IEEE Signal Process. Mag.

    (2013)
  • N. Tremblay et al.

    Compressive spectral clustering

    Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML)

    (2016)
  • N. Tremblay et al.

    Accelerated spectral clustering using graph filtering of random signals

    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2016)
  • U. von Luxburg

    A tutorial on spectral clustering

    Stat. Comput.

    (2007)
  • J. Shi et al.

    Normalized cuts and image segmentation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • B. Wang et al.

    Spectral clustering based on similarity and dissimilarity criterion

    Pattern Anal. Applic.

    (2017)
  • D. Huang et al.

    Ultra-scalable spectral clustering and ensemble clustering

    IEEE Tran. Knowl. Data Eng.

    (2020)
  • D. Cai et al.

    Large scale spectral clustering via landmark-based sparse representation

    IEEE Trans. Cybern.

    (2015)
  • H. Jia et al.

    A k-AP clustering algorithm based on manifold similarity measure

    International Conference on Intelligent Information Processing (IFIPAICT)

    (2018)
  • H. Jia et al.

    A Nyström spectral clustering algorithm based on probability incremental sampling

    Soft Comput.

    (2017)
  • E.J. Candès et al.

    Exact matrix completion via convex optimization

    Found. Comput. Math.

    (2009)
  • M. Fazel

    Matrix rank minimization with applications

    (2002)
  • J. Cai et al.

    A singular value thresholding algorithm for matrix completion

    SIAM J. Optim.

    (2010)
  • S. Ma et al.

    Fixed point and Bregman iterative methods for matrix rank minimization

    Math. Program.

    (2011)
  • A. Gogna et al.

    Matrix recovery using split Bregman

    2014 22nd International Conference on Pattern Recognition (ICPR)

    (2014)
  • Y. Hu et al.

    Fast and accurate matrix completion via truncated nuclear norm regularization

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • Q. Sun et al.

    Robust principal component analysis via capped norms

    Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD)

    (2013)
  • F. Nie et al.

    Low-rank matrix recovery via efficient schatten p-norm minimization

    Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI)

    (2012)