Fast spectral clustering method based on graph similarity matrix completion
Introduction
With the advances of data science and information technology, the structure of data acquired from real-world applications has become more diverse and complex. Traditional signal processing methods cannot be used directly to represent and process data sets with complex networks and topological structures, such as community networks [1], transportation networks [2], radar technology [3], [4], [5] and other data systems with latent non-Euclidean structures [6]. In recent years, the emergence and development of graph signal processing (GSP) has provided an efficient tool to describe and analyze such data on irregular grids. GSP mainly focuses on the representation, transformation, and processing of graph signals [7], [8]. A graph signal is defined as a set of data samples supported by a graph G = (V, E), where V and E represent the sets of vertices and edges, respectively. The vertices can be associated with traditional one-dimensional (1D) temporal signals, two-dimensional (2D) images, or more complex data. The edges describe the relationships between vertices, which provide more degrees of freedom to depict the topology among vertices; thus, graph signals can represent complex data with non-Euclidean structures. In addition, each vertex corresponds to a graph signal sample. Fig. 1 shows an example of a graph signal with five vertices and eight edges, where an edge connects two vertices v_i and v_j if they are related. The graph signal values are denoted by a data set f = {f_1, f_2, ..., f_N}, where f_i is the value of the signal on the ith vertex. In principle, f can represent signals on a sensor network, intensity of an image field, features on a social network, or any other measurable phenomena in a network.
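As a toy illustration of this notation, the following sketch builds a small graph signal in NumPy. The edge list and signal values are hypothetical, chosen only to match the vertex and edge counts mentioned for Fig. 1:

```python
import numpy as np

# Hypothetical 5-vertex, 8-edge graph echoing the Fig. 1 description.
# Each vertex carries one signal sample f_i; edges encode pairwise relations.
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

N = 5
A = np.zeros((N, N))             # adjacency (edge-weight) matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1.0      # undirected graph: symmetric weights

f = np.array([0.3, 1.2, -0.5, 0.8, 2.1])  # graph signal: one value per vertex

print(int(np.triu(A, 1).sum()))  # number of edges: 8
```

Any measurable network quantity (sensor readings, pixel intensities, user features) could play the role of f here.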
Spectral clustering (SC) is a widely used classification technique to reveal the internal structure of graph signal data, and it has found wide applications in the GSP field [9], [10]. The principle behind SC is to divide the entire graph into separate subgraphs by enhancing the intra-subgraph correlation while suppressing the inter-subgraph correlation [11]. First, the similarity between every pair of vertices is used to establish the similarity matrix and the Laplacian matrix. Then, SC clusters the data set into several groups based on the eigenvectors of the Laplacian matrix corresponding to its smallest eigenvalues. Shi et al. proposed the normalized-cut (Ncut) objective function to measure both the total dissimilarity between different subgraphs and the total similarity within the subgraphs [12]. Wang et al. proposed an SC method based on a similarity and dissimilarity criterion to improve the clustering performance [13]. In addition, Huang et al. proposed scalable spectral clustering and ultra-scalable ensemble clustering to improve the scalability and robustness [14]. Cai et al. proposed the landmark-based spectral clustering method for large-scale clustering problems [15]. Ding et al. considered pair constraints in the objective function of graph cuts, and derived a semi-supervised approximate SC method based on the hidden Markov random field [16]. For complex manifold data structures, the k-affinity propagation clustering algorithm was proposed based on a manifold similarity measure, which can automatically determine the appropriate number of clusters [17]. In addition, a dynamic incremental sampling method for Nyström spectral clustering was proposed, which used different probability distributions to select more representative sampling points, so as to better reflect the distribution of the data set [18].
SC methods have many advantages. They are simple and easy to implement and have global convergence properties [11]. However, computing the similarity matrix on high-dimensional data sets becomes time-consuming, which significantly reduces the computational efficiency of SC methods. The complexity arises because the similarity matrix is composed of the correlations between numerous pairs of high-dimensional data samples associated with different graph vertices. The complexity in calculating the correlation coefficients rapidly increases with the growth of data dimensionality and the number of vertices.
To circumvent this limitation, this paper proposes a fast SC method based on matrix completion. The proposed method is based on the following observation. In a similarity matrix, the elements in one row or one column represent the cross correlations of one data sample with all other samples. The correlation is often assessed by the normalized inner product between each pair of data samples. The rows or columns of the similarity matrix are therefore likely to be highly correlated with one another, so the similarity matrix can be inherently modeled as a low-rank matrix. Based on this assumption, we can first calculate a portion of the similarity matrix elements directly from the original data set, and then quickly retrieve the rest by low-rank matrix completion methods.
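The low-rank observation can be checked on a toy example. The data values below are illustrative, and for simplicity the full correlation matrix (with unit diagonal) is used, whereas the paper's similarity matrix zeroes the diagonal:

```python
import numpy as np

# Two clusters of samples sharing the same pattern: rows of the
# similarity matrix then repeat, making it exactly low rank.
a = np.array([1.0, 0.0, 1.0, 0.0])   # pattern shared by cluster 1
b = np.array([1.0, 1.0, 0.0, 0.0])   # pattern shared by cluster 2
X = np.vstack([a, a, a, b, b, b])    # six samples, two clusters

# Similarity = absolute Pearson correlation between sample pairs.
S = np.abs(np.corrcoef(X))

print(np.linalg.matrix_rank(S))      # 2
```

With identical rows within each cluster, S is exactly rank 2; real data only approximate this, which is why completion (rather than exact factorization) is needed.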
Low-rank matrix completion is a method that recovers all the elements of a matrix from an incomplete observation of its entries. Matrix completion was first modeled as a rank minimization problem [19], which is NP-hard and difficult to solve. To overcome this limitation, the rank minimization problem was relaxed into a convex optimization problem by replacing the rank function with the nuclear norm [20]. Subsequently, several algorithms were proposed to solve this problem, such as the singular value thresholding (SVT) algorithm [21], the fixed point and Bregman iterative algorithm [22], and the split Bregman algorithm based on the nuclear norm [23]. More recently, nonconvex and tighter surrogate functions were proposed for the matrix completion problem, such as the truncated nuclear norm [24], the capped norm [25], and the Schatten p-norm [26]. Li et al. proposed the Schatten capped p norm (SCp norm) [27], which approximates the rank function by combining the capped norm and the Schatten p-norm, and used the alternating direction method of multipliers (ADMM) to solve the SCp norm minimization problem.
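As a concrete illustration of the SVT algorithm cited above [21], the following is a minimal sketch. The function name, the step-size parameters tau and delta, and the test matrix are illustrative choices, not the settings used in this paper:

```python
import numpy as np

def svt_complete(M_obs, mask, tau=100.0, delta=1.5, n_iter=500):
    """Singular value thresholding (SVT) sketch for matrix completion.

    M_obs: matrix with observed entries (zeros elsewhere).
    mask:  boolean matrix marking the known positions.
    """
    Y = np.zeros_like(M_obs)                  # dual variable
    X = Y
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s = np.maximum(s - tau, 0.0)          # shrink singular values
        X = (U * s) @ Vt                      # low-rank estimate
        Y = Y + delta * mask * (M_obs - X)    # ascent step on known entries
    return X

# Recover a rank-1 matrix from 60% of its entries.
rng = np.random.default_rng(0)
u = rng.standard_normal(20)
M = np.outer(u, u)                            # ground-truth low-rank matrix
mask = rng.random(M.shape) < 0.6
X = svt_complete(M * mask, mask)
print(np.linalg.norm(X - M) / np.linalg.norm(M))  # relative recovery error
```

The nuclear-norm shrinkage step is what distinguishes SVT from a plain projection method; the nonconvex surrogates mentioned above replace this uniform shrinkage with rank-aware thresholding rules.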
To reduce the time complexity while retaining the matrix completion accuracy, this paper develops a new matrix completion method based on the SCp norm in conjunction with the split Bregman algorithm. First, a blue noise sampling mask is used to select a small set of elements from the similarity matrix, whose values are calculated from the original graph data set. After that, the unknown elements are rapidly retrieved by the SCp norm minimization method. Different from the method in [27], this paper uses the split Bregman algorithm instead of the ADMM algorithm to minimize the SCp norm, because the split Bregman algorithm is simpler to implement and thus effectively reduces the computing time. In addition, the singular values used in the proposed method are calculated by randomized singular value decomposition (SVD) to accelerate the computation. Finally, the completed similarity matrix can be used to perform the SC tasks. The method is then applied to two data types, namely 1D financial time series and 2D lithography layout images. The proposed method is shown to effectively improve the computational efficiency and obtain accurate clustering results under challenging subsampling rates.
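The randomized SVD step mentioned above can be sketched as follows. This is a generic Halko-style range finder followed by a small exact SVD, given as an assumption about the general technique rather than the authors' exact implementation; the parameter names and test matrix are illustrative:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Randomized SVD sketch: top-k singular factors of A."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis for range(A)
    B = Q.T @ A                             # small projected matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]     # lift back to the original space

# Sanity check on an exactly rank-3 matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 80))
U, s, Vt = randomized_svd(A, k=3)
print(np.linalg.norm((U * s) @ Vt - A) / np.linalg.norm(A))
```

Because only a thin sketch of the matrix is decomposed exactly, the cost per iteration drops from a full SVD to roughly O(mnk), which is what makes repeated thresholding steps affordable.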
The remainder of this paper is organized as follows. The basic concept and fundamentals of SC algorithm are briefly described in Section II. The proposed method for the completion of the similarity matrix is introduced in Section III. The simulations and analysis are presented in Section IV. Section V provides the conclusions.
Fundamental of spectral clustering
Consider an undirected weighted graph with N vertices. If the vertices v_i and v_j are connected by an edge, the edge weight between them is set to w_ij > 0; otherwise w_ij = 0. The edge weight can be formulated as the similarity between the data samples associated with different vertices, such as the absolute value of the Pearson correlation coefficient. In this case, the matrix W = [w_ij] is called the similarity matrix, with all diagonal elements equal to 0. It is …
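A minimal numerical sketch of this construction and the eigenvector-based bipartition it feeds into is given below. The graph and its weights are illustrative (two tight groups joined by one bridge edge), not data from the paper:

```python
import numpy as np

# Similarity (edge-weight) matrix W with zero diagonal: two triangles
# connected by a single bridge edge between vertices 2 and 3.
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

d = W.sum(axis=1)
L = np.diag(d) - W                  # unnormalized graph Laplacian
eigval, eigvec = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = eigvec[:, 1]              # eigenvector of 2nd-smallest eigenvalue
labels = (fiedler > 0).astype(int)  # sign split approximates the Ncut bipartition
print(labels)
```

The sign pattern of the Fiedler vector separates vertices {0, 1, 2} from {3, 4, 5}, cutting only the weak bridge edge; for more than two clusters, k-means on several leading eigenvectors replaces the sign test.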
Initialization of similarity matrix
In the initialization phase, the locations of a sparse set of non-diagonal elements of the similarity matrix are first selected using the blue noise sampling method. Then, the true values of these selected elements are calculated in advance from the original graph data set. These selected elements serve as the known entries in matrix completion. Blue noise sampling is used because it was proven to have better sampling uniformity than random sampling [32], [33]. The blue noise is regarded as the …
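The structural constraints on such a mask (symmetry, excluded diagonal) can be sketched as follows. Uniform random sampling is used here as a simple stand-in for the blue noise mask described in the text; only the symmetry and the excluded diagonal follow the paper's setup, and the function name and rate are illustrative:

```python
import numpy as np

def symmetric_sampling_mask(N, rate, seed=0):
    """Select a symmetric set of off-diagonal entries to compute exactly.

    Placeholder for a blue noise mask: uniform sampling of the upper
    triangle, mirrored so that w_ij and w_ji are observed together.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random((N, N)) < rate
    mask = np.triu(mask, 1)            # keep strict upper triangle only
    return mask | mask.T               # mirror; diagonal stays unobserved

mask = symmetric_sampling_mask(100, rate=0.3)
print(mask.diagonal().any(), np.array_equal(mask, mask.T))  # False True
```

Swapping the uniform draw for a blue noise pattern changes only how the upper-triangle positions are chosen, not the symmetrization.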
Simulation and analysis
This section provides the simulations to verify the proposed fast SC method based on two different data sets, i.e., the 2D lithography layout data and the 1D financial time series data. The details of the simulation settings, results, and analysis are presented as follows. In the future, we may introduce relevant ensemble clustering algorithms to improve the robustness of clustering results [43], [44].
Conclusion
This paper proposed the use of a matrix completion method to efficiently calculate the graph similarity matrix, based on which a fast SC approach was developed. The low-rank property of the graph similarity matrix was exploited to develop a new matrix completion method that reconstructs the entire similarity matrix from a small set of true element values. First, some elements in the similarity matrix were selected by blue noise sampling. Then, the low-rank matrix completion problem on the …
CRediT authorship contribution statement
Xu Ma: Methodology, Investigation, Conceptualization, Writing – review & editing, Project administration. Shengen Zhang: Software, Data curation, Methodology, Investigation, Writing – original draft, Validation. Karelia Pena-Pena: Conceptualization, Writing – review & editing, Methodology. Gonzalo R. Arce: Conceptualization, Supervision, Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was partially supported by the Fundamental Research Funds for the Central Universities (2020CX02002).
References (53)
- A semi-supervised approximate spectral clustering algorithm based on HMRF model, Inf. Sci., 2018.
- Graph wavelets for multiscale community mining, IEEE Trans. Signal Process., 2014.
- Wavelets on graphs with application to transportation networks, 17th International Conference on Intelligent Transportation Systems (ITSC), 2014.
- Efficient data transmission strategy for IIoTs with arbitrary geometrical array, IEEE Trans. Ind. Informat., 2021.
- Accurate detection and localization of UAV swarms-enabled MEC system, IEEE Trans. Ind. Informat., 2021.
- Parameterized centroid frequency-chirp rate distribution for LFM signal analysis and mechanisms of constant delay introduction, IEEE Trans. Signal Process., 2017.
- Networks: An Introduction, 2010.
- Vertex-Frequency Analysis of Graph Signals, 2019.
- The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Process. Mag., 2013.
- Compressive spectral clustering, Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
- Accelerated spectral clustering using graph filtering of random signals, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- A tutorial on spectral clustering, Stat. Comput.
- Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell.
- Spectral clustering based on similarity and dissimilarity criterion, Pattern Anal. Applic.
- Ultra-scalable spectral clustering and ensemble clustering, IEEE Trans. Knowl. Data Eng.
- Large scale spectral clustering via landmark-based sparse representation, IEEE Trans. Cybern.
- A k-AP clustering algorithm based on manifold similarity measure, International Conference on Intelligent Information Processing (IFIP AICT).
- A Nyström spectral clustering algorithm based on probability incremental sampling, Soft Comput.
- Exact matrix completion via convex optimization, Found. Comput. Math.
- Matrix rank minimization with applications.
- A singular value thresholding algorithm for matrix completion, SIAM J. Optim.
- Fixed point and Bregman iterative methods for matrix rank minimization, Math. Program.
- Matrix recovery using split Bregman, 22nd International Conference on Pattern Recognition (ICPR), 2014.
- Fast and accurate matrix completion via truncated nuclear norm regularization, IEEE Trans. Pattern Anal. Mach. Intell.
- Robust principal component analysis via capped norms, Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
- Low-rank matrix recovery via efficient Schatten p-norm minimization, Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI).