Pattern Recognition Letters

Volume 125, 1 July 2019, Pages 488-493

A general framework for scalable spectral clustering based on document models

https://doi.org/10.1016/j.patrec.2019.06.010

Abstract

We propose a new, general framework for fast, approximate spectral clustering on large data sets. We first consider the special setting of cosine similarity for clustering sparse data (e.g., documents under the bag-of-words model) or data of at most a few hundred dimensions (e.g., small images). We show that in those cases various versions of spectral clustering, such as the Ng-Jordan-Weiss algorithm (NIPS 2001), Normalized Cut (Shi and Malik, 2000), and Diffusion Maps (Coifman and Lafon, 2006), can be implemented solely based on three kinds of efficient operations on the data matrix: elementwise manipulation, matrix-vector multiplication, and low-rank SVD, thus eliminating the need to compute the weight matrix. For general similarity and any kind of data, we present a landmark-based technique that first converts the given data (or a landmark set selected from them) to a collection of "documents" and then applies to them the scalable implementation of spectral clustering with cosine similarity. Our algorithm is simple to implement and fast to run, with additional benefits such as a naturally embedded outlier-removal step. We conduct extensive experiments comparing our algorithm with several existing methods to demonstrate its superior performance.

Introduction

Owing to the breakthroughs made at the beginning of this century [10], [12], [16], spectral clustering has become a very popular clustering approach in the machine learning and data mining communities. Given data points $x_1, \ldots, x_n \in \mathbb{R}^d$ to be grouped into k clusters, the fundamental idea of spectral clustering is to construct a pairwise similarity matrix $W \in \mathbb{R}^{n \times n}$ and use its top eigenvectors to form a low dimensional embedding in which clusters are tight and well separated (so that simple methods like k-means can be employed). Though conceptually quite simple, spectral clustering easily adapts to nonconvex geometries and separates nonintersecting shapes. As a result, it has been successfully applied to many practical tasks, such as image segmentation, document clustering, and network partitioning, often significantly outperforming traditional methods. Furthermore, spectral clustering has a very rich theory and is intimately related to several other fields, such as random walks [10], [14] and graph embedding/partitioning [2], [13], [16], [20], [21].
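As a concrete reference point, the following is a minimal sketch of this classical pipeline in Python. The Gaussian kernel and its bandwidth sigma are illustrative assumptions, and the dense n × n matrix W makes this exactly the kind of implementation whose cost the present paper seeks to avoid.

```python
import numpy as np
from sklearn.cluster import KMeans

def plain_spectral_clustering(X, k, sigma=1.0):
    """Generic (non-scalable) spectral clustering with a Gaussian kernel."""
    # Pairwise similarity matrix W with zero diagonal: the O(n^2) bottleneck.
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    W = np.exp(-D2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Symmetrically normalized matrix D^{-1/2} W D^{-1/2} (NJW-style).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Top-k eigenvectors give the low dimensional embedding.
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    U = vecs[:, -k:]
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # NJW row normalization
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```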

Despite its remarkable success, spectral clustering is well known to suffer from extensive computational cost (owing to the n × n matrix W). Consequently, there has been considerable effort in the literature to develop fast, approximate algorithms that make spectral clustering scalable to large data sets [3], [4], [5], [6], [8], [11], [15], [17], [18], [22]. Interestingly, the majority of them use a landmark-based sampling strategy to reduce the complexity. According to their motivation and focus, these methods can be roughly divided into the following categories:

  • Nyström approximation methods [6], [8], [19]. These methods focus on the eigendecomposition of the weight matrix W and utilize advanced linear algebra to estimate the eigenvectors of W from a reduced version obtained through row and/or column sampling (see the sketch after this list).

  • Data reduction methods [15], [17], [18], [22]. These methods start by reducing the input data to a small subset of data representatives and then use spectral clustering to partition the representative set and correspondingly infer the labels of the full data set.

  • Sparse representation methods [3]. Such methods utilize the recent progress made in sparse coding by selecting a small subset of landmark points to sparsely represent the remaining data points. Afterwards, spectral clustering is applied with the landmark-based sparse representations.

The first two classes of methods use only a reduced data set for eigenvector calculation or clustering, so some information may be lost in the process. In contrast, the third class of methods does use all the data for clustering; however, it is very expensive, as it requires solving a sparse coding problem for each original data point.
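For concreteness, the following is a minimal sketch of the Nyström extension underlying the first category, with uniform landmark sampling as an illustrative choice; the cited methods differ in how they sample and how they stabilize the extension.

```python
import numpy as np

def nystrom_eigenvectors(X, m, k, kernel, seed=0):
    """Approximate the top-k eigenvectors of W by the Nystrom extension.

    `kernel(A, B)` must return the similarity matrix between rows of A
    and rows of B; uniform sampling of m landmarks is an illustrative
    choice, not that of any specific cited method.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = kernel(X, X[idx])               # n x m block of W
    lam, U_m = np.linalg.eigh(C[idx])   # eigendecompose the m x m block
    lam, U_m = lam[-k:], U_m[:, -k:]    # top k (eigh sorts ascending)
    # Extension step: extrapolate landmark eigenvectors to all n points
    # (assumes the top-k eigenvalues are well separated from zero).
    return C @ (U_m / lam)
```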

In this paper we build upon our recent work [4], [5] to present a general scalable spectral clustering framework that uses the full data set $X \in \mathbb{R}^{n \times d}$ for clustering but cleverly avoids computing the n × n weight matrix. We start by considering a special setting in which (1) X is large in size n but has some sort of low dimensional structure, either a moderate dimension d (e.g., a collection of small images) or sparsity (e.g., a document-term matrix), and (2) it is appropriate to use the cosine similarity. We show that in such a setting one can perform fast spectral clustering solely based on three kinds of efficient operations on the data matrix X: elementwise manipulation, matrix-vector multiplication, and low-rank SVD.
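To illustrate what such an implementation can look like, here is a simplified NJW-style sketch; it is not the exact algorithm of [4], [5]. In particular, it treats the diagonal correction arising from the zero-diagonal weight matrix $W = \tilde{X}\tilde{X}^T - I$ as negligible once low-degree points have been removed, and it assumes nonnegative data (so that all degrees are positive).

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def scalable_njw_cosine(X, k, alpha=0.01):
    """NJW clustering with cosine similarity, without forming W."""
    n = X.shape[0]
    # Elementwise: scale rows to unit length, so that
    # W = X_tilde @ X_tilde.T - I contains the cosine similarities.
    X_tilde = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Matrix-vector products: degrees d = W @ 1 without forming W.
    d = X_tilde @ (X_tilde.T @ np.ones(n)) - 1.0
    # Embedded outlier removal: drop the alpha fraction of points with
    # the lowest degrees (they are also the points for which ignoring
    # the diagonal correction is least justified).
    keep = d > np.quantile(d, alpha)
    A = X_tilde[keep] / np.sqrt(d[keep])[:, None]
    # Rank-k SVD: the left singular vectors of A = D^{-1/2} X_tilde
    # approximate the top eigenvectors of D^{-1/2} W D^{-1/2}.
    U, _, _ = svds(A, k=k)
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # NJW row normalization
    labels = np.full(n, -1)                        # outliers keep label -1
    labels[keep] = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```

For sparse X the same steps carry over with scipy.sparse matrices; only the row scaling needs a sparse-aware implementation.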

For a scalable implementation of spectral clustering with general similarity, we also use a small subset of landmark points selected from the given data and compute the similarities between each data point and its closest landmark points. However, unlike previous approaches [3], [4], [5], we interpret the sparse similarity matrix between the data and the landmarks as a document-term frequency matrix, either by (1) regarding the given data as "documents" and the landmarks as "terms", or by (2) regarding the landmarks as "documents" and the given data as "terms". We then apply the fast implementation of spectral clustering with the cosine similarity to the "documents" in each model to obtain a clustering of the given data (for the landmark documents model, a classification step is needed after the landmarks have been clustered).
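A rough sketch of the first model ("data as documents") is given below. The landmark selection by k-means centers, the Gaussian kernel, and the parameter names (m landmarks, s closest landmarks, bandwidth sigma) are illustrative choices, not fixed by the framework.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def data_as_documents(X, m, s, sigma):
    """Convert data points to sparse 'documents' over m landmark 'terms'."""
    # Landmark selection: k-means centers are one common choice
    # (uniform random sampling also works).
    landmarks = KMeans(n_clusters=m, n_init=3).fit(X).cluster_centers_
    B = np.exp(-pairwise_distances(X, landmarks)**2 / (2 * sigma**2))
    # Keep only each point's s largest landmark similarities, giving a
    # sparse n x m matrix that plays the role of a document-term matrix.
    cutoff = np.partition(B, -s, axis=1)[:, -s, None]
    B[B < cutoff] = 0.0
    return B

# The rows of B can then be clustered with the cosine-similarity routine,
# e.g. labels = scalable_njw_cosine(data_as_documents(X, 500, 5, 1.0), k).
```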

Our proposed method has many advantages:

  • Our work provides a unified framework for scalable implementations of various spectral clustering algorithms with arbitrary similarity functions.

  • Our methodology is based on novel document models for the similarity matrix between the given data and the selected landmark set.

  • Our algorithm is simple to implement and runs very fast as it utilizes very efficient matrix operations. In fact, both the computational complexity and memory requirement are linear in the size of the data.

  • Our implementation of spectral clustering is naturally combined with an outlier-removal step, which enhances the robustness and accuracy of the overall procedure.

The rest of the paper is organized as follows. In Section 2 we review three variations of spectral clustering. We then present in Section 3 a unified scalable spectral clustering framework, first in the special setting of cosine similarity (Section 3.1) and then in the setting of general similarity (Section 3.2). Experiments are conducted in Section 4 to compare our algorithms against several competitors. Finally, in Section 5, we draw some conclusions while pointing out a future direction.

Notation. Vectors are denoted by boldface lowercase letters (e.g., a, b). The ith element of a is written as ai or a(i). We denote the all-ones column vector by 1, with its dimension implied by the context.

Matrices are denoted by boldface uppercase letters (e.g., A, B). The (i, j) entry of A is denoted by aij or A(i, j). The ith row of A is denoted by A(i, :), while the jth column is written as A(:, j). We use I to denote the identity matrix (with its dimension implied by the context).


Review of spectral clustering

Spectral clustering refers to a family of clustering algorithms that utilize the spectral decomposition of a graph Laplacian matrix constructed on the input data [9]. Despite their differences in motivation and execution, they all consist of the following three steps:

  • 1. Construction of a similarity matrix. The first step of spectral clustering is to use a function $\kappa: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ to quantify the similarity between each pair of points:

$$w_{ij} = \begin{cases} \kappa(x_i, x_j), & i \neq j; \\ 0, & i = j. \end{cases}$$

This naturally induces a weighted graph on the data points.
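In code, this first step amounts to the following minimal sketch, where the kernel κ is passed in as a function; the cosine similarity at the end is just one possible choice. The double loop makes the O(n²) cost of forming W explicit.

```python
import numpy as np

def similarity_matrix(X, kappa):
    """Build W with w_ij = kappa(x_i, x_j) for i != j and w_ii = 0."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):       # W is symmetric with zero diagonal
            W[i, j] = W[j, i] = kappa(X[i], X[j])
    return W

# Example kernel: cosine similarity.
cosine = lambda x, y: (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```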

Methodology

To present our scalable implementation techniques for spectral clustering, we first assume the special setting of cosine similarity. Afterwards, we introduce a landmark-based document model to handle general similarity functions. We focus on the NJW algorithm in the exposition, as the ideas extend easily to NCut and DM(t).

Results

We conduct extensive experiments to examine the performance of our proposed algorithms (Algorithms 1 and 2). First, in Section 4.1 we compare the plain and scalable implementations of the three spectral clustering algorithms, namely NJW, NCut, and DM(t), with the cosine similarity for clustering documents and images. Next, in Section 4.2, we compare Algorithm 2 with three existing scalable methods: KASP [22], cSPEC [19], and LSC [3]. For all algorithms, we removed α = 1% of the data as outliers

Conclusions and future work

We presented a novel scalable spectral clustering framework that can handle any kind of similarity. Our approach uses a landmark-sampling technique to convert the given data (or the landmarks) to a collection of documents so that we can use the cosine similarity to group them. Furthermore, we developed a unified scalable computing procedure for three kinds of spectral clustering algorithms with the cosine similarity, based on very efficient matrix operations. We conducted extensive experiments to demonstrate the superior performance of our approach.

Conflict of interest

None.

Acknowledgments

We thank the referees for careful review and helpful feedback. G. Chen was supported by the Simons Foundation Collaboration Grant for Mathematicians while conducting this research.

References (22)

  • G. Chen, Scalable spectral clustering with cosine similarity, in: Proceedings of the 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018.