Pattern Recognition Letters

Volume 125, 1 July 2019, Pages 488-493

A general framework for scalable spectral clustering based on document models

https://doi.org/10.1016/j.patrec.2019.06.010

Abstract

We propose a new, general framework for fast, approximate spectral clustering on large data sets. We first consider the special setting of cosine similarity for clustering sparse data (e.g., documents under the bag-of-words model) or data of at most a few hundred dimensions (e.g., small images). We show that in those cases various versions of spectral clustering, such as the Ng-Jordan-Weiss algorithm (NIPS 2001), Normalized Cut (Shi and Malik, 2000), and Diffusion Maps (Coifman and Lafon, 2006), can be implemented solely based on three kinds of efficient operations on the data matrix: elementwise manipulation, matrix-vector multiplication, and low-rank SVD, thus eliminating the need to compute the weight matrix. For general similarity and any kind of data, we present a landmark-based technique that first converts the given data (or a landmark set selected from them) to a collection of "documents" and then applies to them the scalable implementation of spectral clustering with cosine similarity. Our algorithm is simple to implement and fast to run, with additional benefits such as a naturally embedded outlier-removal step. We conduct extensive experiments comparing our algorithm with several existing methods to demonstrate its superior performance.

Introduction

Owing to the breakthroughs made at the beginning of this century [10], [12], [16], spectral clustering has become a very popular clustering approach in the machine learning and data mining communities. Given data points $x_1, \ldots, x_n \in \mathbb{R}^d$ to be grouped into k clusters, the fundamental idea of spectral clustering is to construct a pairwise similarity matrix $W \in \mathbb{R}^{n \times n}$ and use its top eigenvectors to form a low dimensional embedding in which clusters are tight and well separated (so that simple methods like k-means can be employed). Though conceptually quite simple, spectral clustering easily adapts to nonconvex geometries and separates nonintersecting shapes. As a result, it has been successfully applied to many practical tasks, such as image segmentation, document clustering, and network partitioning, often significantly outperforming traditional methods. Furthermore, spectral clustering has a very rich theory and is intimately related to several other fields, such as random walks [10], [14] and graph embedding/partitioning [2], [13], [16], [20], [21].
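As a concrete reference point, the following is a minimal sketch of this classical pipeline in Python. The Gaussian kernel and its bandwidth sigma are illustrative assumptions, and the dense n × n matrix W makes this exactly the kind of implementation whose cost the present paper seeks to avoid.

```python
import numpy as np
from sklearn.cluster import KMeans

def plain_spectral_clustering(X, k, sigma=1.0):
    """Generic (non-scalable) spectral clustering with a Gaussian kernel."""
    # Pairwise similarity matrix W with zero diagonal: the O(n^2) bottleneck.
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    W = np.exp(-D2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Symmetrically normalized matrix D^{-1/2} W D^{-1/2} (NJW-style).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Top-k eigenvectors give the low dimensional embedding.
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    U = vecs[:, -k:]
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # NJW row normalization
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```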

Despite its remarkable success, spectral clustering is well known to suffer from extensive computational cost (owing to the n × n matrix W). Consequently, there has been considerable effort in the literature to develop fast, approximate algorithms that make spectral clustering scalable to large data sets [3], [4], [5], [6], [8], [11], [15], [17], [18], [22]. Interestingly, the majority of them use a landmark-based sampling strategy to reduce the complexity. According to their motivation and focus, these methods can be roughly divided into the following categories:

  • Nyström approximation methods [6], [8], [19]. These methods focus on the eigendecomposition of the weight matrix W and utilize advanced linear algebra to estimate the eigenvectors of W from a reduced version obtained through row and/or column sampling (see the sketch after this list).

  • Data reduction methods [15], [17], [18], [22]. These methods start by reducing the input data to a small subset of data representatives and then use spectral clustering to partition the representative set and correspondingly infer the labels of the full data set.

  • Sparse representation methods [3]. Such methods utilize the recent progress made in sparse coding by selecting a small subset of landmark points to sparsely represent the remaining data points. Afterwards, spectral clustering is applied with the landmark-based sparse representations.

The first two classes of methods use only a reduced data set for eigenvector calculation or clustering, so some information may be lost in the process. In contrast, the third class of methods does use all the data for clustering; however, it is very expensive, as it requires solving a sparse coding problem for each original data point.
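For concreteness, the following is a minimal sketch of the Nyström extension underlying the first category, with uniform landmark sampling as an illustrative choice; the cited methods differ in how they sample and how they stabilize the extension.

```python
import numpy as np

def nystrom_eigenvectors(X, m, k, kernel, seed=0):
    """Approximate the top-k eigenvectors of W by the Nystrom extension.

    `kernel(A, B)` must return the similarity matrix between rows of A
    and rows of B; uniform sampling of m landmarks is an illustrative
    choice, not that of any specific cited method.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = kernel(X, X[idx])               # n x m block of W
    lam, U_m = np.linalg.eigh(C[idx])   # eigendecompose the m x m block
    lam, U_m = lam[-k:], U_m[:, -k:]    # top k (eigh sorts ascending)
    # Extension step: extrapolate landmark eigenvectors to all n points
    # (assumes the top-k eigenvalues are well separated from zero).
    return C @ (U_m / lam)
```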

In this paper we build upon our recent work [4], [5] to present a general scalable spectral clustering framework that uses the full data set $X \in \mathbb{R}^{n \times d}$ for clustering but cleverly avoids computing the n × n weight matrix. We start by considering a special setting in which (1) X is large in size n but has some sort of low dimensional structure, either a moderate dimension d (e.g., a collection of small images) or sparsity (e.g., a document-term matrix), and (2) it is appropriate to use the cosine similarity. We show that in such a setting one can perform fast spectral clustering solely based on three kinds of efficient operations on the data matrix X: elementwise manipulation, matrix-vector multiplication, and low-rank SVD.
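To illustrate what such an implementation can look like, here is a simplified NJW-style sketch; it is not the exact algorithm of [4], [5]. In particular, it treats the diagonal correction arising from the zero-diagonal weight matrix $W = \tilde{X}\tilde{X}^T - I$ as negligible once low-degree points have been removed, and it assumes nonnegative data (so that all degrees are positive).

```python
import numpy as np
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def scalable_njw_cosine(X, k, alpha=0.01):
    """NJW clustering with cosine similarity, without forming W."""
    n = X.shape[0]
    # Elementwise: scale rows to unit length, so that
    # W = X_tilde @ X_tilde.T - I contains the cosine similarities.
    X_tilde = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Matrix-vector products: degrees d = W @ 1 without forming W.
    d = X_tilde @ (X_tilde.T @ np.ones(n)) - 1.0
    # Embedded outlier removal: drop the alpha fraction of points with
    # the lowest degrees (they are also the points for which ignoring
    # the diagonal correction is least justified).
    keep = d > np.quantile(d, alpha)
    A = X_tilde[keep] / np.sqrt(d[keep])[:, None]
    # Rank-k SVD: the left singular vectors of A = D^{-1/2} X_tilde
    # approximate the top eigenvectors of D^{-1/2} W D^{-1/2}.
    U, _, _ = svds(A, k=k)
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # NJW row normalization
    labels = np.full(n, -1)                        # outliers keep label -1
    labels[keep] = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels
```

For sparse X the same steps carry over with scipy.sparse matrices; only the row scaling needs a sparse-aware implementation.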

For a scalable implementation of spectral clustering with general similarity, we also use a small subset of landmark points selected from the given data and compute the similarities between each data point and its closest landmark points. However, unlike previous approaches [3], [4], [5], we interpret the sparse similarity matrix between the data and the landmarks as a document-term frequency matrix, either by (1) regarding the given data as "documents" and the landmarks as "terms", or by (2) regarding the landmarks as "documents" and the given data as "terms". We then apply the fast implementation of spectral clustering with the cosine similarity to the "documents" in each model to obtain a clustering of the given data (for the landmark documents model, a classification step is needed after the landmarks have been clustered).
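A rough sketch of the first model ("data as documents") is given below. The landmark selection by k-means centers, the Gaussian kernel, and the parameter names (m landmarks, s closest landmarks, bandwidth sigma) are illustrative choices, not fixed by the framework.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def data_as_documents(X, m, s, sigma):
    """Convert data points to sparse 'documents' over m landmark 'terms'."""
    # Landmark selection: k-means centers are one common choice
    # (uniform random sampling also works).
    landmarks = KMeans(n_clusters=m, n_init=3).fit(X).cluster_centers_
    B = np.exp(-pairwise_distances(X, landmarks)**2 / (2 * sigma**2))
    # Keep only each point's s largest landmark similarities, giving a
    # sparse n x m matrix that plays the role of a document-term matrix.
    cutoff = np.partition(B, -s, axis=1)[:, -s, None]
    B[B < cutoff] = 0.0
    return B

# The rows of B can then be clustered with the cosine-similarity routine,
# e.g. labels = scalable_njw_cosine(data_as_documents(X, 500, 5, 1.0), k).
```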

Our proposed method has many advantages:

  • Our work provides a unified framework for scalable implementations of various spectral clustering algorithms with arbitrary similarity functions.

  • Our methodology is based on novel document models for the similarity matrix between the given data and the selected landmark set.

  • Our algorithm is simple to implement and runs very fast as it utilizes very efficient matrix operations. In fact, both the computational complexity and memory requirement are linear in the size of the data.

  • Our implementation of spectral clustering is naturally combined with an outlier-removal step, which enhances the robustness and accuracy of the overall procedure.

The rest of the paper is organized as follows. In Section 2 we review three variations of spectral clustering. We then present in Section 3 a unified scalable spectral clustering framework, first in the special setting of cosine similarity (Section 3.1) and then in the setting of general similarity (Section 3.2). Experiments are conducted in Section 4 to compare our algorithms against several competitors. Finally, in Section 5, we draw some conclusions while pointing out a future direction.

Notation. Vectors are denoted by boldface lowercase letters (e.g., a, b). The ith element of a is written as ai or a(i). We denote the all-ones column vector by 1, with its dimension implied by the context.

Matrices are denoted by boldface uppercase letters (e.g., A, B). The (i, j) entry of A is denoted by aij or A(i, j). The ith row of A is denoted by A(i, :), while the jth column is written as A(:, j). We use I to denote the identity matrix (with its dimension implied by the context).


Review of spectral clustering

Spectral clustering refers to a family of clustering algorithms that utilize the spectral decomposition of a graph Laplacian matrix constructed on the input data [9]. Despite their differences in motivation and execution, they all consist of the following three steps:

  • 1. Construction of a similarity matrix. The first step of spectral clustering is to use a function $\kappa: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ to quantify the similarity between each pair of points:

$$w_{ij} = \begin{cases} \kappa(x_i, x_j), & i \neq j; \\ 0, & i = j. \end{cases}$$

This naturally induces a weighted graph on the data points.
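In code, this first step amounts to the following minimal sketch, where the kernel κ is passed in as a function; the cosine similarity at the end is just one possible choice. The double loop makes the O(n²) cost of forming W explicit.

```python
import numpy as np

def similarity_matrix(X, kappa):
    """Build W with w_ij = kappa(x_i, x_j) for i != j and w_ii = 0."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):       # W is symmetric with zero diagonal
            W[i, j] = W[j, i] = kappa(X[i], X[j])
    return W

# Example kernel: cosine similarity.
cosine = lambda x, y: (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```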

Methodology

To present our scalable implementation techniques for spectral clustering, we first assume the special setting of cosine similarity. Afterwards, we introduce a landmark-based document model to handle general similarity functions. We focus on the NJW algorithm in the exposition, as the ideas extend easily to NCut and DM(t).

Results

We conduct extensive experiments to examine the performance of our proposed algorithms (Algorithms 1 and 2). First, in Section 4.1 we compare the plain and scalable implementations of the three spectral clustering algorithms, namely NJW, NCut, and DM(t), with the cosine similarity for clustering documents and images. Next, in Section 4.2, we compare Algorithm 2 with three existing scalable methods: KASP [22], cSPEC [19], and LSC [3]. For all algorithms, we removed α = 1% of the data as outliers

Conclusions and future work

We presented a novel scalable spectral clustering framework that can handle any kind of similarity. Our approach uses a landmark-sampling technique to convert the given data (or the landmarks) to a collection of documents so that we can use the cosine similarity to group them. Furthermore, we developed a unified scalable computing procedure for three kinds of spectral clustering algorithms with the cosine similarity, based on very efficient matrix operations. We conducted extensive experiments to demonstrate the superior performance of our approach.

Conflict of interest

None.

Acknowledgments

We thank the referees for careful review and helpful feedback. G. Chen was supported by the Simons Foundation Collaboration Grant for Mathematicians while conducting this research.

References (22)

  • G. Chen, Scalable spectral clustering with cosine similarity, in: Proceedings of the 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018.