Skip to main content

A Scalable Spectral Clustering Algorithm Based on Landmark-Embedding and Cosine Similarity

  • Conference paper
  • First Online:
  • 1209 Accesses

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11004))

Abstract

We extend our recent work on scalable spectral clustering with cosine similarity (ICPR’18) to other kinds of similarity functions, in particular, the Gaussian RBF. In the previous work, we showed that for sparse or low-dimensional data, spectral clustering with the cosine similarity can be implemented directly through efficient operations on the data matrix such as elementwise manipulation, matrix-vector multiplication and low-rank SVD, thus completely avoiding the weight matrix. For other similarity functions, we present an embedding-based approach that uses a small set of landmark points to convert the given data into sparse feature vectors and then applies the scalable computing framework for the cosine similarity. Our algorithm is simple to implement, has clear interpretations, and naturally incorporates an outliers removal procedure. Preliminary results show that our proposed algorithm yields higher accuracy than existing scalable algorithms while running fast.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    Available at http://yann.lecun.com/exdb/mnist/.

  2. 2.

    This is similarity-based feature representation. Note that there is also work on dissimilarity representation [7, 13].

  3. 3.

    https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

  4. 4.

    Code available at http://www.cad.zju.edu.cn/home/dengcai/Data/Clustering.html.

  5. 5.

    This empirical rule is derived as \(\ell =\frac{1}{2}\cdot \sqrt{\frac{n}{k}} \cdot k = \frac{1}{2}\sqrt{nk}\), with the intuition that the value of \(\ell \) should be proportional to both the (average) cluster size and number of clusters. For the data sets in Table 1, such an \(\ell \) is always a few hundred.

References

  1. Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. In: Aggarwal, C., Zhai, C. (eds.) Mining Text Data, pp. 77–128. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4_4

    Chapter  Google Scholar 

  2. Cai, D., Chen, X.: Large scale spectral clustering via landmark-based sparse representation. IEEE Trans. Cybern. 45(8), 1669–1680 (2015)

    Article  Google Scholar 

  3. Chen, G.: Scalable spectral clustering with cosine similarity. In: Proceedings of the 24th International Conference on Pattern Recognition (ICPR), Beijing, China (2018)

    Google Scholar 

  4. Jain, S., Munos, R., Stephan, F., Zeugmann, T. (eds.): ALT 2013. LNCS (LNAI), vol. 8139. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40935-6

    Book  Google Scholar 

  5. Chung, F.R.K.: Spectral graph theory. In: CBMS Regional Conference Series in Mathematics, vol. 92. AMS (1996)

    Google Scholar 

  6. Coifman, R., Lafon, S.: Diffusion maps. Appl. Comput. Harmonic Anal. 21(1), 5–30 (2006)

    Article  MathSciNet  Google Scholar 

  7. Duin, R., Pekalska, E.: The dissimilarity space: bridging structural and statistical pattern recognition. Pattern Recogn. Lett. 33(7), 826–832 (2012)

    Article  Google Scholar 

  8. Fowlkes, C., Belongie, S., Chung, F., Malik, J.: Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Mach. Intell. 26(2), 214–225 (2004)

    Article  Google Scholar 

  9. von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)

    Article  MathSciNet  Google Scholar 

  10. Meila, M., Shi, J.: A random walks view of spectral segmentation. In: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (2001)

    Google Scholar 

  11. Moazzen, Y., Tasdemir, K.: Sampling based approximate spectral clustering ensemble for partitioning data sets. In: Proceedings of the 23rd International Conference on Pattern Recognition (2016)

    Google Scholar 

  12. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 14, 849–856 (2001)

    Google Scholar 

  13. Pekalska, E., Duin, R.: The Dissimilarity Representation for Pattern Recognition. World Scientific, Singapore (2005)

    Book  Google Scholar 

  14. Pham, K., Chen, G.: Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs. In: Proceedings of the 12th Workshop on Graph-based Natural Language Processing (TextGraphs-2012), pp. 28–37. Association for Computational Linguistics (2018)

    Google Scholar 

  15. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)

    Article  Google Scholar 

  16. Tasdemir, K.: Vector quantization based approximate spectral clustering of large datasets. Pattern Recogn. 45(8), 3034–3044 (2012)

    Article  Google Scholar 

  17. Wang, L., Leckie, C., Kotagiri, R., Bezdek, J.: Approximate pairwise clustering for large data sets via sampling plus extension. Pattern Recogn. 44, 222–235 (2011)

    Article  Google Scholar 

  18. Wang, L., Leckie, C., Ramamohanarao, K., Bezdek, J.: Approximate spectral clustering. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS (LNAI), vol. 5476, pp. 134–146. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01307-2_15

    Chapter  Google Scholar 

  19. Yan, D., Huang, L., Jordan, M.: Fast approximate spectral clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 907–916 (2009)

    Google Scholar 

Download references

Acknowledgments

We thank the anonymous reviewers for their helpful feedback. This work was motivated by a project sponsored by Verizon Wireless, which had the goal of grouping customers based on similar profile characteristics. G. Chen was supported by the Simons Foundation Collaboration Grant for Mathematicians.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Guangliang Chen .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Chen, G. (2018). A Scalable Spectral Clustering Algorithm Based on Landmark-Embedding and Cosine Similarity. In: Bai, X., Hancock, E., Ho, T., Wilson, R., Biggio, B., Robles-Kelly, A. (eds) Structural, Syntactic, and Statistical Pattern Recognition. S+SSPR 2018. Lecture Notes in Computer Science(), vol 11004. Springer, Cham. https://doi.org/10.1007/978-3-319-97785-0_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-97785-0_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-97784-3

  • Online ISBN: 978-3-319-97785-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics