Distributed approximate spectral clustering for large-scale datasets

Published: 18 June 2012

Abstract

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem, as they require a kernel matrix that takes O(N^2) time and space to compute and store. The proposed algorithm yields a substantial reduction in the computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed to be independent of the kernel-based machine learning algorithm that is subsequently used, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
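
The abstract does not spell out the approximation itself, so the following is only a minimal sketch of the general idea of avoiding the full O(N^2) kernel matrix. It assumes, hypothetically, that points are grouped by random-hyperplane signatures and that Gaussian kernel values are computed only for pairs within the same group. The function names, the `num_bits` parameter (standing in for the controllable level of approximation), and the choice of a Gaussian kernel are illustrative assumptions, not the paper's confirmed design.

```python
import math
import random
from collections import defaultdict


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def full_kernel_entries(n):
    """Number of entries in the exact N x N kernel matrix."""
    return n * n


def bucket_points(points, num_bits, seed=0):
    """Group point indices by a signature built from random hyperplanes.

    Assumption for illustration: each of the num_bits hyperplanes
    contributes one bit (which side of the plane the point lies on).
    More bits give smaller buckets and a coarser approximation.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        signature = tuple(
            int(sum(w * v for w, v in zip(plane, point)) >= 0.0)
            for plane in planes
        )
        buckets[signature].append(idx)
    return buckets


def approximate_kernel(points, num_bits, sigma=1.0, seed=0):
    """Compute kernel values only for pairs that share a bucket.

    Returns a sparse dict {(i, j): value}; pairs in different buckets
    are simply absent, i.e. treated as having zero similarity.
    """
    buckets = bucket_points(points, num_bits, seed)
    entries = {}
    for members in buckets.values():
        for i in members:
            for j in members:
                entries[(i, j)] = gaussian_kernel(points[i], points[j], sigma)
    return entries


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(200)]
    sparse = approximate_kernel(data, num_bits=4)
    print("exact entries:", full_kernel_entries(len(data)))
    print("approximate entries:", len(sparse))
```

With `num_bits=0` every point falls into one bucket and the sketch degenerates to the exact N x N computation; increasing `num_bits` reduces the number of stored entries at the cost of discarding cross-bucket similarities.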

      Published In

      HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
      June 2012
      308 pages
      ISBN:9781450308052
      DOI:10.1145/2287076
      General Chair: Dick Epema
      Program Chairs: Thilo Kielmann, Matei Ripeanu
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. distributed clustering
      2. kernel-based algorithms
      3. large data sets
      4. spectral clustering

      Qualifiers

      • Research-article

      Conference

      HPDC'12
      Acceptance Rates

      HPDC '12 Paper Acceptance Rate 23 of 143 submissions, 16%;
      Overall Acceptance Rate 166 of 966 submissions, 17%

      Cited By

      • (2025) Fast Parallel CPU-GPU Approximate Spectral Clustering for Transcriptomics Data. International Journal of Parallel Programming, 53:1. DOI: 10.1007/s10766-025-00783-6. Online publication date: 30-Jan-2025
      • (2023) Advancement in Machine Learning: A Strategic Lookout from Cancer Identification to Treatment. Archives of Computational Methods in Engineering, 30:4 (2777-2792). DOI: 10.1007/s11831-023-09886-0. Online publication date: 20-Jan-2023
      • (2021) Fast Communication-Efficient Spectral Clustering over Distributed Data. IEEE Transactions on Big Data, 7:1 (158-168). DOI: 10.1109/TBDATA.2019.2907985. Online publication date: 1-Mar-2021
      • (2018) Parallel swarm intelligence strategies for large-scale clustering based on MapReduce with application to epigenetics of aging. Applied Soft Computing, 69 (771-783). DOI: 10.1016/j.asoc.2018.04.012. Online publication date: Aug-2018
      • (2017) Machine learning on big data. Neurocomputing, 237:C (350-361). DOI: 10.1016/j.neucom.2017.01.026. Online publication date: 10-May-2017
      • (2016) A hierarchical consensus method for the approximation of the consensus state, based on clustering and spectral graph theory. Engineering Applications of Artificial Intelligence, 56:C (157-174). DOI: 10.1016/j.engappai.2016.08.018. Online publication date: 1-Nov-2016
      • (2015) Regression from Distributed Data Sources Using Discrete Neighborhood Representations and Modified Stalked Generalization Models. Intelligent Distributed Computing VIII (249-258). DOI: 10.1007/978-3-319-10422-5_27. Online publication date: 2015
      • (2014) A scalable system for community discovery in Twitter during Hurricane Sandy. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (893-899). DOI: 10.1109/CCGrid.2014.122. Online publication date: 26-May-2014
