research-article

Distributed approximate spectral clustering for large-scale datasets

Authors:

Mohamed Hefeeda,

Wael Abd-Almageed

HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

Pages 223 - 234

https://doi.org/10.1145/2287076.2287111

Published: 18 June 2012

Abstract

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process with many current machine learning algorithms because of the algorithms' high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to process very large-scale datasets efficiently. While important in many applications, current kernel-based algorithms suffer from a scalability problem: they require a kernel matrix that, for N data points, takes O(N²) time and space to compute and store. The proposed algorithm substantially reduces the computation and memory overhead of computing the kernel matrix without significantly affecting the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy against the required computing resources. The algorithm is independent of the kernel-based machine learning algorithm applied afterwards, and thus can be used with many of them. To illustrate the effect of the approximation, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
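To make the O(N²) cost and the idea of a controllable approximation concrete, here is a minimal sketch. The abstract does not say how the kernel matrix is approximated, so the bucketing rule below (random-hyperplane sign hashing) is an assumption chosen only for illustration, and it should not be read as the authors' algorithm. All function names and the `num_bits` parameter are hypothetical. The sketch groups similar points into buckets and computes one small Gaussian kernel matrix per bucket instead of the full N x N matrix.

```python
import math
import random
from collections import defaultdict


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sign_signature(point, hyperplanes):
    """Assumed bucketing rule: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket_points(points, num_bits, seed=0):
    """Group point indices by signature; more bits tends to give more, smaller buckets."""
    rng = random.Random(seed)
    dim = len(points[0])
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[sign_signature(p, hyperplanes)].append(i)
    return dict(buckets)


def block_kernel(points, buckets, sigma=1.0):
    """Compute one kernel sub-matrix per bucket; cross-bucket similarities are dropped."""
    return {
        sig: [[gaussian_kernel(points[i], points[j], sigma) for j in idx] for i in idx]
        for sig, idx in buckets.items()
    }


def stored_entries(buckets):
    """Number of kernel entries kept: the sum of squared bucket sizes."""
    return sum(len(idx) ** 2 for idx in buckets.values())


if __name__ == "__main__":
    rng = random.Random(42)
    pts = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
    for bits in (1, 2, 4):
        b = bucket_points(pts, num_bits=bits)
        print(bits, "bits:", stored_entries(b), "entries vs full", len(pts) ** 2)
```

Because the sum of squared bucket sizes can never exceed N², the per-bucket matrices never need more storage than the full kernel matrix, and in this sketch `num_bits` plays the role of the accuracy-versus-resources knob. Each bucket can also be processed independently, which is the property that makes this kind of scheme a natural fit for a MapReduce-style job.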



Index Terms

Distributed approximate spectral clustering for large-scale datasets
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction


Information

Published In

HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

June 2012

308 pages

ISBN:9781450308052

DOI:10.1145/2287076

General Chair:
Dick Epema, Delft University of Technology and Eindhoven University of Technology, The Netherlands

Program Chairs:
Thilo Kielmann, Vrije Universiteit, The Netherlands
Matei Ripeanu, The University of British Columbia, Canada

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC'12

Sponsor:

University of Arizona
SIGARCH

HPDC'12: The 21st International Symposium on High-Performance Parallel and Distributed Computing

June 18 - 22, 2012

Delft, The Netherlands

Acceptance Rates

HPDC '12 Paper Acceptance Rate 23 of 143 submissions, 16%;

Overall Acceptance Rate 166 of 966 submissions, 17%

Bibliometrics

Total Citations: 8
Total Downloads: 746
Downloads (last 12 months): 5
Downloads (last 6 weeks): 1

Reflects downloads up to 15 Feb 2025

Citations

Cited By

Branković S, Smiljković L, Obradović P, Radonjiić M, Mišić M (2025). Fast Parallel CPU-GPU Approximate Spectral Clustering for Transcriptomics Data. International Journal of Parallel Programming, 53:1. Online publication date: 30-Jan-2025.
https://doi.org/10.1007/s10766-025-00783-6
Bhatt M, Shende P (2023). Advancement in Machine Learning: A Strategic Lookout from Cancer Identification to Treatment. Archives of Computational Methods in Engineering, 30:4 (2777-2792). Online publication date: 20-Jan-2023.
https://doi.org/10.1007/s11831-023-09886-0
Yan D, Wang Y, Wang J, Wu G, Wang H (2021). Fast Communication-Efficient Spectral Clustering over Distributed Data. IEEE Transactions on Big Data, 7:1 (158-168). Online publication date: 1-Mar-2021.
https://doi.org/10.1109/TBDATA.2019.2907985
Benmounah Z, Meshoul S, Batouche M, Lio’ P (2018). Parallel swarm intelligence strategies for large-scale clustering based on MapReduce with application to epigenetics of aging. Applied Soft Computing, 69 (771-783). Online publication date: Aug-2018.
https://doi.org/10.1016/j.asoc.2018.04.012
Zhou L, Pan S, Wang J, Vasilakos A (2017). Machine learning on big data. Neurocomputing, 237:C (350-361). Online publication date: 10-May-2017.
https://dl.acm.org/doi/10.1016/j.neucom.2017.01.026
Morisi R, Gnecco G, Bemporad A (2016). A hierarchical consensus method for the approximation of the consensus state, based on clustering and spectral graph theory. Engineering Applications of Artificial Intelligence, 56:C (157-174). Online publication date: 1-Nov-2016.
https://dl.acm.org/doi/10.1016/j.engappai.2016.08.018
Allende-Cid H, Moraga C, Allende H, Monge R (2015). Regression from Distributed Data Sources Using Discrete Neighborhood Representations and Modified Stalked Generalization Models. Intelligent Distributed Computing VIII (249-258). Online publication date: 2015.
https://doi.org/10.1007/978-3-319-10422-5_27
Huang Y, Dong H, Yesha Y, Zhou S, Reed D, Sun X, Foster I (2014). A scalable system for community discovery in Twitter during Hurricane Sandy. Proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (893-899). Online publication date: 26-May-2014.
https://dl.acm.org/doi/10.1109/CCGrid.2014.122
