Scalable Clustering by Iterative Partitioning and Point Attractor Representation

Authors:

Bertil Schmidt,

Stefan Kramer

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 11, Issue 1

Article No.: 5, Pages 1 - 23

https://doi.org/10.1145/2934688

Published: 20 July 2016

Abstract

Clustering very large datasets while preserving cluster quality remains a challenging data-mining task. In this paper, we propose an effective, scalable clustering algorithm for large datasets that builds upon the concept of synchronization. The proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), handles very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the cluster structure of the original dataset by clustering the newly generated dataset consisting of the point attractors and outliers from all subsets. We demonstrate that our new scalable clustering approach has several attractive benefits: (a) CIPA faithfully captures the cluster structure of the original data by clustering each subset of the data iteratively instead of using any sampling or statistical summarization technique. (b) It clusters very large datasets efficiently with high cluster quality. (c) CIPA is parallelizable and also suitable for distributed data. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
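
The partition, summarize, and re-cluster pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the interaction model, so the sketch assumes a Kuramoto-style sine coupling within an eps-neighbourhood, and the function names (`synchronize`, `summarize`, `cipa_sketch`), the parameters (`eps`, `n_subsets`, `tol`, `merge_tol`), and the single partitioning level are all illustrative assumptions.

```python
import math
import random


def synchronize(points, eps, max_iter=50, tol=1e-6):
    """Assumed update rule: every point moves toward its eps-neighbours
    with a sine coupling until no point moves more than tol.
    eps should stay below pi/2 so the coupling is always attractive."""
    pts = [list(p) for p in points]
    for _ in range(max_iter):
        new_pts, shift = [], 0.0
        for p in pts:
            # The neighbourhood includes p itself, so it is never empty.
            nbrs = [q for q in pts if math.dist(p, q) <= eps]
            moved = [
                x + sum(math.sin(q[d] - x) for q in nbrs) / len(nbrs)
                for d, x in enumerate(p)
            ]
            shift = max(shift, math.dist(p, moved))
            new_pts.append(moved)
        pts = new_pts
        if shift < tol:
            break
    return pts


def summarize(points, eps, merge_tol=1e-3):
    """Represent a subset by (position, count) pairs. A count above 1 is a
    point attractor; a count of 1 is a point that attracted nothing,
    i.e. an outlier kept at its own position."""
    groups = []  # each entry: [position, member_count]
    for s in synchronize(points, eps):
        for g in groups:
            if math.dist(g[0], s) < merge_tol:
                g[1] += 1
                break
        else:
            groups.append([s, 1])
    return [(tuple(pos), count) for pos, count in groups]


def cipa_sketch(data, n_subsets, eps, seed=0):
    """One partitioning level: split the data at random, summarize each
    subset, then cluster the pooled representatives."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    reps = []
    for subset in subsets:
        # Member counts are ignored here; a weighted variant could use them.
        reps.extend(pos for pos, _ in summarize(subset, eps))
    return summarize(reps, eps)
```

Calling `cipa_sketch(data, n_subsets=4, eps=1.0)` on a list of coordinate tuples returns the final representatives as `(position, count)` pairs, where `count` is the number of subset-level representatives that merged. The neighbour search here is quadratic in the subset size, so small subsets keep each call cheap, and because the subsets are summarized independently, the loop over `subsets` is the natural place to parallelize.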

Index Terms

Scalable Clustering by Iterative Partitioning and Point Attractor Representation
1. Information systems
  1. Information retrieval
  2. Information systems applications

Published In

ACM Transactions on Knowledge Discovery from Data Volume 11, Issue 1

February 2017

288 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/2974720

Editor:
Philip S. Yu
University of Illinois at Chicago, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 July 2016

Accepted: 01 April 2016

Revised: 01 September 2015

Received: 01 January 2015

Published in TKDD Volume 11, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

China Postdoctoral Science Foundation
National Natural Science Foundation of China
Fundamental Research Funds for the Central Universities
Science-Technology Foundation for Young Scientist of SiChuan Province
