research-article

Scalable Clustering by Iterative Partitioning and Point Attractor Representation

Published: 20 July 2016

Abstract

Clustering very large datasets while preserving cluster quality remains a challenging data-mining task. In this paper, we propose an effective, scalable clustering algorithm for large datasets that builds upon the concept of synchronization. The proposed algorithm, CIPA (Clustering by Iterative Partitioning and Point Attractor Representations), handles very large datasets by iteratively partitioning them into thousands of subsets and clustering each subset separately. Using dynamic clustering by synchronization, each subset is then represented by a set of point attractors and outliers. Finally, CIPA identifies the cluster structure of the original dataset by clustering the newly generated dataset consisting of the point attractors and outliers from all subsets. We demonstrate that our new scalable clustering approach has several attractive benefits: (a) CIPA faithfully captures the cluster structure of the original data by iteratively clustering each data subset separately instead of relying on sampling or statistical summarization techniques. (b) It clusters very large datasets efficiently with high cluster quality. (c) CIPA is parallelizable and also suitable for distributed data. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
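The pipeline described in the abstract (partition, cluster each subset by synchronization, keep point attractors and outliers, then cluster those representatives) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract gives no formulas, so the Kuramoto-style update, the greedy grouping step, the single round of partitioning, and every name below (`synchronize`, `group`, `summarize`, `cipa_sketch`, `eps`) are assumptions made for illustration. The sketch also assumes the data are scaled so that coordinate differences within `eps` are small.

```python
import math
import random

def synchronize(points, eps, max_iter=50, tol=1e-6):
    """Assumed Kuramoto-style update: each point moves toward its eps-neighbours."""
    pts = [list(p) for p in points]
    for _ in range(max_iter):
        moved, new_pts = 0.0, []
        for p in pts:
            nbrs = [q for q in pts if math.dist(p, q) <= eps]  # includes p itself
            new_p = [pi + sum(math.sin(q[d] - pi) for q in nbrs) / len(nbrs)
                     for d, pi in enumerate(p)]
            moved = max(moved, math.dist(p, new_p))
            new_pts.append(new_p)
        pts = new_pts
        if moved < tol:  # stop once no point moves noticeably
            break
    return pts

def group(points, radius):
    """Greedy grouping: a point joins the first representative within radius."""
    reps, labels = [], []
    for p in points:
        for k, r in enumerate(reps):
            if math.dist(p, r) <= radius:
                labels.append(k)
                break
        else:
            reps.append(p)
            labels.append(len(reps) - 1)
    return reps, labels

def summarize(subset, eps):
    """Represent one subset by its point attractors plus its outliers."""
    reps, labels = group(synchronize(subset, eps), eps / 2)
    sizes = [labels.count(k) for k in range(len(reps))]
    attractors = [r for r, s in zip(reps, sizes) if s > 1]
    # A point that synchronizes with nobody is kept as an outlier (original coordinates).
    outliers = [list(p) for p, k in zip(subset, labels) if sizes[k] == 1]
    return attractors, outliers

def cipa_sketch(data, n_subsets, eps, seed=0):
    """Partition once, summarize each subset, then cluster the summaries."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    summary = []
    for s in range(n_subsets):
        subset = [data[i] for i in order[s::n_subsets]]
        attractors, outliers = summarize(subset, eps)
        summary += attractors + outliers
    centres, _ = group(synchronize(summary, eps), eps / 2)
    # Assign every original point to its nearest final centre.
    labels = [min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
              for p in data]
    return labels, centres
```

Calling `cipa_sketch(data, n_subsets=4, eps=0.3)` on a list of 2-D points returns one label per point and the final centres. Because each subset is summarized independently inside the loop, that step is the natural place to parallelize or to run on distributed data, which is the property the abstract highlights in (c).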

      Published In

      ACM Transactions on Knowledge Discovery from Data  Volume 11, Issue 1
      February 2017
      288 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/2974720
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 20 July 2016
      Accepted: 01 April 2016
      Revised: 01 September 2015
      Received: 01 January 2015
      Published in TKDD Volume 11, Issue 1

      Author Tags

      1. High-performance algorithm
      2. scalable clustering
      3. synchronization

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • China Postdoctoral Science Foundation
      • National Natural Science Foundation of China
      • Fundamental Research Funds for the Central Universities
      • Science-Technology Foundation for Young Scientist of SiChuan Province

      Cited By

      • (2023) A shrinking synchronization clustering algorithm based on a linear weighted Vicsek model. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 45:6 (9875-9897). DOI: 10.3233/JIFS-231817. Online publication date: 1-Jan-2023
      • (2021) Breaking the curse of dimensionality: hierarchical Bayesian network model for multi-view clustering. Annals of Mathematics and Artificial Intelligence 89:10-11 (1013-1033). DOI: 10.1007/s10472-021-09749-z. Online publication date: 1-Nov-2021
      • (2020) Genetic clustering algorithm. Russian Technological Journal 7:6 (134-150). DOI: 10.32362/2500-316X-2019-7-6-134-150. Online publication date: 10-Jan-2020
      • (2020) CSE: A Content Spreading Efficiency Based Influential Nodes Selection Method in 5G Mobile Social Networks. 2020 3rd International Conference on Information and Computer Technologies (ICICT) (475-479). DOI: 10.1109/ICICT50521.2020.00082. Online publication date: Mar-2020
      • (2019) Overlapping community detection via preferential learning model. Physica A: Statistical Mechanics and its Applications (121265). DOI: 10.1016/j.physa.2019.121265. Online publication date: Apr-2019
      • (2019) Identifying influential spreaders in large-scale networks based on evidence theory. Neurocomputing. DOI: 10.1016/j.neucom.2019.06.030. Online publication date: Jun-2019
      • (2019) Community detection based on information dynamics. Neurocomputing. DOI: 10.1016/j.neucom.2019.06.020. Online publication date: Jun-2019
      • (2019) Semantic trajectory compression via multi-resolution synchronization-based clustering. Knowledge-Based Systems. DOI: 10.1016/j.knosys.2019.03.006. Online publication date: Mar-2019
      • (2019) ProfitLeader. World Wide Web 22:2 (533-553). DOI: 10.1007/s11280-018-0537-6. Online publication date: 1-Mar-2019
      • (2019) SemiSync: Semi-supervised Clustering by Synchronization. Database Systems for Advanced Applications (358-362). DOI: 10.1007/978-3-030-18590-9_45. Online publication date: 22-Apr-2019
