
Pattern Recognition

Volume 136, April 2023, 109230

Scalable clustering by aggregating representatives in hierarchical groups

https://doi.org/10.1016/j.patcog.2022.109230

Highlights

  • A scalable hierarchical clustering framework for discrete data segments is proposed.

  • Multiple indices are proposed to effectively spot the most representative node in each sub-MST.

  • Swapping data points in overlap areas between pairwise sub-MSTs ensures the accuracy of the hierarchical clustering across multiple data segments.

  • The proposed model provides a good balance between accuracy and scalability of hierarchical clustering.

Abstract

Appropriately handling the scalability of clustering is a long-standing challenge in the study of clustering techniques and is of fundamental interest to researchers in the data mining and knowledge discovery community. Compared with other clustering methods, hierarchical clustering offers better interpretability of clustering results but poor scalability on large-scale data, so more comprehensive studies of this problem are needed. This paper develops a new scalable hierarchical clustering model called Election Tree, which detects the most representative point of each sub-cluster via a node-election process on split data segments and adjusts sub-cluster membership through node merging and swapping. Extensive experiments on real-world datasets reveal that the proposed computational framework achieves better clustering accuracy than the competing baseline methods. Meanwhile, the scalability tests on incrementally sized synthetic datasets show that the new model has significantly lower time consumption than state-of-the-art hierarchical clustering models such as PERCH, GRINCH, and SCC, as well as other classic baselines.

Introduction

Clustering algorithms can be divided into many types based on their internal mechanisms, such as partition-based algorithms [1], [2], density-based algorithms [3], [4], spectral clustering [5], [6], affinity propagation algorithms [7], [8], and hierarchical algorithms [9], [10]. Among these, hierarchical clustering is often preferred over the others in scenario-aware data mining and analysis [11], [12] because of the high interpretability of its clustering results [13], [14]. Hierarchical clustering algorithms have therefore been widely applied in a variety of scientific areas, e.g., computational biology [15], communication engineering [16], complex networked systems [17], and environmental sciences [18].

With the advance of technologies such as 5G, smart terminals, and Industry 4.0, the scale of data has been growing dramatically, bringing new challenges to hierarchical clustering methods. To analyze large-scale data and discover practical knowledge therein, much recent research has focused on scalable hierarchical clustering algorithms. Efforts to improve the scalability of hierarchical clustering have been made mainly in two directions: (i) reducing the cost of pairwise distance calculations between data points [19], [20]; (ii) constructing the cluster tree more efficiently [21], [22]. Moreover, parallel frameworks and distributed technology have also been applied to enhance the scalability of existing algorithms [23] without addressing the algorithms' fundamental time complexity. However, clustering accuracy and scalability tend to trade off against each other. Improvements in scalability typically rely on approximations such as sampling [19], localized optimization strategies [20], or stepwise hybrid strategies [24], [25], which degrade accuracy; gains in clustering accuracy, in turn, require cumbersome optimization processes [22], [26] that incur additional space and time costs. As a result, balancing the empirical performance and scalability of hierarchical clustering remains a long-term challenge [27].

To address the stated issue, our contributions in this paper are as follows.

(1) To better handle large-scale data, we develop a divide-and-conquer framework called Election Tree to implement Reciprocal-nearest-neighbor Supported Clustering (RSC) [28] (Section 4.1).

(2) We formalize RSC as a graph problem and devise a hybrid strategy that selects representative nodes based on structural and spatial-position features, which reduces the extra memory consumption of finding the representatives [28], [29] (Section 4.3).

(3) We propose an optimization approach that performs cost-effective re-arrangements of the tree structure, improving clustering accuracy at a low time cost (Section 4.4).

(4) Through experiments on real-world and synthetic datasets, we empirically verify that the proposed algorithm has overall advantages in clustering accuracy (measured by Rand Index) and scalability compared with the baselines (Section 6).
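Contribution (1) builds on reciprocal-nearest-neighbor (RNN) pairs, the primitive behind RSC [28]: two points form an RNN pair when each is the other's nearest neighbor. As a minimal sketch (the Euclidean metric and helper name are our assumptions, not the paper's implementation):

```python
import numpy as np

def reciprocal_nn_pairs(X):
    """Find all reciprocal-nearest-neighbor (RNN) pairs in X.

    Two points form an RNN pair when each is the other's nearest
    neighbor under Euclidean distance.
    """
    # Pairwise squared distances with the diagonal masked out.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # index of each point's nearest neighbor
    # The i < j condition avoids reporting each pair twice.
    return [(i, int(j)) for i, j in enumerate(nn) if nn[j] == i and i < j]

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(reciprocal_nn_pairs(pts))  # [(0, 1), (2, 3)]
```

Merging each RNN pair into a single representative node is the basic aggregation step that the election process described later builds upon.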

Section snippets

Related work

Linkage-based algorithms. Classical hierarchical clustering algorithms, such as Group Average (GA) [30] and the Nearest Neighborhood method [31], can ensure the theoretical soundness of the clustering process. These algorithms apply optimization theory to obtain globally optimized clustering results but usually scale poorly to large datasets. With the aim of improving scalability, more advanced algorithms were proposed using different approximation

Term definition

Before describing the proposed algorithm in detail, we first introduce several important concepts that are used frequently in the remainder of this paper:

Definition 1 Sub-Minimum-Spanning-Trees (sub-MST for short)

A sub-MST is a subtree of the minimum spanning tree. Given a dataset $X=\{x_i\}_{i=1}^{n}$ and its corresponding minimum spanning tree $M=(V,E)$, the sub-MSTs $\{m_i=(V_i,E_i)\}_{i=1}^{k}$ meet all of the following conditions: $m_i \subseteq M$; $\bigcup_{i=1}^{k} V_i = V$; $\forall i \neq j,\ m_i \cap m_j = \emptyset$. It is important to note that sub-MSTs can be aggregated into a complete MST by using the single-linkage
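The three conditions of Definition 1 (subgraph of $M$, vertex cover of $V$, pairwise disjointness) can be checked mechanically. A sketch assuming sub-MSTs are given as explicit vertex/edge sets (the helper name is illustrative):

```python
def is_valid_sub_mst_partition(V, E, sub_msts):
    """Check the three Definition-1 conditions for sub-MSTs of M = (V, E).

    sub_msts is a list of (V_i, E_i) pairs, each given as a Python set.
    """
    vertex_sets = [Vi for Vi, _ in sub_msts]
    # (1) every sub-MST is a subgraph of M
    subgraph = all(Vi <= V and Ei <= E for Vi, Ei in sub_msts)
    # (2) the vertex sets jointly cover V
    union = set().union(*vertex_sets)
    covers = union == V
    # (3) pairwise disjoint: total size equals size of the union
    disjoint = sum(len(Vi) for Vi in vertex_sets) == len(union)
    return subgraph and covers and disjoint

V = {1, 2, 3, 4}
E = {(1, 2), (2, 3), (3, 4)}
# Splitting the path MST 1-2-3-4 at edge (2, 3) yields two valid sub-MSTs.
print(is_valid_sub_mst_partition(V, E, [({1, 2}, {(1, 2)}), ({3, 4}, {(3, 4)})]))  # True
```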

Election tree

In this section, we propose a reciprocal-nearest-neighbor-based scalable hierarchical clustering algorithm named Election Tree.

Complexity analysis

Given a dataset with n items and parameters θ and K, we analyze the time complexity of our algorithm step-by-step as follows.

(I) Segmentation: In this stage, n data points are randomly assigned to ⌈n/θ⌉ data segments, each of which has about θ items. Obviously, this is a T₁ = O(n) process.
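The segmentation stage amounts to a single shuffle-and-split pass, which is where the O(n) bound comes from. A minimal sketch (the helper name and fixed seed are our assumptions):

```python
import random

def segment(points, theta, seed=0):
    """Randomly split n points into ceil(n / theta) segments of ~theta items.

    One shuffle plus one slicing pass, i.e. O(n) overall.
    """
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    return [[points[i] for i in idx[s:s + theta]]
            for s in range(0, len(points), theta)]

segs = segment(list(range(10)), theta=4)
print(len(segs))                   # 3 segments (sizes 4, 4, 2)
print(sum(len(s) for s in segs))   # 10 — every point lands in exactly one segment
```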

(II) Election: In this stage, two parts iterate alternately. (a) In the sub-MST construction part, we generate a 1-nearest-neighbor graph for the given input. Generally, for multi-dimensional data with θ items, it
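The sub-MST construction above starts from a 1-nearest-neighbor graph, in which each point is linked to its single nearest neighbor. One common way to turn such a graph into initial sub-clusters is to take its connected components; a sketch under that assumption (Euclidean distance and the union-find bookkeeping are ours, not the paper's exact procedure):

```python
import numpy as np

def one_nn_components(X):
    """Connected components of the 1-nearest-neighbor graph of X.

    Each point gets an undirected edge to its nearest neighbor; the
    components of the resulting graph act as initial sub-clusters.
    """
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)

    # Union-find over the (i, nn[i]) edges, with path halving.
    parent = list(range(len(X)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(X))]

pts = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0], [9.0, 9.0], [9.2, 9.0]])
labels = one_nn_components(pts)
# The first three points share a component; the last two form another.
print(labels[0] == labels[1] == labels[2], labels[3] == labels[4])
```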

Experiments

Using real-world and synthetic datasets, we conducted comprehensive experimental studies to evaluate the clustering accuracy (Rand Index), CPU time, and parameter sensitivity of our algorithm against baseline methods.
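The Rand Index used as the accuracy measure counts the fraction of point pairs on which two partitions agree (both same-cluster or both different-cluster). A minimal reference implementation for clarity (O(n²) in the number of points):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand Index: fraction of point pairs on which two clusterings agree.

    A pair agrees if both clusterings place it in the same cluster, or
    both place it in different clusters.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 5 of 6 pairs agree -> 0.8333...
```

Note that the Rand Index is label-permutation invariant: relabeling clusters does not change the score.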

Conclusion

In this paper, we propose the Election Tree algorithm, which attempts to provide a solution to the problem of scalable hierarchical clustering. Unlike traditional hierarchical clustering models, our model addresses how to better determine the root of an RNN pair by using graph theory and spatial topology features. We also employ a novel Merging & Swap strategy, which properly adjusts the nodes at the boundaries between clusters and significantly improves the accuracy of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 61803073, 61703074, and the Young Scholars Development Fund of SWPU under Grant No. 202199010142.

Wen-Bo Xie received the PhD degree in technology of computer application from the University of Electronic Science and Technology of China (UESTC), in 2021. He started this work when he was a PhD candidate with the UESTC. Now he is with the School of Computer Science, Southwest Petroleum University. His research interests include the fields of machine learning, data mining and graph mining and, in particular, focus on clustering and knowledge graph.

References (45)

  • S.A. Shah et al., Robust continuous clustering, Proc. Natl. Acad. Sci. USA (2017)
  • A. Rodriguez et al., Clustering by fast search and find of density peaks, Science (2014)
  • B.J. Frey et al., Clustering by passing messages between data points, Science (2007)
  • C.-D. Wang et al., Multi-exemplar affinity propagation, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • X. Han et al., Streaming hierarchical clustering based on point-set kernel, Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)
  • V. Cohen-Addad et al., Scalable differentially private clustering via hierarchically separated trees, Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)
  • L. Song et al., A transcription factor hierarchy defines an environmental stress response network, Science (2016)
  • S.K. Anand et al., Experimental comparisons of clustering approaches for data representation, ACM Comput. Surv. (2022)
  • J. Li et al., Cell clustering for spatial transcriptomics data with graph neural networks, Nat. Comput. Sci. (2022)
  • Z. Nurlan et al., EZ-SEP: extended Z-SEP routing protocol with hierarchical clustering approach for wireless heterogeneous sensor network, Sensors (2021)
  • M.E.J. Newman et al., Finding and evaluating community structure in networks, Phys. Rev. E (2004)
  • H.A. Dugan et al., Salting our freshwater lakes, Proc. Natl. Acad. Sci. USA (2017)

    Zhen Liu received the PhD degree in technology of computer application from the University of Electronic Science and Technology of China (UESTC), in 2007. He was a Visiting Scholar with the Data Mining Lab, Minnesota University, from 2012 to 2013. He has been an Associate Professor with the School of Computer Science and Engineering, UESTC, since 2011. He has published more than 30 peer-reviewed papers in his academic career. His current research interests include data mining, machine learning, and social network analysis.

    Debarati Das is a 4th year PhD student, in the Department of Computer Science at the University of Minnesota. Her area of research interest lies in the intersection of Natural Language Processing and Computational Social Science.

    Bin Chen received the BS degree in software engineering from Chengdu Technological University, China, in 2020. He is currently working toward the MS degree in Southwest Petroleum University. His research interests include machine learning and data mining.

    Jaideep Srivastava directs a laboratory focusing on research in applied machine learning, focused in the areas of Social Media and Health Informatics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), has been an IEEE Distinguished Visitor and has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of machine learning and data mining. He has authored over 440 papers in journals and conferences, and awarded 6 patents. Seven of his papers have won best paper awards. He has held advisory positions with the State of Minnesota, and is advisor to the UID project of the Government of India, whose goal is to provide biometrics-based identification to the 1.30+ billion citizens of India. He received his bachelors from IIT-Kanpur, and his MS and PhD from UC Berkeley, all in computer science.
