Scalable clustering by aggregating representatives in hierarchical groups
Introduction
Clustering algorithms can be divided into many types based on their internal mechanisms, such as partition-based algorithms [1], [2], density-based algorithms [3], [4], spectral clustering [5], [6], affinity propagation [7], [8], and hierarchical algorithms [9], [10]. Among these, hierarchical clustering is often preferred in scenario-aware data mining and analysis [11], [12] because of the high interpretability of its clustering results [13], [14]. Consequently, hierarchical clustering algorithms have been widely applied in a variety of scientific areas, e.g., computational biology [15], communication engineering [16], complex networked systems [17], and environmental sciences [18].
With the advance of technologies such as 5G, smart terminals, and Industry 4.0, the scale of data has been growing dramatically, bringing new challenges to hierarchical clustering methods. To analyze such large-scale data and discover practical knowledge therein, much recent research has focused on scalable hierarchical clustering algorithms. Efforts to improve the scalability of hierarchical clustering have been made mainly in two directions: (i) reducing the cost of pairwise distance calculations between data points [19], [20]; (ii) constructing the cluster tree more efficiently [21], [22]. Moreover, parallel frameworks and distributed technology have also been applied to enhance the scalability of existing algorithms [23], without changing the algorithms' fundamental time complexity. However, clustering accuracy and scalability tend to trade off against each other. Gains in scalability usually come with approximations such as sampling [19], localized optimization strategies [20], or stepwise hybrid strategies [24], [25], which cause a drop in accuracy. Gains in clustering accuracy, in turn, are usually accompanied by a cumbersome optimization process [22], [26], which incurs additional space and time costs. As a result, balancing empirical performance and scalability of hierarchical clustering remains a long-term challenge [27].
To address this issue, our contributions in this paper are as follows.
(1) To better handle large-scale data, we develop a divide-and-conquer framework called Election Tree to implement Reciprocal-nearest-neighbor Supported Clustering (RSC) [28] (Section 4.1).
(2) We formalize RSC as a graph problem and devise a hybrid strategy that selects representative nodes based on structural features and spatial position features, which reduces the extra memory consumed in finding the representatives [28], [29] (Section 4.3).
(3) We propose an optimization approach that performs cost-effective re-arrangements of the tree structure, thereby improving clustering accuracy at a low time cost (Section 4.4).
(4) Through experiments on real-world and synthetic datasets, we empirically verify that the proposed algorithm has overall advantages in clustering accuracy (measured by Rand Index) as well as scalability, compared with the baselines (Section 6).
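Since the framework builds on reciprocal nearest neighbors (RNNs), a minimal sketch of how RNN pairs can be identified may clarify the underlying notion. The function names and toy points below are illustrative assumptions, not the paper's implementation:

```python
import math

def nearest_neighbor(points, i):
    """Index of the nearest neighbor of points[i] under Euclidean distance."""
    best, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_d:
            best, best_d = j, d
    return best

def reciprocal_pairs(points):
    """All index pairs (i, j), i < j, that are each other's nearest neighbor."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn) if j > i and nn[j] == i]

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 0.0)]
print(reciprocal_pairs(points))  # [(0, 1), (2, 3)] — the isolated point joins no pair
```

An RNN pair is a natural candidate for an early merge, since each member is the other's closest point; this is the intuition that RSC-style methods exploit.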
Section snippets
Related work
Linkage-based algorithms. Classical hierarchical clustering algorithms, such as Group Average (GA) [30] and the Nearest Neighborhood method [31], can ensure the theoretical soundness of the clustering process. These algorithms apply optimization theory to obtain globally optimized clustering results but usually scale poorly to large datasets. To improve scalability, more advanced algorithms have been proposed using different approximation
Term definition
Before describing the proposed algorithm in detail, we first introduce several important concepts that will be used frequently in the remainder of this paper. Definition 1 Sub-Minimum-Spanning-Trees (sub-MSTs for short): A sub-MST is a subtree of the minimum spanning tree. Given a dataset and its corresponding minimum spanning tree, the sub-MSTs meet all of the following conditions: ; ; . It is important to note that sub-MSTs can be aggregated into a complete MST by using the single-linkage
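To make the sub-MST notion concrete, the following sketch builds a minimum spanning tree with Kruskal's algorithm on hypothetical toy points and splits it at its longest edge, yielding two sub-MSTs; conversely, the sub-MSTs can be merged back into the complete MST through that single-linkage (minimum inter-tree distance) edge. This is an illustration under assumed names, not the paper's construction:

```python
import math
from itertools import combinations

def kruskal_mst(points):
    """Kruskal's algorithm: return MST edges as (dist, i, j) triples."""
    parent = list(range(len(points)))
    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    mst = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((d, i, j))
    return mst

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
mst = kruskal_mst(points)          # n - 1 = 4 edges
bridge = max(mst)                  # longest MST edge = single-linkage bridge
sub_mst_edges = [e for e in mst if e != bridge]  # edges of the two sub-MSTs
```

Removing the bridge edge (length 8.0 here) leaves two edge-disjoint subtrees over {0, 1, 2} and {3, 4}; re-adding it reassembles the full MST.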
Election tree
In this section, we propose a reciprocal-nearest-neighbor-based scalable hierarchical clustering algorithm named Election Tree.
Complexity analysis
Given a dataset with n items and the algorithm's parameters, we analyze the time complexity of our algorithm step by step as follows.
(I) Segmentation: In this stage, the n data points are randomly assigned to data segments, each holding roughly an equal share of the data. Obviously, this is an O(n) process.
(II) Election: In this stage, two parts iterate alternately. (a) In the sub-MST construction part, we generate a 1-nearest-neighbor graph for the given input. Generally, for multi-dimensional data with n items, it
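The O(n) segmentation stage can be sketched as follows; the function name and round-robin split are illustrative assumptions, not the paper's code:

```python
import random

def segment(n, s, seed=0):
    """Randomly partition item indices 0..n-1 into s segments of ~n/s items.

    One shuffle plus one pass over the indices, i.e., O(n) time.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # O(n) Fisher-Yates shuffle
    return [idx[k::s] for k in range(s)]  # round-robin split into s segments

segments = segment(10, 3)
# Segment sizes differ by at most one, and every index appears exactly once.
```

Each segment can then be processed independently (e.g., building its 1-nearest-neighbor graph), which is what makes the divide-and-conquer framework amenable to parallelization.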
Experiments
Using real-world and synthetic datasets, we conducted comprehensive experimental studies to evaluate the clustering accuracy (Rand Index), CPU time, and parameter sensitivity of our algorithm, compared with baseline methods.
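The Rand Index used as the accuracy measure is the fraction of point pairs on which two clusterings agree (both place the pair in the same cluster, or both place it in different clusters). A minimal pure-Python sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two label assignments agree."""
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in combinations(range(len(labels_a)), 2))
    n_pairs = len(labels_a) * (len(labels_a) - 1) // 2
    return agree / n_pairs

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 — label names don't matter
```

The index ranges in [0, 1], with 1 meaning the two clusterings induce identical pair groupings; it is invariant to permutations of cluster labels.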
Conclusion
In this paper, we propose the Election Tree algorithm to address the problem of scalable hierarchical clustering. Unlike traditional hierarchical clustering models, our model comprehensively addresses how to better determine the root in an RNN pair by using graph theory and spatial topology features. We also employ a novel Merging&Swap strategy, which properly adjusts the nodes at the boundaries between clusters and significantly improves the accuracy of
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 61803073, 61703074, and the Young Scholars Development Fund of SWPU under Grant No. 202199010142.
References (45)
- et al., Robust deep k-means: an effective and simple method for data clustering, Pattern Recognit. (2021)
- et al., Adaptive core fusion-based density peak clustering for complex data with arbitrary shapes and densities, Pattern Recognit. (2020)
- et al., Refining a k-nearest neighbor graph for a computationally efficient spectral clustering, Pattern Recognit. (2021)
- et al., Self-supervised spectral clustering with exemplar constraints, Pattern Recognit. (2022)
- et al., A hierarchical weighted low-rank representation for image clustering and classification, Pattern Recognit. (2021)
- et al., A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects, Eng. Appl. Artif. Intel. (2022)
- et al., CURE: an efficient clustering algorithm for large databases, Inf. Syst. (2001)
- et al., Genie: a new, fast, and outlier-resistant hierarchical clustering algorithm, Inform. Sci. (2016)
- et al., Hierarchical clustering supported by reciprocal nearest neighbors, Inform. Sci. (2020)
- et al., Effective hierarchical clustering based on structural similarities in nearest neighbor graphs, Knowl.-Based Syst. (2021)
- Robust continuous clustering, Proc. Natl. Acad. Sci. USA
- Clustering by fast search and find of density peaks, Science
- Clustering by passing messages between data points, Science
- Multi-exemplar affinity propagation, IEEE Trans. Pattern Anal. Mach. Intell.
- Streaming hierarchical clustering based on point-set kernel, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
- Scalable differentially private clustering via hierarchically separated trees, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
- A transcription factor hierarchy defines an environmental stress response network, Science
- Experimental comparisons of clustering approaches for data representation, ACM Comput. Surv.
- Cell clustering for spatial transcriptomics data with graph neural networks, Nat. Comput. Sci.
- EZ-SEP: extended Z-SEP routing protocol with hierarchical clustering approach for wireless heterogeneous sensor network, Sensors
- Finding and evaluating community structure in networks, Phys. Rev. E
- Salting our freshwater lakes, Proc. Natl. Acad. Sci. USA
Cited by (9)
- Supports estimation via graph sampling, Expert Systems with Applications (2024)
- Segmentary group-sparsity self-representation learning and spectral clustering via double L21 norm, Knowledge-Based Systems (2024)
- Boosting cluster tree with reciprocal nearest neighbors scoring, Engineering Applications of Artificial Intelligence (2024)
- The impact of isolation kernel on agglomerative hierarchical clustering algorithms, Pattern Recognition (2023)
- Hierarchical Clustering-Based Collapse Mode Identification and Design Optimization of Energy-Dissipation Braces Inspired by the Triangular Resch Pattern, Journal of Structural Engineering (United States) (2024)
Wen-Bo Xie received the PhD degree in technology of computer application from the University of Electronic Science and Technology of China (UESTC), in 2021. He started this work when he was a PhD candidate with the UESTC. Now he is with the School of Computer Science, Southwest Petroleum University. His research interests include the fields of machine learning, data mining and graph mining and, in particular, focus on clustering and knowledge graph.
Zhen Liu received the PhD degree in technology of computer application from the University of Electronic Science and Technology of China (UESTC), in 2007. He was a Visiting Scholar with the Data Mining Lab, Minnesota University, from 2012 to 2013. He has been an Associate Professor with the School of Computer Science and Engineering, UESTC, since 2011. He has published more than 30 peer-reviewed papers in his academic career. His current research interests include data mining, machine learning, and social network analysis.
Debarati Das is a 4th year PhD student, in the Department of Computer Science at the University of Minnesota. Her area of research interest lies in the intersection of Natural Language Processing and Computational Social Science.
Bin Chen received the BS degree in software engineering from Chengdu Technological University, China, in 2020. He is currently working toward the MS degree in Southwest Petroleum University. His research interests include machine learning and data mining.
Jaideep Srivastava directs a laboratory focusing on research in applied machine learning, focused in the areas of Social Media and Health Informatics. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), has been an IEEE Distinguished Visitor and has been awarded the Distinguished Research Contributions Award of the PAKDD, for his lifetime contributions to the field of machine learning and data mining. He has authored over 440 papers in journals and conferences, and awarded 6 patents. Seven of his papers have won best paper awards. He has held advisory positions with the State of Minnesota, and is advisor to the UID project of the Government of India, whose goal is to provide biometrics-based identification to the 1.30+ billion citizens of India. He received his bachelors from IIT-Kanpur, and his MS and PhD from UC Berkeley, all in computer science.