CAPKM++2.0: An upgraded version of the collaborative annealing power k-means++ clustering algorithm

doi:10.1016/j.knosys.2022.110241

Knowledge-Based Systems

Volume 262, 28 February 2023, 110241


Abstract

The collaborative annealing power k-means++ (CAPKM++) clustering algorithm was recently proposed; it is built on multiple modules that minimize annealed power-mean functions. This paper presents an upgraded version of CAPKM++ called CAPKM++2.0. Unlike CAPKM++, in which the anchor points of the surrogate functions that majorize the power-mean functions are re-initialized and minimized repeatedly after annealing, CAPKM++2.0 re-initializes the weights of the majorization function during annealing. In addition, whereas CAPKM++ minimizes the majorization function of the power-mean sum, CAPKM++2.0 adds an inner loop that minimizes the power-mean sum iteratively and locally at every annealing step. Ablation study results are discussed to justify the adoption of the power mean and the collaboration of multiple modules. Experimental results on sixteen benchmark datasets demonstrate the superior clustering performance of the upgraded algorithm compared with its predecessor and six other mainstream algorithms in terms of cluster validity indices and algorithmic complexity.

Introduction

As an important procedure in data processing, clustering groups similar data into homogeneous clusters [1] and has widespread applications, such as image segmentation [2], [3], text mining [4], and bioinformatic analysis [5]. Numerous clustering algorithms have been proposed, and they may be classified into several categories from different perspectives, as shown in Fig. 1; e.g., divisive hierarchical clustering [6], agglomerative hierarchical clustering [7], $k$-means [8], [9], [10], $k$-medoids [11], $k$-harmonic means [12], spectral clustering [3], [13], [14], distribution clustering [15], [16], density clustering [17], [18], subspace clustering [19], [20], [21], feature-weighted clustering [22], [23], [24], probabilistic clustering [25], fuzzy clustering [26], [27], [28], cardinality-constrained clustering [29], [30], [31], capacitated clustering [32], [33], must-link and cannot-link constrained clustering [34], and rank-constrained clustering [35].

Lloyd’s algorithm [8] is a classic $k$-means algorithm that greedily minimizes the sum of squared distances between each point and its assigned cluster center. The $k$-means and $k$-means-type algorithms are popular and widely used in data clustering and analysis owing to their efficiency and simplicity. However, their clustering quality depends on the initialization of the centers. To overcome this limitation, many alternative methods have been proposed, such as initialization improvements [11], [36], [37], [38] and objective function improvements [9], [12], [39], [40], [41]. Of particular interest, the power $k$-means (PKM) algorithm [9] clusters data by minimizing the majorization function of an annealed power-mean sum. Nevertheless, its clustering performance still depends on initialization. To eliminate this dependence, CAPKM++ employs multiple PKM modules that are initialized using $k$-means++ and re-initialized repeatedly and collaboratively using a particle swarm optimization rule after each annealing process [10]. Although CAPKM++ outperforms many baselines, its clustering efficiency could be further improved by carrying out cluster re-initialization during the annealing process.
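As a point of reference for the initialization step mentioned above, the following is a minimal sketch of $k$-means++ seeding using the usual $D^2$ sampling rule: the first center is drawn uniformly from the data, and each further center is drawn with probability proportional to its squared distance to the nearest center chosen so far. The function names `sq_dist` and `kmeanspp_seed` and the toy data are illustrative assumptions, not taken from the paper, and the sketch assumes the data contain at least $k$ distinct points.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeanspp_seed(points, k, rng):
    """Pick k initial centers from `points` by D^2 sampling (hypothetical helper)."""
    centers = [rng.choice(points)]  # first center: uniform over the data
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(sq_dist(x, c) for c in centers) for x in points]
        # Points that are already centers have weight 0 and cannot be re-drawn.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Toy data (illustrative values only): two groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
seeds = kmeanspp_seed(points, 2, random.Random(0))
print(seeds)
```

Because far-away points receive larger sampling weights, the second seed tends to fall in a different group from the first, which is what makes the seeding less sensitive than uniform random initialization.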

In this paper, we propose an upgraded version of CAPKM++ called CAPKM++2.0. Instead of re-initializing cluster centers after annealing, CAPKM++2.0 re-initializes the weights in the majorization function during annealing. Additionally, CAPKM++2.0 minimizes the power-mean functions directly rather than their majorization function as in PKM and CAPKM++. The novelties of this work are summarized as follows.

i. Carrying out re-initialization during annealing rather than after it enables more efficient clustering with reduced spatial and temporal complexities.
ii. Minimizing the power-mean sum instead of its majorization function further improves clustering performance in terms of algorithmic efficiency.

The remainder of this paper is organized as follows. Section 2 provides necessary preliminary information on problem formulation, PKM, and CAPKM++. Section 3 describes the proposed CAPKM++2.0 algorithm. Section 4 reports experimental results on sixteen datasets. Section 5 concludes the paper.

Section snippets

Problem statement

Given a data set $X = \{x_{1}, \dots, x_{n}\}$, a $k$-means clustering algorithm partitions the data into $k$ clusters by minimizing the within-cluster distance with the following objective function [8]:

$$f(\Theta) = \sum_{i=1}^{n} \min_{1 \leq j \leq k} \| x_{i} - \theta_{j} \|_{2}^{2}, \tag{1}$$

where $\theta_{j} \in \mathbb{R}^{m}$ is the center (centroid) of cluster $j$ ($j = 1, 2, \dots, k$), and $\Theta = [\theta_{1}, \dots, \theta_{k}]$ is the matrix of cluster centers.
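To make the objective concrete, the following is a minimal sketch that evaluates the $k$-means objective for a given set of centers. The function name `kmeans_objective` and the toy data are illustrative assumptions and do not come from the paper.

```python
def kmeans_objective(points, centers):
    """Sum over points of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for x in points:
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centers
        )
    return total


# Toy data (illustrative values only): two groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers = [(0.5,), (10.5,)]

# Every point lies 0.5 from its nearest center, so the value is 4 * 0.25.
print(kmeans_objective(points, centers))  # 1.0
```

The inner `min` is what makes the objective non-smooth in the centers, which motivates the smoother surrogate used by PKM.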

Fundamentals of the PKM algorithm

Since it is hard to minimize (1) directly, the PKM algorithm utilizes the following smoother function as a surrogate function of (1) [9]:

$$f_{s}(\Theta) := \sum_{i=1}^{n} \left( \frac{1}{k} \sum_{j=1}^{k} \| x_{i} - \theta_{j} \|_{2}^{2s} \right)^{\frac{1}{s}}, \tag{2}$$

where $s < 0$
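As an illustration of how the surrogate relates to the $k$-means objective, the sketch below evaluates the power-mean sum for several negative values of $s$ on toy data. For $s < 0$ the power mean of a set of positive numbers is never smaller than their minimum and tends to the minimum as $s \to -\infty$, so the printed values should decrease toward the $k$-means objective of the same centers (1.0 for this toy example). The names `sq_dist` and `power_mean_objective` and the data are assumptions for illustration; the code also assumes that no data point coincides exactly with a center, since a zero distance cannot be raised to a negative power.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def power_mean_objective(points, centers, s):
    """Sum over points of the power mean (exponent s < 0) of the squared
    distances from the point to all k centers."""
    k = len(centers)
    total = 0.0
    for x in points:
        d = [sq_dist(x, c) for c in centers]  # all distances must be > 0
        total += (sum(dj ** s for dj in d) / k) ** (1.0 / s)
    return total


# Toy data (illustrative values only); the k-means objective here is 1.0.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers = [(0.5,), (10.5,)]

for s in (-1.0, -10.0, -50.0):
    print(s, power_mean_objective(points, centers, s))
```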

CAPKM++2.0

In the PKM and CAPKM++ algorithms, the majorization function of the power-mean sum in (5) is minimized at every step of annealing. Minimizing the power-mean function in (2) itself may attain better clustering results than minimizing its majorization function. To this end, CAPKM++2.0 adds an inner loop that updates the weights and the centers alternately until convergence at every step of annealing.
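The alternating inner loop can be sketched for a single module as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard PKM weight update from [9], $w_{ij} = \frac{1}{k} \| x_{i} - \theta_{j} \|_{2}^{2(s-1)} \big( \frac{1}{k} \sum_{l=1}^{k} \| x_{i} - \theta_{l} \|_{2}^{2s} \big)^{\frac{1}{s} - 1}$, followed by a weighted-mean center update. The annealing factor `eta`, the tolerance `tol`, the iteration caps, and all function names are hypothetical choices, and the collaboration among multiple modules and the weight re-initialization are omitted.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def update_centers(points, centers, s, eps=1e-12):
    """One alternation: recompute the weights at the current centers, then
    move each center to the weighted mean of the data."""
    k, dim = len(centers), len(points[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for x in points:
        d = [max(sq_dist(x, c), eps) for c in centers]
        d_min = min(d)
        # Dividing by the smallest distance leaves the weight unchanged
        # (the scale cancels) and keeps the negative powers bounded by 1.
        r = [dj / d_min for dj in d]
        mean_pow = sum(rj ** s for rj in r) / k
        for j in range(k):
            w = (r[j] ** (s - 1)) * mean_pow ** (1.0 / s - 1.0) / k
            den[j] += w
            for t in range(dim):
                num[j][t] += w * x[t]
    # Keep the old center if all of its weights vanished numerically.
    return [
        tuple(num[j][t] / den[j] for t in range(dim)) if den[j] > 0 else centers[j]
        for j in range(k)
    ]


def anneal(points, centers, s0=-1.0, eta=1.5, steps=10, tol=1e-9, max_inner=100):
    """Annealing over s with an inner loop run to convergence at each s."""
    s = s0
    for _ in range(steps):
        for _ in range(max_inner):  # inner loop at a fixed s
            new = update_centers(points, centers, s)
            shift = max(sq_dist(a, b) for a, b in zip(new, centers))
            centers = new
            if shift < tol:
                break
        s *= eta  # s becomes more negative
    return centers


# Toy data and initial centers (illustrative values only).
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(anneal(points, [(2.0,), (9.0,)]))
```

On this toy example the two centers should settle near the means of the two groups, 0.5 and 10.5. Running the inner loop to convergence at each value of $s$ is what distinguishes this sketch from performing a single weight and center update per annealing step.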

In CAPKM++, the re-initialization of cluster centers is performed after each annealing process until

Setups

The experimental results are based on sixteen commonly used datasets, as listed in Table 1: NCI9 [44], Lymphoma [44], ORL10P [44], WarpPIE10P [44], Segment [45], SpamBase [45], PageBlocks [45], Texture [45], Optdigits [45], Satimage [45], COIL2000 [45], Penbased [45], WineQuality-Red [45] (WQ-Red), WineQuality-White [45] (WQ-White), Banana [45], and Phoneme [45].

The clustering performance of CAPKM++2.0 is compared with that of the following seven baselines: $k$-means (KM),¹

Concluding remarks

This paper presents an upgraded version of CAPKM++ for $k$-means clustering. With reduced algorithmic complexities, CAPKM++2.0 is more efficient than CAPKM++, as it minimizes the power-mean functions directly rather than minimizing their majorization function and re-initializes the weights of the majorization function during annealing instead of re-initializing the anchor points after annealing. The experimental results on sixteen datasets demonstrate that the upgraded algorithm statistically

CRediT authorship contribution statement

Hongzong Li: Data curation, Software, Investigation, Validation, Writing – original draft. Jun Wang: Conceptualization, Methodology, Writing – review & editing, Funding acquisition, Resources, Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (45)

Li H. et al. Collaborative annealing power k-means++ clustering. Knowl.-Based Syst. (2022)
Chen W. et al. DP-GMM clustering-based ensemble learning prediction methodology for dam deformation considering spatiotemporal differentiation. Knowl.-Based Syst. (2021)
Guo W. et al. Density Peak Clustering with connectivity estimation. Knowl.-Based Syst. (2022)
Zheng Q. et al. Constrained bilinear factorization multi-view subspace clustering. Knowl.-Based Syst. (2020)
Wei L. et al. Subspace clustering via adaptive least square regression with smooth affinities. Knowl.-Based Syst. (2022)
Gao Y. et al. A new robust fuzzy c-means clustering method based on adaptive elastic distance. Knowl.-Based Syst. (2022)
Zhou P. et al. Unsupervised feature selection for balanced clustering. Knowl.-Based Syst. (2020)
Caraballo L.E. et al. A polynomial algorithm for balanced clustering via graph partitioning. European J. Oper. Res. (2021)
Dai X. et al. Balanced clustering based on collaborative neurodynamic optimization. Knowl.-Based Syst. (2022)
Mai F. et al. Model-based capacitated clustering with posterior regularization. European J. Oper. Res. (2018)
Güngör Z. et al. K-harmonic means data clustering with simulated annealing heuristic. Appl. Math. Comput. (2007)
Yang F. et al. An efficient hybrid data clustering method based on K-harmonic means and Particle Swarm Optimization. Expert Syst. Appl. (2009)
Li J. et al. Feature selection: A data perspective. ACM Comput. Surv. (2018)
Jain A.K. et al. Algorithms for Clustering Data. (1988)
Wu Z. et al. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (1993)
Shi J. et al. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2000)
Berry M.W. et al. Survey of Text Mining II: Clustering, Classification, and Retrieval. (2007)
Abu-Jamous B. et al. Integrative Cluster Analysis in Bioinformatics. (2015)
Macnaughton-Smith P. et al. Dissimilarity analysis: a new technique of hierarchical sub-division. Nature (1964)
Johnson S.C. Hierarchical clustering schemes. Psychometrika (1967)
Lloyd S. Least squares quantization in PCM. IEEE Trans. Inform. Theory (1982)
Xu J. et al. Power k-means clustering


^☆: This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region of China under Grants 11202318, 11202019, and 11203721; and in part by the InnoHK initiative, the Government of the Hong Kong Special Administrative Region, and Laboratory for AI-Powered Financial Technologies.
