Elsevier

Knowledge-Based Systems

Volume 262, 28 February 2023, 110241

CAPKM++2.0: An upgraded version of the collaborative annealing power k-means++ clustering algorithm

https://doi.org/10.1016/j.knosys.2022.110241

Abstract

The collaborative annealing power k-means++ (CAPKM++) clustering algorithm was recently proposed; it combines multiple modules that minimize annealed power-mean functions. This paper presents an upgraded version of CAPKM++ called CAPKM++2.0. Unlike CAPKM++, in which the anchor points of the surrogate functions that majorize the power-mean functions are re-initialized and minimized repeatedly after annealing, CAPKM++2.0 re-initializes the weights of the majorization function during annealing. In addition, whereas CAPKM++ minimizes the majorization function of the power-mean sum, CAPKM++2.0 adds an inner loop to minimize the power-mean sum iteratively and locally at every annealing step. Ablation study results are discussed to justify the adoption of the power mean and the collaboration of multiple modules. Experimental results on sixteen benchmark datasets are reported to demonstrate the superior clustering performance of the upgraded algorithm compared with its predecessor and six other mainstream algorithms in terms of cluster validity indices and algorithmic complexities.

Introduction

As an important data-processing procedure, clustering groups similar data into homogeneous clusters [1] and has widespread applications, such as image segmentation [2], [3], text mining [4], and bioinformatic analysis [5]. Numerous clustering algorithms have been proposed, and they may be classified into several categories from different perspectives, as shown in Fig. 1; e.g., divisive hierarchical clustering [6], agglomerative hierarchical clustering [7], k-means [8], [9], [10], k-medoids [11], k-harmonic means [12], spectral clustering [3], [13], [14], distribution clustering [15], [16], density clustering [17], [18], subspace clustering [19], [20], [21], feature-weighted clustering [22], [23], [24], probabilistic clustering [25], fuzzy clustering [26], [27], [28], cardinality-constrained clustering [29], [30], [31], capacitated clustering [32], [33], must-link and cannot-link constrained clustering [34], and rank-constrained clustering [35].

Lloyd’s algorithm [8] is a classic k-means algorithm that minimizes the sum of squared distances between each point and its assigned cluster center in a greedy way. The k-means and k-means-type algorithms are very popular and widely used in data clustering and analysis, owing to their efficiency and simplicity. However, their clustering quality varies depending on the initialization of centers. To overcome this limitation, many alternative methods have been proposed, such as initialization improvements [11], [36], [37], [38] and objective function improvements [9], [12], [39], [40], [41]. Of particular interest, the power k-means (PKM) algorithm [9] clusters data by minimizing the majorization function of an annealed power-mean sum. Nevertheless, its clustering performance still depends on initialization. To eliminate the initialization dependence, CAPKM++ employs multiple PKM modules that are initialized by using k-means++ and re-initialized repeatedly and collaboratively using a particle swarm optimization rule after each annealing process [10]. Although CAPKM++ outperforms many baselines, its clustering efficiency could be further improved by carrying out cluster re-initialization during the annealing process.
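To make the initialization dependence concrete, the following is a minimal sketch of Lloyd's alternating assignment and update steps in plain Python. The function names (`sq_dist`, `lloyd`), the toy data, and the two starting configurations are illustrative assumptions and are not taken from the paper.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def lloyd(X, centers, max_iter=100):
    """Greedy k-means: alternate nearest-center assignment and mean update."""
    centers = [tuple(float(v) for v in c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
                  for x in X]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # no center moved: a local minimum has been reached
        centers = new_centers
    objective = sum(min(sq_dist(x, c) for c in centers) for x in X)
    return centers, objective


# Four points forming two well-separated pairs (hypothetical toy data).
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

_, good = lloyd(X, [(0.0, 0.0), (10.0, 0.0)])  # one seed per pair
_, bad = lloyd(X, [(5.0, 0.0), (5.0, 1.0)])    # both seeds between the pairs
print(good, bad)  # 1.0 100.0
```

With the second starting configuration, the assignment step splits the points by row rather than by pair, the centers do not move, and the run terminates at a much worse local minimum.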

In this paper, we propose an upgraded version of CAPKM++ called CAPKM++2.0. Instead of re-initializing cluster centers after annealing, CAPKM++2.0 re-initializes the weights in the majorization function during annealing. Additionally, CAPKM++2.0 minimizes the power-mean functions directly rather than their majorization function as in PKM and CAPKM++. The novelties of this work are summarized as follows.

  i. Carrying out re-initialization during annealing rather than after it enables more efficient clustering with reduced space and time complexities.

  ii. Minimizing the power-mean sum directly, instead of its majorization function, further improves clustering performance in terms of algorithmic efficiency.

The remainder of this paper is organized as follows. Section 2 provides necessary preliminary information on problem formulation, PKM, and CAPKM++. Section 3 describes the proposed CAPKM++2.0 algorithm. Section 4 reports experimental results on sixteen datasets. Section 5 concludes the paper.

Section snippets

Problem statement

Given a data set $X=\{x_1,\dots,x_n\}$, a k-means clustering algorithm partitions the data into $k$ clusters by minimizing the within-cluster distance with the following objective function [8]:

$$f(\Theta)=\sum_{i=1}^{n}\min_{1\le j\le k}\|x_i-\theta_j\|_2^2, \tag{1}$$

where $\theta_j\in\mathbb{R}^m$ is the center (centroid) of cluster $j$ ($j=1,2,\dots,k$) and $\Theta=[\theta_1,\dots,\theta_k]$ is the matrix of cluster centers.
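As a small illustration, the k-means objective can be evaluated directly from its definition. The sketch below uses plain Python; the names `sq_dist` and `kmeans_objective` and the toy data are assumptions made for the example.

```python
def sq_dist(x, theta):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, theta))


def kmeans_objective(X, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, theta) for theta in centers) for x in X)


# Hypothetical toy data: two pairs of points and one center per pair.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers = [(0.0, 0.5), (10.0, 0.5)]

# Every point lies 0.5 from its nearest center, so the sum is 4 * 0.25.
print(kmeans_objective(X, centers))  # 1.0
```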

Fundamentals of the PKM algorithm

Since it is hard to minimize (1) directly, the PKM algorithm utilizes the following smoother function as a surrogate function of (1) [9]:

$$f_s(\Theta)=\sum_{i=1}^{n}\Big(\frac{1}{k}\sum_{j=1}^{k}\|x_i-\theta_j\|_2^{2s}\Big)^{\frac{1}{s}}, \tag{2}$$

where $s<0$
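The relation between the power-mean sum and the k-means objective can be checked numerically. The sketch below, in plain Python, evaluates the power-mean sum for increasingly negative s on toy data and compares it with the k-means objective. The function names and data are illustrative assumptions, and the code assumes that no point coincides exactly with a center, since a zero distance raised to a negative power is undefined.

```python
def sq_dist(x, theta):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, theta))


def power_mean(values, s):
    """Power mean ((1/k) * sum(v**s)) ** (1/s) of strictly positive values."""
    k = len(values)
    return (sum(v ** s for v in values) / k) ** (1.0 / s)


def power_mean_objective(X, centers, s):
    """Sum over points of the power mean of squared distances to all centers."""
    return sum(power_mean([sq_dist(x, theta) for theta in centers], s) for x in X)


def kmeans_objective(X, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, theta) for theta in centers) for x in X)


# Hypothetical toy data; no point coincides with a center.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers = [(0.0, 0.5), (10.0, 0.5)]

for s in (-1.0, -5.0, -50.0):
    print(s, power_mean_objective(X, centers, s))
print("k-means objective:", kmeans_objective(X, centers))
```

The printed values decrease toward the k-means objective as s becomes more negative, because the power mean of a set of positive numbers is never smaller than their minimum and approaches it as s tends to minus infinity.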

CAPKM++2.0

In the PKM and CAPKM++ algorithms, the majorization function of the power-mean sum in (5) is minimized at every step of annealing. Minimizing the power-mean function in (2) may attain better clustering results than minimizing its majorization function. To do so, in CAPKM++2.0, an inner loop is added to update the weights and the centers alternately until convergence at every step of annealing.
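The alternating inner loop can be sketched for a single module as follows. This is a hedged illustration rather than the authors' implementation: the weight formula is the partial derivative of the power mean with respect to each squared distance, which is assumed here to match the majorization weights of power k-means [9]; the geometric schedule `s *= eta`, the parameter values, and all function names are assumptions; and the collaboration among modules and the weight re-initialization of CAPKM++2.0 are not reproduced.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def mm_weights(X, centers, s, eps=1e-12):
    """Assumed weights: w_ij = (1/k) * d_ij**(s-1) * ((1/k) * sum_l d_il**s)**(1/s - 1).

    The expression is invariant to rescaling the distances of one point, so they
    are divided by their minimum to avoid overflow when s is strongly negative.
    """
    k = len(centers)
    W = []
    for x in X:
        d = [sq_dist(x, c) + eps for c in centers]  # eps guards zero distances
        m = min(d)
        r = [v / m for v in d]                      # ratios >= 1
        mean_s = sum(v ** s for v in r) / k
        W.append([v ** (s - 1.0) * mean_s ** (1.0 / s - 1.0) / k for v in r])
    return W


def update_centers(X, W, centers):
    """Weighted means of the data; a center with zero total weight is kept."""
    new_centers = []
    for j, old in enumerate(centers):
        total = sum(row[j] for row in W)
        if total > 0.0:
            new_centers.append(tuple(
                sum(row[j] * x[t] for row, x in zip(W, X)) / total
                for t in range(len(old))))
        else:
            new_centers.append(old)
    return new_centers


def annealed_power_kmeans(X, centers, s0=-1.0, eta=1.1,
                          outer_steps=40, inner_steps=50, tol=1e-9):
    s = s0
    for _ in range(outer_steps):          # annealing steps
        for _ in range(inner_steps):      # inner loop at a fixed s
            W = mm_weights(X, centers, s)
            new_centers = update_centers(X, W, centers)
            shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
            centers = new_centers
            if shift < tol:
                break                     # centers have stopped moving
        s *= eta                          # assumed schedule: s becomes more negative
    return centers


# Hypothetical toy data and starting centers.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
result = annealed_power_kmeans(X, [(1.0, 0.0), (8.0, 1.0)])
print(result)
```

On this toy data the centers settle close to the means of the two pairs of points. Repeating the weight and center updates at a fixed s, instead of taking a single update per annealing step, is what distinguishes the inner loop in this sketch from a one-step majorization scheme.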

In CAPKM++, the re-initialization of cluster centers is performed after each annealing process until

Setups

The experimental results are based on sixteen commonly used datasets as listed in Table 1: NCI9 [44], Lymphoma [44], ORL10P [44], WarpPIE10P [44], Segment [45], SpamBase [45], PageBlocks [45], Texture [45], Optdigits [45], Satimage [45], COIL2000 [45], Penbased [45], WineQuality-Red [45] (WQ-Red), WineQuality-White [45] (WQ-White), Banana [45], Phoneme [45].

The clustering performance of CAPKM++2.0 is compared with those of the following seven baselines: k-means (KM),1

Concluding remarks

This paper presents an upgraded version of CAPKM++ for k-means clustering. With reduced algorithmic complexities, CAPKM++2.0 is more efficient than CAPKM++, as it minimizes the power-mean functions directly rather than minimizing their majorization function and re-initializes the weights of the majorization function during annealing instead of re-initializing the anchor points after annealing. The experimental results on sixteen datasets demonstrate that the upgraded algorithm statistically

CRediT authorship contribution statement

Hongzong Li: Data curation, Software, Investigation, Validation, Writing – original draft. Jun Wang: Conceptualization, Methodology, Writing – review & editing, Funding acquisition, Resources, Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (45)

  • Güngör Z. et al. K-harmonic means data clustering with simulated annealing heuristic. Appl. Math. Comput. (2007)
  • Yang F. et al. An efficient hybrid data clustering method based on K-harmonic means and Particle Swarm Optimization. Expert Syst. Appl. (2009)
  • Li J. et al. Feature selection: A data perspective. ACM Comput. Surv. (2018)
  • Jain A.K. et al. Algorithms for Clustering Data. (1988)
  • Wu Z. et al. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (1993)
  • Shi J. et al. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • Berry M.W. et al. Survey of Text Mining II: Clustering, Classification, and Retrieval. (2007)
  • Abu-Jamous B. et al. Integrative Cluster Analysis in Bioinformatics. (2015)
  • Macnaughton-Smith P. et al. Dissimilarity analysis: a new technique of hierarchical sub-division. Nature (1964)
  • Johnson S.C. Hierarchical clustering schemes. Psychometrika (1967)
  • Lloyd S. Least squares quantization in PCM. IEEE Trans. Inform. Theory (1982)
  • Xu J. et al. Power k-means clustering

This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region of China under Grants 11202318, 11202019, and 11203721; and in part by the InnoHK initiative, the Government of the Hong Kong Special Administrative Region, and Laboratory for AI-Powered Financial Technologies.
