
Information Sciences

Volume 548, 16 February 2021, Pages 177-190

Improved I-nice clustering algorithm based on density peaks mechanism

https://doi.org/10.1016/j.ins.2020.09.068

Abstract

Recently, Masud et al. [M. A. Masud, J. Z. Huang, C. H. Wei, et al., I-nice: A new approach for identifying the number of clusters and initial cluster centres, Information Sciences 466 (2018) 129–151] proposed a parameter-free clustering algorithm, named I-nice, which can identify the number of clusters and initial cluster centres using observation points. Although the experiments demonstrated the good clustering performance of I-nice, it has two inherent limitations that can be further improved. One is that I-niceSO is sensitive to the position of the observation point, and the other is that the number of nearest neighbours affects the determination of high-density areas in I-niceMO. Inspired by density peaks clustering, we propose a density-peaks-based I-nice (I-niceDP) clustering algorithm that improves the existing I-nice clustering algorithm. In I-niceDP, we use density peaks rather than the k-nearest-neighbours method to determine the number of clusters and cluster centres in the components of the gamma mixture model. The comparative results with I-niceSO and I-niceMO indicate that I-niceDP can more accurately identify the number of clusters and initial cluster centres for datasets with large cluster numbers. Furthermore, I-niceDP obtains higher normalised mutual information values than seven other clustering algorithms. The experimental results demonstrate the feasibility and effectiveness of the I-niceDP clustering algorithm.

Introduction

Currently, parameter-free clustering algorithms are one of the research hotspots in the field of unsupervised learning [6]. Their main advantage is that they do not require input parameters (e.g. the number of clusters and initial cluster centres) to be assigned by the researcher before the algorithm is trained on a given dataset. In real applications, the number of clusters and initial cluster centres are generally unknown, and an inappropriate predetermination of these input parameters will result in unsatisfactory clustering results, particularly for datasets with large cluster numbers. Although some niche-targeting methods exist [5], [15], [22], [25], [26] to optimise the number of clusters and initial cluster centres, they suffer from either unstable clustering results or high computational complexity. Therefore, it is important for academia and industry to develop a parameter-free clustering algorithm that can automatically and correctly identify the input parameters with acceptable computational complexity.

I-nice [12] is one such parameter-free clustering algorithm. By imitating human beings (i.e. the observation points) observing the peaks of mountains, I-nice can automatically identify the number of clusters and select the initial cluster centres for a given dataset. I-nice includes two different versions, namely, single observation point-based (I-niceSO) and multiple observation points-based (I-niceMO). I-niceSO first determines the number of clusters and then identifies the cluster centres, whereas I-niceMO first identifies the cluster centres and then determines the number of clusters. I-nice uses gamma mixture models (GMMs) [7], [27] to represent the distributions of distances between the observation points and the original data points, and the k-nearest-neighbours method [3], [19] to determine the high-density areas. The number of gamma components gives the number of clusters, and the centres of the high-density areas serve as the cluster centres. The experimental results [12] indicated that the I-nice algorithms significantly outperformed the state-of-the-art elbow and silhouette methods in finding the correct number of clusters for both synthetic and real-world datasets.
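The core idea above, that the number of modes in the distribution of distances from an observation point reveals the number of clusters, can be illustrated with a minimal sketch. Note that this substitutes a kernel density estimate plus a peak search for the full gamma-mixture fit used by I-nice; the function name and parameters are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

def count_distance_modes(X, observation_point, grid_size=512):
    """Count modes in the distribution of distances from a single
    observation point to all data points. In I-nice each mode (one
    gamma component in the fitted GMM) corresponds to one candidate
    cluster; here a KDE with a peak search stands in for that fit."""
    d = np.linalg.norm(X - observation_point, axis=1)
    grid = np.linspace(d.min(), d.max(), grid_size)
    density = gaussian_kde(d)(grid)
    peaks, _ = find_peaks(density)
    return len(peaks)

# Two well-separated blobs observed from a corner point: seen from
# (-3, -3), the two clusters produce two distinct distance modes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 2)),
               rng.normal(5, 0.3, (200, 2))])
n_modes = count_distance_modes(X, np.array([-3.0, -3.0]))
```

This also makes the sensitivity to the observation point concrete: an observation point equidistant from both blobs would merge the two distance modes into one.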

Although I-nice achieves better clustering performance than the existing methods, it has two inherent limitations that need to be addressed to enhance its clustering capability. One is that I-niceSO is sensitive to the position of the observation point. An improper position of the observation point generates inaccurate distributions of distances between the observation point and the original data points, and further causes an incorrect estimation of the cluster number. The other is that the number of nearest neighbours (i.e. k) affects the determination of high-density areas in I-niceMO. I-niceMO is an evolved version of I-niceSO that uses multiple observation points rather than one in the clustering process. A value of k that is too large or too small will blur the distinction between high-density areas. In addition, I-niceMO uses a fixed value of k to determine the high-density areas for all gamma components.

To overcome the above-mentioned shortcomings of I-nice, we propose a density-peaks-based I-nice (I-niceDP) clustering algorithm, which improves the I-nice clustering algorithm using the density peaks mechanism. Inspired by density peak clustering algorithms [10], [11], [13], [16], [23], [28], [29], I-niceDP uses density peaks rather than the k-nearest-neighbours method to determine the number of clusters and cluster centres in the GMM components. In I-niceDP, a cluster centre must have a high density value; furthermore, its minimal distance to any data point of higher density must be larger than a predefined threshold. The main advantage of I-niceDP is its use of a sophisticated statistical technique (i.e. the kernel density estimation method [8], [14]) to replace the inefficient simple neighbour counting used to distinguish the high-density areas. We performed a series of experiments to demonstrate the feasibility and effectiveness of I-niceDP. The comparative results with I-niceSO and I-niceMO indicate that I-niceDP can more accurately identify the number of clusters and initial cluster centres for datasets with large cluster numbers. In addition, our proposed I-niceDP yields better normalised mutual information (NMI) values when compared with seven other clustering algorithms.
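The two density-peaks quantities described above (a high local density, and a large minimal distance to any higher-density point) can be sketched as follows. This is a generic illustration of the density peaks mechanism under assumed choices (a Gaussian kernel for the density and a hypothetical cutoff distance `dc`), not the exact procedure I-niceDP applies inside the GMM components.

```python
import numpy as np

def density_peaks(X, dc=1.0):
    """For each point i compute:
    rho_i:   local density via a Gaussian kernel with cutoff dc,
    delta_i: minimum distance to any point of higher density
             (for the globally densest point, the maximum distance
              to any other point).
    Candidate cluster centres have both large rho and large delta."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0  # subtract self-term
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta

# Two blobs of 100 points each; the top-2 gamma = rho * delta scores
# should pick one centre from each blob (indices < 100 and >= 100).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (100, 2)),
               rng.normal(4, 0.2, (100, 2))])
rho, delta = density_peaks(X, dc=0.5)
centres = np.argsort(rho * delta)[-2:]
```

The product rho * delta separates centres sharply because non-centre points always have a higher-density neighbour nearby, giving them a small delta regardless of their density.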

The remainder of this paper is organised as follows. In Section 2, we provide a brief review of the I-nice clustering algorithm. In Section 3, we describe the density-peaks-based I-nice clustering algorithm, that is, I-niceDP. In Section 4, we report the experimental comparisons that demonstrate the feasibility and effectiveness of I-niceDP. Finally, in Section 5, we list our conclusions and suggestions for future research.


Review of I-nice clustering algorithm

I-nice [12] is a k-means-type algorithm that can automatically identify the number of clusters for a given dataset and select the initial cluster centres from the high-density areas. The data space is treated as a terrain where the clusters are hill peaks. First, the observation points, which are imagined as human beings to observe the number of hill peaks, are arbitrarily allocated in the original data space. Second, the distances between the observation points and original data points are

Density-peaks-based I-nice clustering algorithm

In this section, we first analyse the limitations of the I-niceSO and I-niceMO clustering algorithms and then present our improved density-peaks-based I-nice (I-niceDP) clustering algorithm.

Experimental results and analysis

In this section, we present a series of experiments to validate the feasibility and effectiveness of the proposed I-niceDP clustering algorithm on eight synthetic datasets and eight benchmark datasets (UCI [9] and KEEL [21]). We used the synthetic datasets to test the capability of I-niceDP to identify the number of clusters and initial cluster centres for datasets with large cluster numbers. The synthetic datasets can be downloaded from BaiduPan1
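The NMI metric used for the comparisons can be computed from scratch. The sketch below uses the arithmetic-mean normalisation, NMI = 2*I(U;V) / (H(U) + H(V)); this is one common convention, and the paper's exact normalisation variant is not stated in this excerpt.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalised mutual information between two labelings:
    NMI = 2 * I(U;V) / (H(U) + H(V)), natural logarithms.
    Invariant to relabelling; 1.0 for identical partitions,
    0.0 for independent ones."""
    u, v = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(u)
    cu, cv = np.unique(u), np.unique(v)
    # Contingency table of joint label counts, then joint probabilities
    C = np.array([[np.sum((u == a) & (v == b)) for b in cv] for a in cu])
    P = C / n
    pu, pv = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / np.outer(pu, pv)[nz]))
    hu = -np.sum(pu[pu > 0] * np.log(pu[pu > 0]))
    hv = -np.sum(pv[pv > 0] * np.log(pv[pv > 0]))
    return 2 * mi / (hu + hv) if hu + hv > 0 else 1.0

score_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, relabelled
score_indep = nmi([0, 1, 0, 1], [0, 0, 1, 1])  # independent partitions
```

Because NMI compares partitions rather than label values, it is well suited to evaluating clusterings against ground-truth classes, as done in the experiments here.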

Conclusion and future work

In this study, we used the density peaks mechanism to improve I-nice clustering algorithms and proposed a density-peaks-based I-nice (I-niceDP) clustering algorithm. I-niceDP used density peaks to determine the number of clusters and cluster centres in the components of the GMM rather than the k-nearest-neighbours method. The comparative results with I-niceSO and I-niceMO demonstrated the feasibility and effectiveness of the proposed method. Future work will focus in two directions. First, we

CRediT authorship contribution statement

Yulin He: Conceptualization, Writing - original draft. Yingyan Wu: Writing - review & editing. Honglian Qin: Methodology, Formal analysis. Joshua Zhexue Huang: Supervision. Yi Jin: Resources, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers, whose meticulous review and valuable suggestions helped us to significantly improve this paper after four rounds of review (R0 on 20 August, 2019, R1 on 2 June, 2020, R2 on 26 September, 2020, and R3 on 29 September, 2020). This study was supported by the National Natural Science Foundation of China (61972261), Open Foundation of Key Laboratory of Impression Evidence Examination and Identification Technology (National Police

References (29)

  • Q.B. Hu et al., Clustering embedded approaches for efficient information network inference, Data Science and Engineering (2016)
  • E. Keogh et al., Towards parameter-free data mining
  • M.M.B. Ismail et al., Unsupervised clustering and feature weighting based on generalized Dirichlet mixture modeling, Information Sciences (2014)
  • J. Jiang et al., A new kernel density estimator based on the minimum entropy of data set, Information Sciences (2019)