Improved I-nice clustering algorithm based on density peaks mechanism
Introduction
Currently, parameter-free clustering algorithms are one of the research hotspots in the field of unsupervised learning [6]. Their main advantage is that they do not require input parameters (e.g. the number of clusters and the initial cluster centres) to be specified by the researcher before the algorithm is run on a given dataset. In real applications, the number of clusters and the initial cluster centres are generally unknown, and an inappropriate choice of these input parameters leads to unsatisfactory clustering results, particularly for datasets with many clusters. Although some niche-targeting methods [5], [15], [22], [25], [26] exist to optimise the number of clusters and the initial cluster centres, they suffer from either unstable clustering results or high computational complexity. Therefore, it is important for academia and industry to develop a parameter-free clustering algorithm that can automatically and correctly identify these input parameters with acceptable computational complexity.
I-nice [12] is one such parameter-free clustering algorithm. By imitating human observers who count the peaks of mountains from observation points, I-nice automatically identifies the number of clusters and selects the initial cluster centres for a given dataset. I-nice has two versions: the single-observation-point version (I-niceSO) and the multiple-observation-points version (I-niceMO). I-niceSO first determines the number of clusters and then identifies the cluster centres, whereas I-niceMO first identifies the cluster centres and then determines the number of clusters. I-nice uses gamma mixture models (GMMs) [7], [27] to represent the distributions of distances between the observation points and the original data points, and the k-nearest-neighbours method [3], [19] to determine the high-density areas. The number of gamma components gives the number of clusters, and the centres of the high-density areas serve as the cluster centres. The experimental results in [12] indicated that the I-nice algorithms significantly outperformed the state-of-the-art elbow and silhouette methods in finding the correct number of clusters for both synthetic and real-world datasets.
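As a rough illustration of this modelling step, the distances from a single observation point can be fitted with a gamma mixture. The sketch below uses synthetic data and a simple approximate EM with a method-of-moments M-step; this is our own simplification for illustration, not the exact fitting procedure of [12], and the fixed component count m = 2 is an assumption (in I-nice, m itself is selected by comparing candidate mixtures).

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

rng = np.random.default_rng(0)

# Two well-separated 2-D clusters and one observation point (toy data).
X = np.vstack([rng.normal(0, 0.5, (200, 2)),
               rng.normal(5, 0.5, (200, 2))])
p = np.array([-3.0, -3.0])              # observation point
d = np.linalg.norm(X - p, axis=1)       # distance distribution to model

def fit_gamma_mixture(d, m, n_iter=200):
    """Approximate EM for an m-component gamma mixture over distances.
    The M-step uses weighted method-of-moments updates for shape/scale."""
    # Initialise components by splitting the sorted distances into m blocks.
    blocks = np.array_split(np.sort(d), m)
    shape = np.array([b.mean() ** 2 / b.var() for b in blocks])
    scale = np.array([b.var() / b.mean() for b in blocks])
    w = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each distance.
        pdf = np.stack([w[j] * gamma_dist.pdf(d, shape[j], scale=scale[j])
                        for j in range(m)], axis=1)
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: mixture weights and per-component weighted moments.
        w = r.mean(axis=0)
        mu = (r * d[:, None]).sum(axis=0) / r.sum(axis=0)
        var = (r * (d[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0)
        shape, scale = mu ** 2 / var, var / mu
    return w, shape, scale

w, shape, scale = fit_gamma_mixture(d, m=2)
```

Each fitted component corresponds to one cluster as seen from the observation point, and its mean (shape × scale) approximates the distance from the observation point to that cluster.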
Although I-nice achieves better clustering performance than the existing methods, it has two inherent limitations that must be addressed to enhance its clustering capability. The first is that I-niceSO is sensitive to the position of the observation point: an improperly placed observation point generates an inaccurate distribution of distances between the observation point and the original data points, which in turn causes an incorrect estimate of the cluster number. The second is that the number of nearest neighbours (i.e. k) affects the determination of high-density areas in I-niceMO, the evolved version of I-niceSO that uses multiple observation points rather than one in the clustering process. A value of k that is too large or too small blurs the distinction between high-density areas. In addition, I-niceMO uses a fixed value of k to determine the high-density areas for all gamma components.
To overcome the above-mentioned shortcomings of I-nice, we propose a density-peaks-based I-nice (I-niceDP) clustering algorithm, which improves I-nice using the density peaks mechanism. Inspired by density peak clustering algorithms [10], [11], [13], [16], [23], [28], [29], I-niceDP uses density peaks, rather than the k-nearest-neighbours method, to determine the number of clusters and the cluster centres in the GMM components. In I-niceDP, a cluster centre must have a high density value; furthermore, the minimal distance between a candidate cluster centre and the data points with higher densities must exceed a predefined threshold. The main advantage of I-niceDP is that it replaces the simple and inefficient neighbour count with a sound statistical technique, the kernel density estimation method [8], [14], when distinguishing the high-density areas. We performed a series of experiments to demonstrate the feasibility and effectiveness of I-niceDP. The comparisons with I-niceSO and I-niceMO indicate that I-niceDP identifies the number of clusters and the initial cluster centres more accurately for datasets with many clusters. In addition, I-niceDP yields better normalised mutual information (NMI) than seven other clustering algorithms.
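The centre-selection rule described above (high density plus a large distance to any higher-density point) can be sketched as follows. The data, the default KDE bandwidth, and the two thresholds are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
# Three well-separated toy clusters in 2-D.
X = np.vstack([rng.normal(c, 0.3, (150, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

# Density of every point via Gaussian kernel density estimation.
rho = gaussian_kde(X.T)(X.T)
D = cdist(X, X)

# delta_i: distance to the nearest point with a higher density;
# the global density peak gets the maximum pairwise distance instead.
delta = np.empty(len(X))
for i in range(len(X)):
    higher = rho > rho[i]
    delta[i] = D[i, higher].min() if higher.any() else D[i].max()

# Candidate centres: high density AND delta above a threshold
# (both cut-offs here are ad hoc choices for this toy example).
centres = np.where((delta > 1.0) & (rho > np.median(rho)))[0]
```

Only the density maximum of each cluster combines a high `rho` with a large `delta`, so the number of selected centres recovers the number of clusters without any k-nearest-neighbours counting.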
The remainder of this paper is organised as follows. In Section 2, we provide a brief review of the I-nice clustering algorithm. In Section 3, we describe the density-peaks-based I-nice clustering algorithm, that is, I-niceDP. In Section 4, we report the experimental comparisons that demonstrate the feasibility and effectiveness of I-niceDP. Finally, in Section 5, we list our conclusions and suggestions for future research.
Review of I-nice clustering algorithm
I-nice [12] is a k-means-type algorithm that can automatically identify the number of clusters for a given dataset and select the initial cluster centres from the high-density areas. The data space is treated as a terrain in which the clusters are hill peaks. First, the observation points, which play the role of human observers counting the hill peaks, are arbitrarily allocated in the original data space. Second, the distances between the observation points and original data points are
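A minimal sketch of these first two steps, with random observation points and a synthetic dataset assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))                  # toy dataset, n = 500 points

# Step 1: allocate q observation points arbitrarily inside the
# bounding box of the data space.
q = 3
lo, hi = X.min(axis=0), X.max(axis=0)
obs = rng.uniform(lo, hi, size=(q, X.shape[1]))

# Step 2: one distance distribution per observation point,
# i.e. q vectors of n distances (shape (n, q)).
dists = np.linalg.norm(X[:, None, :] - obs[None, :, :], axis=2)
```

Each column of `dists` is the one-dimensional distance distribution that is subsequently modelled with a gamma mixture.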
Density-peaks-based I-nice clustering algorithm
In this section, we first analyse the limitations of the I-niceSO and I-niceMO clustering algorithms and then present our improved density-peaks-based I-nice (I-niceDP) clustering algorithm.
Experimental results and analysis
In this section, we present a series of experiments to validate the feasibility and effectiveness of the proposed density-peaks-based I-niceDP clustering algorithm on eight synthetic datasets and eight benchmark datasets (from the UCI [9] and KEEL [21] repositories). We used the synthetic datasets to test the capability of I-niceDP to identify the number of clusters and initial cluster centres for datasets with many clusters. The synthetic datasets can be downloaded from BaiduPan
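For reference, the NMI score used in these comparisons can be computed as below. This is a generic NumPy implementation with arithmetic-mean normalisation, not the authors' code, and the normalisation variant actually used in the paper is an assumption here:

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalised mutual information, NMI = I(U;V) / ((H(U) + H(V)) / 2)."""
    u, v = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(u)
    cu, cv = np.unique(u), np.unique(v)
    # Contingency table of the two partitions.
    C = np.array([[np.sum((u == a) & (v == b)) for b in cv] for a in cu],
                 dtype=float)
    pu, pv = C.sum(axis=1) / n, C.sum(axis=0) / n
    P = C / n
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / np.outer(pu, pv)[nz]))
    hu = -np.sum(pu * np.log(pu))
    hv = -np.sum(pv * np.log(pv))
    return mi / ((hu + hv) / 2)

# Identical partitions (up to relabelling) score 1; independent ones score 0.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 3))   # → 1.0
```

NMI is invariant to label permutations, which makes it suitable for comparing clusterings whose cluster identifiers carry no meaning.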
Conclusion and future work
In this study, we used the density peaks mechanism to improve the I-nice clustering algorithms and proposed a density-peaks-based I-nice (I-niceDP) clustering algorithm. I-niceDP uses density peaks, rather than the k-nearest-neighbours method, to determine the number of clusters and the cluster centres in the components of the GMM. The comparisons with I-niceSO and I-niceMO demonstrated the feasibility and effectiveness of the proposed method. Future work will focus on two directions. First, we
CRediT authorship contribution statement
Yulin He: Conceptualization, Writing - original draft. Yingyan Wu: Writing - review & editing. Honglian Qin: Methodology, Formal analysis. Joshua Zhexue Huang: Supervision. Yi Jin: Resources, Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the editors and anonymous reviewers, whose meticulous review and valuable suggestions helped us to significantly improve this paper after four rounds of review (R0 on 20 August, 2019, R1 on 2 June, 2020, R2 on 26 September, 2020, and R3 on 29 September, 2020). This study was supported by the National Natural Science Foundation of China (61972261), Open Foundation of Key Laboratory of Impression Evidence Examination and Identification Technology (National Police
References (29)
- et al., Shared-nearest-neighbor-based clustering by fast search and find of density peaks, Information Sciences (2018)
- et al., I-nice: A new approach for identifying the number of clusters and initial cluster centres, Information Sciences (2018)
- et al., REDPC: A residual error-based density peak clustering algorithm, Neurocomputing (2019)
- Gamma mixture models for target recognition, Pattern Recognition (2000)
- et al., Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbours, Information Sciences (2016)
- et al., DenPEHC: Density peak based efficient hierarchical clustering, Information Sciences (2016)
- H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Selected Papers of Hirotugu...
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (2006)
- et al., A branch and bound algorithm for computing k-nearest neighbors, IEEE Transactions on Computers (1975)
- et al., Novel electricity pattern identification system based on improved I-nice algorithm, accepted by Computers & Industrial Engineering (2020)
- Clustering embedded approaches for efficient information network inference, Data Science and Engineering
- Towards parameter-free data mining
- Unsupervised clustering and feature weighting based on generalized Dirichlet mixture modeling, Information Sciences
- A new kernel density estimator based on the minimum entropy of data set, Information Sciences