Abstract
One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Existing methods mostly follow the philosophy of cluster validation and therefore rest on assumptions about the data distribution, which prevents their application to complex real-world data such as large-scale images and high-dimensional data. To address this, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we cast the task as a dynamic comparison of the positional behavior of cluster centers, which requires neither complete clustering results nor a complex validity index as earlier methods do. Bipartite graph theory is then employed to model this process efficiently. Additionally, we find that different samples have different confidence levels and therefore actively remove low-confidence ones, which, to our knowledge, has not previously been considered in cluster number determination. CNMBI is robust and allows more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparisons with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.
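The abstract gives no algorithmic detail, so the following is only a minimal illustrative sketch of the two ideas it names, not the CNMBI procedure itself. It assumes that comparing cluster centers can be pictured as a minimum-cost one-to-one matching in a bipartite graph between two sets of centers, and that a sample's confidence can be approximated by how much closer it is to its nearest center than to the second nearest. The names `match_centers`, `filter_boundary`, and `max_ratio` are hypothetical and do not come from the paper.

```python
from itertools import permutations
from math import dist


def match_centers(centers_a, centers_b):
    """Minimum-cost one-to-one matching between two equally sized center sets.

    Assumption: brute force over permutations stands in for a proper
    bipartite matching algorithm; it is only practical for a few centers.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(centers_a))):
        cost = sum(dist(centers_a[i], centers_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


def filter_boundary(points, centers, max_ratio=0.8):
    """Keep points whose nearest center is clearly closer than the second nearest.

    Assumption: the ratio d1 / d2 is a stand-in confidence score (needs at
    least two centers); values near 1 mark low-confidence boundary samples.
    """
    kept = []
    for p in points:
        d1, d2 = sorted(dist(p, c) for c in centers)[:2]
        if d2 > 0 and d1 / d2 <= max_ratio:
            kept.append(p)
    return kept


# Two center sets, e.g. from two runs, listed in different orders.
centers_a = [(0.0, 0.0), (10.0, 0.0)]
centers_b = [(10.0, 1.0), (0.0, 1.0)]
pairs, cost = match_centers(centers_a, centers_b)

# The last two points sit between the centers and are filtered out.
points = [(0.5, 0.0), (9.0, 0.5), (5.0, 0.0), (4.8, 0.1)]
kept = filter_boundary(points, centers_a)
```

Here `pairs` links each center in `centers_a` to its closest counterpart in `centers_b`, and `kept` retains only the two points that lie clearly inside one cluster. How CNMBI actually builds its bipartite graph and scores confidence is described in the full paper, not in this abstract.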
Notes
1. Data points, objects, and samples are used interchangeably in this paper.
Acknowledgements
This work was supported in part by the Shenzhen Fundamental Research Fund (JCYJ20210324132212030) and the Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies (2022B1212010005).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, R., Zheng, H., Wang, H. (2023). CNMBI: Determining the Number of Clusters Using Center Pairwise Matching and Boundary Filtering. In: Yang, X., et al. Advanced Data Mining and Applications. ADMA 2023. Lecture Notes in Computer Science, vol 14180. Springer, Cham. https://doi.org/10.1007/978-3-031-46677-9_19
DOI: https://doi.org/10.1007/978-3-031-46677-9_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46676-2
Online ISBN: 978-3-031-46677-9
eBook Packages: Computer Science, Computer Science (R0)