Abstract
The k-nearest neighbors (k-NN) graph is essential for various graph mining tasks. In this work, we study density-based clustering on the k-NN graph and propose FastDEC, a clustering framework based on fast dominance estimation. The nearest density higher (NDH) relation and the dominance-component (DC), and more specifically their integration with the k-NN graph, are formally defined and theoretically analyzed. FastDEC includes two extensions for different clustering scenarios: FastDEC\(_D\) for partitioning data into clusters of arbitrary shape, and FastDEC\(_K\) for K-way partitioning. First, FastDEC\(_D\) segments the given k-NN graph and returns the detected set of DCs as its result. FastDEC\(_K\) then generates the K-way partition by selecting the top-K DCs in terms of inter-dominance (ID) as seeds and assigning the remaining DCs to their nearest dominators.
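To make the segmentation step concrete, the following is a minimal sketch of one plausible reading of the NDH relation on a k-NN graph. It is not the authors' implementation. The density estimate (inverse distance to the k-th neighbor), the brute-force neighbor search, and the names `knn_graph` and `ndh_components` are assumptions introduced here for illustration. Each point is linked to its nearest k-NN neighbor of strictly higher density; points with no such neighbor become roots, and every point is labeled by the root it reaches.

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force k-NN: return (indices, distances), each of shape (n, k)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    idx = np.argsort(d, axis=1)[:, :k]       # neighbors sorted by distance
    return idx, np.take_along_axis(d, idx, axis=1)

def ndh_components(X, k):
    """Label each point by the root of its nearest-density-higher chain."""
    idx, dist = knn_graph(X, k)
    # Assumed density estimate: inverse of the distance to the k-th neighbor.
    density = 1.0 / (dist[:, -1] + 1e-12)
    n = len(X)
    parent = np.arange(n)                    # roots point to themselves
    for i in range(n):
        for j in idx[i]:                     # nearest neighbor first
            if density[j] > density[i]:      # strict inequality rules out cycles
                parent[i] = j
                break
    labels = np.empty(n, dtype=int)
    for i in range(n):
        r = i
        while parent[r] != r:                # follow the chain up to its root
            r = parent[r]
        labels[i] = r
    return labels, parent, density
```

Calling `ndh_components(X, k=5)` on an `(n, d)` array returns one label per point. Because links never leave a point's k-NN list, well-separated groups cannot be merged at this stage. The inter-dominance ranking used by FastDEC\(_K\) is not specified in the abstract and is therefore not sketched.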
FastDEC can be viewed as a much faster, more robust, k-NN-based variant of the classical density-based clustering algorithm Density Peak Clustering (DPC). DPC estimates the significance of data points from density and geometric distance, while FastDEC uses the global rank of the dominator as an additional factor in the significance estimation. FastDEC has several critical characteristics: (1) excellent clustering performance; (2) ease of interpretation and implementation; (3) efficiency and robustness. Experiments on both artificial and real datasets demonstrate that FastDEC outperforms state-of-the-art density-based methods, including DPC.
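For reference, the two DPC factors mentioned above can be sketched as follows. This is a minimal illustration of the standard DPC quantities: a local density \(\rho\), the distance \(\delta\) to the nearest point of higher density, and their product \(\gamma = \rho\,\delta\) as the significance score. The Gaussian kernel with bandwidth `d_c` is one common choice and is an assumption here, as are the function name `dpc_scores` and the returned `dominator` array. The rank-based factor that FastDEC adds is not defined in the abstract and is not shown.

```python
import numpy as np

def dpc_scores(X, d_c):
    """Return (rho, delta, gamma, dominator) for the rows of X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Gaussian-kernel density; subtract 1.0 to drop each point's self term.
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0
    n = len(X)
    delta = np.empty(n)
    dominator = np.full(n, -1)               # -1 marks the global density peak
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])
        if higher.size == 0:
            delta[i] = d[i].max()            # convention for the densest point
        else:
            j = higher[np.argmin(d[i, higher])]
            delta[i] = d[i, j]               # distance to nearest denser point
            dominator[i] = j
    return rho, delta, rho * delta, dominator
```

Points with the largest `gamma` are the natural candidates for cluster centers. On two well-separated groups, the two highest scores fall one in each group.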
G. Yang and H. Lv—Equal Contribution.
Notes
- 1. FastDEC is released at https://github.com/gepingyang/FastDEC.
- 2. Density-Reachable (DR) in DBSCAN [15] is equivalent to a \(\tau\)-based flat kernel. For the sake of comparison, we use a k-NN-based one.
References
Amagata, D., Hara, T.: Fast density-peaks clustering: multicore-based parallelization approach. In: SIGMOD 2021: International Conference on Management of Data, Virtual Event, China, 20–25 Jun 2021, pp. 49–61. ACM (2021)
Angelino, C.V., Debreuve, E., Barlaud, M.: Image restoration using a kNN-variant of the mean-shift. In: 2008 15th IEEE International Conference on Image Processing (ICIP), pp. 573–576. IEEE (2008)
Cai, J., Wei, H., Yang, H., Zhao, X.: A novel clustering algorithm based on DPC and PSO. IEEE Access 8, 88200–88214 (2020)
Carreira-Perpiñán, M.Á., Wang, W.: The k-modes algorithm for clustering. arXiv preprint arXiv:1304.6478 (2013)
Chang, H., Yeung, D.: Robust path-based spectral clustering. Pattern Recognit. 41(1), 191–203 (2008)
Chaudhuri, K., Dasgupta, S.: Rates of convergence for the cluster tree. In: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems (NIPS), pp. 343–351. Curran Associates, Inc. (2010)
Chaudhuri, K., Dasgupta, S., Kpotufe, S., von Luxburg, U.: Consistent procedures for cluster tree estimation and pruning. IEEE Trans. Inf. Theory 60(12), 7900–7912 (2014)
Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 17(8), 790–799 (1995)
Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), pp. 537–546 (2008)
Davidson, I., Ravi, S.S.: Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Jorge, A.M., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 59–70. Springer, Heidelberg (2005). https://doi.org/10.1007/11564126_11
Dong, W., Charikar, M., Li, K.: Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th International Conference on World Wide Web (WWW), pp. 577–586. ACM (2011)
Du, M., Ding, S., Jia, H.: Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl. Based Syst. 99, 135–145 (2016)
Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Knowledge Discovery and Data Mining (KDD), pp. 226–231 (1996)
Fränti, P., Virmajoki, O.: Iterative shrinking method for clustering problems. Pattern Recognit. 39(5), 761–775 (2006)
Fu, L., Medico, E.: FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 8, 3 (2007)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS, Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases with noise. In: Knowledge Discovery and Data Mining (KDD), pp. 58–65 (1998)
Jiang, H., Jang, J., Kpotufe, S.: Quickshift++: Provably good initializations for sample-based mean shift. In: International Conference on Machine Learning (ICML), vol. 80, pp. 2299–2308. PMLR (2018)
Jiang, H., Kpotufe, S.: Modal-set estimation with an application to clustering. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 54, pp. 1197–1206. PMLR (2017)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Liu, R., Wang, H., Yu, X.: Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 450, 200–226 (2018)
Myhre, J.N., Mikalsen, K.Ø., Løkse, S., Jenssen, R.: Robust clustering using a kNN mode seeking ensemble. Pattern Recognit. 76, 491–505 (2018)
Rasool, Z., Zhou, R., Chen, L., Liu, C., Xu, J.: Index-based solutions for efficient density peak clustering (extended abstract). In: 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, 19–22 Apr 2021, pp. 2342–2343. IEEE (2021)
Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)
Sarfraz, M.S., Sharma, V., Stiefelhagen, R.: Efficient parameter-free clustering using first neighbor relations. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8934–8943 (2019)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Vedaldi, A., Soatto, S.: Quick shift and kernel methods for mode seeking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5305, pp. 705–718. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88693-8_52
Veenman, C.J., Reinders, M.J.T., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
Wang, W., Carreira-Perpiñán, M.Á.: The laplacian k-modes algorithm for clustering. arXiv preprint arXiv:1406.3895 (2014)
Xie, J., Gao, H., Xie, W., Liu, X., Grant, P.W.: Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Inf. Sci. 354, 19–40 (2016)
Yang, Y., et al.: GraphLSHC: towards large scale spectral hypergraph clustering. Inf. Sci. 544, 117–134 (2021)
Yang, Y., Gong, Z., Li, Q., U, L.H., Cai, R., Hao, Z.: A robust noise resistant algorithm for POI identification from flickr data. In: Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pp. 3294–3300. ijcai.org (2017)
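Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103–114. ACM Press, New York (1996)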
Zheng, X., Ren, C., Yang, Y., Gong, Z., Chen, X., Hao, Z.: QuickDSC: clustering by quick density subgraph estimation. Inf. Sci. 581, 403–427 (2021)
Acknowledgment
We thank the anonymous reviewers for their constructive comments and thoughtful suggestions. This work was supported in part by: National Key R&D Program of China (019YFB1600704, 2021ZD0111501), NSFC (61603101, 61876043, 61976052), NSF of Guangdong Province (2021A1515011941), State’s Key Project of Research and Development Plan (2019YFE0196400), NSF for Excellent Young Scholars (62122022), Guangzhou STIC (EF005/FST-GZG/2019/GSTIC), NSFC-Guangdong Joint Fund (U1501254), the Science and Technology Development Fund, Macau SAR (0068/2020/AGJ, 0045/2019/A1, SKL-IOTSC(UM)-2021-2023), and GDST (2020B1212030003).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, G., Lv, H., Yang, Y., Gong, Z., Chen, X., Hao, Z. (2023). FastDEC: Clustering by Fast Dominance Estimation. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science(), vol 13713. Springer, Cham. https://doi.org/10.1007/978-3-031-26387-3_9
DOI: https://doi.org/10.1007/978-3-031-26387-3_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26386-6
Online ISBN: 978-3-031-26387-3
eBook Packages: Computer Science (R0)