Abstract
Clustering is a fundamental research topic in unsupervised learning, and the similarity measure is a key factor in clustering. However, existing similarity measures still struggle to cluster non-spherical data with high noise levels. Rank-order distance was proposed to capture the structures of non-spherical data by sharing the neighboring information of the samples, but it does not tolerate high noise well. To address this issue, we propose KROD, a new similarity measure that combines rank-order distance with a Gaussian kernel. By reducing the noise in the neighboring information of the samples, KROD makes rank-order distance tolerant of high noise, so that the structures of non-spherical data with high noise levels can be captured well. KROD then strengthens these captured structures with the Gaussian kernel, so that samples in the same cluster are closer to each other and can be more easily clustered correctly. Experiments show that KROD can effectively improve existing methods for discovering non-spherical clusters with high noise levels. The source code can be downloaded from https://github.com/grcai.
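For readers unfamiliar with the baseline, the following is a minimal sketch of plain rank-order distance, the measure that KROD builds on. It is not KROD itself: the abstract does not give the formula that combines rank-order distance with the Gaussian kernel, so that step is deliberately left out. The sketch assumes the commonly cited definition (an asymmetric sum of neighbor ranks, symmetrized and normalized by the smaller of the two mutual ranks), Euclidean base distances, and distinct sample points; the function name `rank_order_distance` is hypothetical.

```python
import numpy as np


def rank_order_distance(X):
    """Pairwise rank-order distance for the rows of X (assumed distinct points).

    Assumed definition (not taken from the abstract):
      O_a(b)  = rank of b in a's neighbor list (a itself has rank 0)
      d(a, b) = sum of O_b(f_a(i)) for i = 0 .. O_a(b),
                where f_a(i) is the i-th nearest neighbor of a
      D(a, b) = (d(a, b) + d(b, a)) / min(O_a(b), O_b(a))
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Euclidean base distances between all pairs of samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # order[a, i] is the i-th nearest sample to a (order[a, 0] == a).
    order = np.argsort(dist, axis=1, kind="stable")

    # rank[a, b] is the position of b in a's neighbor list, i.e. O_a(b).
    rank = np.empty_like(order)
    rank[np.arange(n)[:, None], order] = np.arange(n)

    D = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            # Neighbors of a up to and including b, ranked from b's side.
            d_ab = rank[b, order[a, : rank[a, b] + 1]].sum()
            # Neighbors of b up to and including a, ranked from a's side.
            d_ba = rank[a, order[b, : rank[b, a] + 1]].sum()
            D[a, b] = D[b, a] = (d_ab + d_ba) / min(rank[a, b], rank[b, a])
    return D


if __name__ == "__main__":
    # Two nearby points and one far point on a line.
    points = np.array([[0.0], [1.0], [10.0]])
    print(rank_order_distance(points))
```

Because the measure depends on shared neighbor rankings rather than raw distances, two samples whose neighbor lists agree end up close even when the cluster is elongated. Noisy samples disturb those neighbor lists, which is the weakness the abstract says KROD targets.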
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61772120.
About this article
Cite this article
Huang, T., Wang, S. & Zhu, W. An adaptive kernelized rank-order distance for clustering non-spherical data with high noise. Int. J. Mach. Learn. & Cyber. 11, 1735–1747 (2020). https://doi.org/10.1007/s13042-020-01068-9
DOI: https://doi.org/10.1007/s13042-020-01068-9