An adaptive kernelized rank-order distance for clustering non-spherical data with high noise

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

Clustering is a fundamental research topic in unsupervised learning, and the similarity measure is a key factor in clustering. However, existing similarity measures still struggle to cluster non-spherical data with high noise levels. Rank-order distance was proposed to capture the structure of non-spherical data by sharing the neighboring information of samples, but it does not tolerate high noise well. To address this issue, we propose KROD, a new similarity measure that combines rank-order distance with a Gaussian kernel. By reducing the noise in the neighboring information of samples, KROD makes rank-order distance tolerant of high noise, so that the structures of non-spherical data with high noise levels can be captured well. KROD then strengthens these captured structures with the Gaussian kernel, so that samples in the same cluster are closer to each other and are easier to cluster correctly. Experiments show that KROD can effectively improve existing methods for discovering non-spherical clusters with high noise levels. The source code can be downloaded from https://github.com/grcai.
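
The two ingredients named in the abstract can be illustrated with a short sketch. The abstract does not give the KROD formula, so the code below is not the authors' method. It computes the classical symmetric rank-order distance from neighbor rankings and then, purely as an assumption for illustration, scales it by the inverse of a Gaussian kernel so that nearby samples are pulled together and distant ones pushed apart. The function names, the bandwidth parameter `sigma`, the combination rule, and the toy data are all hypothetical, and the noise-reduction step described in the abstract is not modelled.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def rank_order_distance(sq_dists):
    """Classical symmetric rank-order distance from a distance matrix."""
    n = sq_dists.shape[0]
    d = sq_dists.copy()
    np.fill_diagonal(d, -1.0)  # force every sample to rank first in its own list
    # order[a, i] is the i-th nearest sample to a (order[a, 0] == a)
    order = np.argsort(d, axis=1, kind="stable")
    # rank[a, b] is the position of b in the neighbor list of a
    rank = np.empty_like(order)
    rows = np.arange(n)[:, None]
    rank[rows, order] = np.arange(n)[None, :]

    rod = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            # sum of b's ranks for a's neighbors up to and including b
            d_ab = rank[b, order[a, : rank[a, b] + 1]].sum()
            # sum of a's ranks for b's neighbors up to and including a
            d_ba = rank[a, order[b, : rank[b, a] + 1]].sum()
            rod[a, b] = rod[b, a] = (d_ab + d_ba) / min(rank[a, b], rank[b, a])
    return rod


def kernelized_rod_sketch(sq_dists, sigma=1.0):
    """Hypothetical combination (an assumption, not the paper's formula):
    rank-order distance divided by a Gaussian kernel value, written as a
    multiplication by exp(+d^2 / (2 sigma^2)) to avoid dividing by tiny numbers.
    """
    rod = rank_order_distance(sq_dists)
    return rod * np.exp(sq_dists / (2.0 * sigma ** 2))


# Toy data: two small, well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
sq = pairwise_sq_dists(X)
rod = rank_order_distance(sq)
krod = kernelized_rod_sketch(sq, sigma=5.0)
```

On this toy data the rank-order distance between two points in the same group is smaller than between points in different groups, and the kernel scaling widens that gap. A real use would pass the resulting matrix to a clustering method that accepts precomputed distances.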


Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant No. 61772120.

Author information

Corresponding author

Correspondence to William Zhu.

About this article

Cite this article

Huang, T., Wang, S. & Zhu, W. An adaptive kernelized rank-order distance for clustering non-spherical data with high noise. Int. J. Mach. Learn. & Cyber. 11, 1735–1747 (2020). https://doi.org/10.1007/s13042-020-01068-9

  • DOI: https://doi.org/10.1007/s13042-020-01068-9
