Skip to main content
Log in

A hierarchical clustering algorithm based on noise removal

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Abstract

Noise is irrelevant or meaningless data and hinders most types of data analysis. The existing clustering algorithms seldom take the noise points into consideration and cannot detect arbitrary-shaped clusters. This paper presents a Hierarchical Clustering algorithm Based on Noise Removal (HCBNR). It is robust against noise points and good at discovering clusters with arbitrary shapes. In this work, natural neighbor-based density is applied to remove noise points in a data set firstly. Then we construct a saturated neighbor graph on the rest points, and a novel modularity-based graph partitioning algorithm is used to divide the graph into small clusters. Finally, the small clusters are repeatedly merged according to a novel similarity metric between clusters until the desired cluster number is obtained. The experimental results on synthetic data sets and real data sets show that our method can accurately identify noise points and obtain better clustering results than existing clustering algorithms when discovering arbitrary-shaped clusters.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

References

  1. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) Lof: identifying density-based local outliers. Acm Sigmod Record 29(2):93–104

    Article  Google Scholar 

  2. Chen WY, Song Y, Bai H, Lin CJ, Chang EY (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586

    Article  Google Scholar 

  3. Cheng D, Zhu Q, Huang J, Yang L, Wu Q (2017) Natural neighbor-based clustering algorithm with local representatives. Knowl Based Syst 123C:238–253

    Article  Google Scholar 

  4. Ester M, Kriegel HP, Xu X (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In: International Conference on Knowledge Discovery and Data Mining, pp 226–231

  5. Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976

    Article  MathSciNet  MATH  Google Scholar 

  6. Guha S, Rastogi R, Shim K (2000) Rock: a robust clustering algorithm for categorical attributes. Inf Syst 25(5):345–366

    Article  Google Scholar 

  7. Guha S, Rastogi R, Shim K (2001) Cure: an efficient clustering algorithm for large databases. Inf Syst 26(1):35–58

    Article  MATH  Google Scholar 

  8. Ha J, Seok S, Lee JS (2014) Robust outlier detection using the instability factor. Knowl Based Syst 63(2):15–23

    Article  Google Scholar 

  9. Huang J, Zhu Q, Yang L, Cheng D, Wu Q (2017) Qcc: a novel clustering algorithm based on quasi-cluster centers. Mach Learn 106(3):337–357

    Article  MathSciNet  MATH  Google Scholar 

  10. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. Acm Comput Surv 31(3):264–323

    Article  Google Scholar 

  11. Karypis G, Aggarwal R, Kumar V, Shekhar S (2002) Multilevel hypergraph partitioning: applications in vlsi domain. IEEE Trans Very Large Scale Integr Syst 7(1):69–79

    Article  Google Scholar 

  12. Karypis G, Han EH, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer Society Press

  13. Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis. John Wiley, Hoboken

    MATH  Google Scholar 

  14. King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101

    Article  Google Scholar 

  15. Lv Y, Ma T, Tang M, Cao J, Tian Y, Al-Rodhaan M (2015) An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 171C:9–22

    Google Scholar 

  16. Macqueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, pp 281–297

  17. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E Stat Nonlinear Soft Matter Phys . https://doi.org/10.1103/PhysRevE.69.026113

  18. Newman MEJ (2004) Analysis of weighted networks. Phys Rev E Stat Nonlinear Soft Matter Phys 70(5):1–9

    Google Scholar 

  19. Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582

    Article  Google Scholar 

  20. Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):1492

    Article  Google Scholar 

  21. Sneath PH, Sokal RR (1962) Numerical taxonomy. Nature 193:855–860

    Article  Google Scholar 

  22. Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell 24(9):1273–1280

    Article  Google Scholar 

  23. Wang G, Song Q (2016) Automatic clustering via outward statistical testing on density metrics. IEEE Trans Knowl Data Eng 28(8):1971–1985

    Article  Google Scholar 

  24. Wang X, Wang XL, Chen C, Wilkes DM (2013) Enhancing minimum spanning tree-based clustering by removing density-based outliers. Digital Signal Process 23(5):1523–1538

    Article  MathSciNet  Google Scholar 

  25. Xie JY, Gao HC, Xie WX, Liu XH, Grant PW (2016) Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Inf Sci 354:19–40

    Article  Google Scholar 

  26. Xiong H, Pandey G, Steinbach M, Kumar V (2006) Enhancing data analysis with noise removal. IEEE Trans Knowl Data Eng 18(3):304–319

    Article  Google Scholar 

  27. Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: ACM SIGMOD International Conference on Management of Data, pp 103–114

  28. Zhu Q, Feng J, Huang J (2016) Natural neighbor: a self-adaptive neighborhood method without parameter k. Pattern Recognit Lett 80:30–36

    Article  Google Scholar 

Download references

Acknowledgements

This work is supported by National Natural Science Foundation of China, No. 61702060 and No. 61502060, Science and Technology Project of Chongqing Municipal Education Commission, No. KJ15012014 and Project of Chongqing Education Commission, No. KJZH17104.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Qingsheng Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Cheng, D., Zhu, Q., Huang, J. et al. A hierarchical clustering algorithm based on noise removal. Int. J. Mach. Learn. & Cyber. 10, 1591–1602 (2019). https://doi.org/10.1007/s13042-018-0836-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13042-018-0836-3

Keywords

Navigation