Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

In general practice, noise is perceived as inevitably negative. In data analytics specifically, most existing techniques assume noise-free data, and without data pre-processing it is hard for such models to discover reliable patterns. This also holds for k-means, one of the best-known algorithms for cluster analysis. Several works in the literature suggest that the ensemble approach can deliver accurate results from multiple clusterings of data containing noise completely at random. Motivated by this, the paper studies the use of different consensus clustering techniques to analyze noisy data, with k-means used to generate the base clusterings. The empirical investigation reveals that the ensemble approach can be robust to low levels of noise, and that in some cases the results even improve on the noise-free cases. This finding is in line with recently published work that underlines the benefit of small amounts of noise to centroid-based clustering methods. In addition, the outcome of this research provides a guideline for analyzing a new data collection of uncertain quality.
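The pipeline described in the abstract (attribute noise injected completely at random, several k-means base clusterings, then a consensus step) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's method: the abstract does not define the noise model, so `add_ncar_noise` replaces each attribute value independently with probability `rate` by a uniform draw from that attribute's observed range; the consensus function is a simple threshold cut on a co-association matrix, standing in for whichever consensus techniques the paper actually compares; the toy data, function names, and parameter values are all hypothetical.

```python
import random

def add_ncar_noise(data, rate, rng):
    """Assumed noise model: each attribute value is replaced, independently with
    probability `rate`, by a uniform draw from that attribute's [min, max] range."""
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [rng.uniform(lo[j], hi[j]) if rng.random() < rate else v
         for j, v in enumerate(row)]
        for row in data
    ]

def kmeans(data, k, rng, iters=20):
    """Plain Lloyd's k-means with randomly sampled initial centroids; returns labels."""
    centroids = rng.sample(data, k)
    labels = [0] * len(data)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
            for x in data
        ]
        # Update step: mean of the members; an empty cluster keeps its old centroid.
        for c in range(k):
            members = [x for x, l in zip(data, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def co_association(partitions, n):
    """Fraction of base clusterings that place each pair of points together."""
    m = len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]

def consensus(co, threshold=0.5):
    """Link points whose co-association exceeds `threshold` and
    label the resulting connected components."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

rng = random.Random(0)
# Toy data (not from the paper): two well-separated groups of 20 points each.
clean = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)] + \
        [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(20)]
noisy = add_ncar_noise(clean, rate=0.05, rng=rng)
partitions = [kmeans(noisy, k=2, rng=rng) for _ in range(10)]
co = co_association(partitions, len(noisy))
final = consensus(co)
```

Varying `rate` and comparing `final` against the known grouping of `clean` gives a rough feel for the kind of robustness question the study asks, although the threshold-based consensus used here is only one simple choice among many.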

Acknowledgements

This work is funded by IAPP1-100077 (Newton RAE-TRF): Anomaly Traffic Identification through Artificial Intelligence, Cyber Security and Big Data Analytics Technologies. It is also partly supported by Mae Fah Luang University.

Author information

Corresponding author

Correspondence to Natthakan Iam-On.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Iam-On, N. Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings. Int. J. Mach. Learn. & Cyber. 11, 491–509 (2020). https://doi.org/10.1007/s13042-019-00989-4

  • DOI: https://doi.org/10.1007/s13042-019-00989-4
