Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings

Iam-On, Natthakan

doi:10.1007/s13042-019-00989-4

Original Article
Published: 29 July 2019

Volume 11, pages 491–509, (2020)

International Journal of Machine Learning and Cybernetics

Natthakan Iam-On¹

545 Accesses
14 Citations

Abstract

In general practice, noise is almost always perceived negatively. In data analytics specifically, most existing techniques assume noise-free data; without data pre-processing, it is hard for these models to discover reliable patterns. This is also true for k-means, one of the best-known algorithms for cluster analysis. Several works in the literature suggest that the ensemble approach can deliver accurate results from multiple clusterings of data with noise completely at random. Given this motivation, the paper presents a study of using different consensus clustering techniques to analyze noisy data, with k-means used to generate the base clusterings. The empirical investigation reveals that the ensemble approach can be robust to low levels of noise, and that some ensembles even improve over the noise-free cases. This finding is in line with recently published work that underlines the benefit of small amounts of noise to centroid-based clustering methods. In addition, the outcome of this research provides a guideline for analyzing a new data collection of uncertain quality.
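
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the noise model (each attribute value replaced, independently and with a fixed probability, by a uniform draw from that attribute's observed range), the co-association consensus with a 0.5 threshold, and all function names are assumptions chosen for illustration. The consensus techniques actually evaluated in the paper may differ.

```python
import random

def add_attribute_noise(data, rate, rng):
    """Assumed noise model: each value is replaced, independently with
    probability `rate`, by a uniform draw from its attribute's observed range."""
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[rng.uniform(lo[j], hi[j]) if rng.random() < rate else v
             for j, v in enumerate(row)] for row in data]

def kmeans(data, k, rng, iters=20):
    """Plain Lloyd's k-means with random initial centroids; returns labels."""
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in data]
        for c in range(k):
            members = [p for p, l in zip(data, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def co_association(labelings):
    """Entry (i, j) is the fraction of base clusterings that put i and j together."""
    n, m = len(labelings[0]), len(labelings)
    return [[sum(l[i] == l[j] for l in labelings) / m for j in range(n)]
            for i in range(n)]

def consensus(coassoc, threshold=0.5):
    """Illustrative consensus: connected components of the graph that links
    pairs whose co-association exceeds `threshold`."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

rng = random.Random(0)
# Two synthetic Gaussian blobs of 15 points each (toy data, not from the paper).
clean = [[rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
         for cx, cy in [(0, 0), (5, 5)] for _ in range(15)]
noisy = add_attribute_noise(clean, rate=0.05, rng=rng)
base = [kmeans(noisy, k=2, rng=rng) for _ in range(10)]  # ensemble of 10 runs
coassoc = co_association(base)
final = consensus(coassoc)
```

Varying `rate` and comparing `final` against the known blob membership is one way to probe how such an ensemble responds to increasing noise. The behaviour of this toy setup should not be read as a reproduction of the paper's results.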

Acknowledgements

This work is funded by IAPP1-100077 (Newton RAE-TRF): Anomaly Traffic Identification through Artificial Intelligence, Cyber Security and Big Data Analytics Technologies. It is also partly supported by Mae Fah Luang University.

Author information

Authors and Affiliations

IQD-IT Research Group, School of Information Technology, Mae Fah Luang University, Chiang Rai, 57100, Thailand
Natthakan Iam-On

Corresponding author

Correspondence to Natthakan Iam-On.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Iam-On, N. Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings. Int. J. Mach. Learn. & Cyber. 11, 491–509 (2020). https://doi.org/10.1007/s13042-019-00989-4

Received: 26 May 2018
Accepted: 22 July 2019
Published: 29 July 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s13042-019-00989-4
