Outlier detection using an ensemble of clustering algorithms

Multimedia Tools and Applications

Abstract

Outlier detection is an important research area in machine learning and data science. The presence of outliers in a dataset limits its usefulness in real-life scenarios. Because the challenges vary from one dataset to another, researchers seek a general method that is useful across different datasets. In this paper, we propose an unsupervised outlier detection technique that uses an ensemble of three clustering algorithms: K-means, K-means++ and Fuzzy C-means. We also propose a unique way to deal with clustered outliers. The outcomes of the three clustering algorithms are combined so that their complementary information is accumulated. To combine the decisions of the hard and soft clustering algorithms, we propose a novel probability-based technique that assigns a membership value to each data point in the case of a hard clustering algorithm. Three cluster validity indices, which measure the goodness of a clustering, serve as our evaluation metrics. These indices improve significantly after the outliers are removed, which suggests that removing the outliers yields more compact clusters. The method is evaluated on eight datasets, three of which are comparatively large. Source code of this work is available at: https://github.com/biswarup9/Outlier-Detection-Using-an-Ensemble-of-Clustering-Algorithms-.
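The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a hard clustering can be given soft memberships by applying the standard Fuzzy C-means inverse-distance rule to its centroids, and that the three algorithms can be combined by averaging each point's highest membership. The function names, the fuzzifier `m`, the threshold of 0.75 and the toy data are all hypothetical.

```python
import numpy as np

def soft_memberships(X, centers, m=2.0):
    """Inverse-distance memberships (rows sum to 1), as in the FCM update rule.
    Assumption: this stands in for the paper's probability-based technique."""
    d = np.linalg.norm(X[:, None] - centers, axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def kmeans_centers(X, k, rng, plus_plus=False, n_iter=50):
    """Lloyd's algorithm; plus_plus=True switches to k-means++ seeding."""
    n = len(X)
    if plus_plus:
        seeds = [X[rng.integers(n)]]
        for _ in range(k - 1):
            # Sample the next seed with probability proportional to the
            # squared distance to the nearest seed chosen so far.
            d2 = ((X[:, None] - np.array(seeds)) ** 2).sum(-1).min(axis=1)
            seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
        centers = np.array(seeds)
    else:
        centers = X[rng.choice(n, size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center as is
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def fcm_centers(X, k, rng, m=2.0, n_iter=100):
    """Plain Fuzzy C-means, started from a random membership matrix."""
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        U = soft_memberships(X, centers, m)
    return centers

def ensemble_scores(X, k, rng):
    """Average each point's highest membership over the three algorithms.
    Using the maximum avoids having to match cluster labels across runs."""
    runs = [
        kmeans_centers(X, k, rng),
        kmeans_centers(X, k, rng, plus_plus=True),
        fcm_centers(X, k, rng),
    ]
    return np.mean([soft_memberships(X, c).max(axis=1) for c in runs], axis=0)

# Toy data (hypothetical): two tight blobs plus three points between them.
rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
blob_b = rng.normal([10.0, 0.0], 0.5, size=(100, 2))
stray = np.array([[5.0, 0.0], [5.0, 3.0], [5.0, -3.0]])
X = np.vstack([blob_a, blob_b, stray])

scores = ensemble_scores(X, k=2, rng=rng)
flagged = np.flatnonzero(scores < 0.75)  # low membership everywhere
print(flagged)
```

Normalized memberships of this kind measure how ambiguous a point is between clusters, so a point that is far from every cluster but clearly nearer to one of them would still score high. The sketch also does not attempt the paper's treatment of clustered outliers or its cluster validity indices.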

Author information

Corresponding author

Correspondence to Ram Sarkar.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ray, B., Ghosh, S., Ahmed, S. et al. Outlier detection using an ensemble of clustering algorithms. Multimed Tools Appl 81, 2681–2709 (2022). https://doi.org/10.1007/s11042-021-11671-9

  • DOI: https://doi.org/10.1007/s11042-021-11671-9
