
An ensemble approach to outlier detection using some conventional clustering algorithms

Multimedia Tools and Applications

Abstract

Outlier detection is an important requirement in data mining and machine learning. When data mining and machine learning algorithms are applied to datasets that contain outliers, they can lead to erroneous conclusions about the data. Researchers have therefore worked on removing outliers from datasets so that meaningful information can be retrieved. In this paper, we take a cluster-based ensemble approach to outlier detection, the backbone of which is a set of conventional clustering algorithms. Keeping in mind the drawbacks of supervised and semi-supervised learning, we rely on unsupervised learning algorithms. Our ensemble uses three clustering algorithms, namely K-means, K-means++, and Fuzzy C-means. The model combines the results of the individual clustering algorithms by assigning each data point a probability of belonging to each cluster. Because we want to retain the flexibility of combining hard and soft clustering algorithms, we propose a technique for assigning a membership value to a data point under a hard clustering algorithm. From the probabilities assigned by the ensemble model, we then identify the outliers in the dataset. After removing these data points, we obtain better values of the cluster validity indices, which indicates that removing the outliers yields tighter, better-defined clusters. We use five cluster validity indices to measure the goodness of the clusters formed, and we evaluate the proposed model on eight widely used datasets, three of which are large. We observe a significant improvement in the cluster validity indices after applying our outlier detection algorithm. The experimental results support the empirical soundness of the proposed method.
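The abstract does not state the membership formula, the rule used to combine the clusterings, or the names of the five validity indices, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that (a) a hard clustering can be given soft memberships with the standard Fuzzy C-means membership formula applied to its centroids, (b) the ensemble score of a point is the average, over models, of its highest membership, and (c) the Dunn index stands in for a cluster validity index. The function names, the threshold of 0.6, and the example centroids are all hypothetical.

```python
import math

def fuzzy_membership(point, centroids, m=2.0):
    """FCM-style membership of one point to each centroid (values sum to 1).

    Assumption: the same formula is reused to turn the centroids of a hard
    clustering (e.g. K-means) into soft memberships.
    """
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on a centroid belongs fully to that centroid.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def ensemble_confidence(point, centroid_sets, m=2.0):
    """Average, over all models, of the point's highest cluster membership.

    Using the maximum avoids having to align cluster labels across models.
    """
    best = [max(fuzzy_membership(point, cs, m)) for cs in centroid_sets]
    return sum(best) / len(best)

def split_outliers(points, centroid_sets, threshold=0.6, m=2.0):
    """Return (inliers, outliers); a point is an outlier when no model
    assigns it confidently to any cluster (hypothetical threshold)."""
    inliers, outliers = [], []
    for p in points:
        score = ensemble_confidence(p, centroid_sets, m)
        (inliers if score >= threshold else outliers).append(p)
    return inliers, outliers

def dunn_index(points, centroids):
    """Dunn index: smallest between-cluster distance divided by the largest
    cluster diameter. Needs at least two non-empty clusters and at least one
    cluster with two distinct points."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    clusters = [c for c in clusters if c]
    diameter = max(math.dist(a, b) for c in clusters for a in c for b in c)
    separation = min(math.dist(a, b)
                     for i, ci in enumerate(clusters)
                     for cj in clusters[i + 1:]
                     for a in ci for b in cj)
    return separation / diameter

if __name__ == "__main__":
    # Two tight groups plus one point halfway between them.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (5.5, 5.5)]
    # Hypothetical centroids standing in for the outputs of K-means,
    # K-means++ and Fuzzy C-means on this data.
    models = [
        [(0.5, 0.5), (10.5, 10.5)],
        [(0.6, 0.4), (10.4, 10.6)],
        [(0.5, 0.6), (10.5, 10.4)],
    ]
    kept, removed = split_outliers(data, models)
    print("outliers:", removed)  # -> outliers: [(5.5, 5.5)]
    print("Dunn before:", dunn_index(data, models[0]))
    print("Dunn after: ", dunn_index(kept, models[0]))
```

In this toy example the point halfway between the two groups receives a membership of about 0.5 from every model, falls below the threshold, and is removed; the Dunn index computed on the remaining points is higher than on the full data, which mirrors the kind of improvement the abstract reports. A point that lies far away but close to a single centroid would not be caught by this particular score, which is one reason to treat it as an illustration only.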



Acknowledgement

We would like to thank the Center for Microprocessor Application for Training Education and Research (CMATER) Research Laboratory of the Computer Science and Engineering Department, Jadavpur University, Kolkata, India, for providing us with the infrastructural support to carry out this research.

Author information

Corresponding author

Correspondence to Neeraj Kumar.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Saha, A., Chatterjee, A., Ghosh, S. et al. An ensemble approach to outlier detection using some conventional clustering algorithms. Multimed Tools Appl 80, 35145–35169 (2021). https://doi.org/10.1007/s11042-020-09628-5

  • DOI: https://doi.org/10.1007/s11042-020-09628-5
