Abstract
Outlier detection is an important research area in machine learning and data science. Outliers in a dataset limit its usefulness in real-life scenarios. Because the challenges vary from one dataset to another, researchers strive to find a general method that works across different datasets. In this paper, we propose an unsupervised outlier detection technique that uses an ensemble of three clustering algorithms, namely K-means, K-means++ and Fuzzy C-means. We also propose a way to deal with clustered outliers. The outcomes of the three clustering algorithms are combined so that their complementary information is retained. To combine the decisions of the hard and soft clustering algorithms, we propose a novel probability-based technique that assigns a membership value to each data point in the case of a hard clustering algorithm. Three cluster validity indices, which measure the goodness of a clustering, are used as evaluation metrics. These indices improve significantly after the outliers are removed, which indicates that removing the outliers yields more compact clusters. The method is evaluated on eight datasets, three of which are comparatively large. Source code of this work is available at: https://github.com/biswarup9/Outlier-Detection-Using-an-Ensemble-of-Clustering-Algorithms-.
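To illustrate what assigning a membership value to the output of a hard clustering algorithm can look like, here is a minimal Python sketch. The abstract does not state the formula used in the paper, so this sketch borrows the standard Fuzzy C-means membership rule and applies it to fixed centroids. The function name `soft_membership`, the fuzzifier `m = 2.0`, the toy data, and the 0.6 threshold are all assumptions made for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def soft_membership(X, centroids, m=2.0, eps=1e-12):
    """Turn hard-clustering centroids into per-point membership values.

    Assumption: uses the standard Fuzzy C-means membership rule,
    u_ij = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1)),
    which is not necessarily the rule proposed in the paper.
    """
    # Pairwise Euclidean distances, shape (n_points, n_clusters).
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, eps)  # guard against a point lying exactly on a centroid
    inv = d ** (-2.0 / (m - 1.0))
    # Normalise so that each row sums to 1.
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical toy data: two tight groups plus one point between them.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [2.5, 2.5]])
# Centroids as a hard algorithm such as K-means might return them (assumed).
centroids = np.array([[0.05, 0.0], [5.05, 5.0]])

U = soft_membership(X, centroids)
max_membership = U.max(axis=1)
# Points with no clearly dominant cluster; the 0.6 threshold is illustrative.
ambiguous = np.where(max_membership < 0.6)[0]
print(ambiguous)
```

Once the hard algorithms produce values on the same scale as Fuzzy C-means memberships, the three outputs can be compared or averaged point by point. How the paper actually combines them is not described in the abstract.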
Cite this article
Ray, B., Ghosh, S., Ahmed, S. et al. Outlier detection using an ensemble of clustering algorithms. Multimed Tools Appl 81, 2681–2709 (2022). https://doi.org/10.1007/s11042-021-11671-9
DOI: https://doi.org/10.1007/s11042-021-11671-9