Abstract
Outlier detection is an important requirement in data mining and machine learning. When data mining and machine learning algorithms are applied to datasets that contain outliers, they can lead to erroneous conclusions about the data. Researchers have therefore worked on removing outliers from datasets so that meaningful information can be retrieved. In this paper, we take a cluster-based ensemble approach to outlier detection, built on conventional clustering algorithms. Keeping in mind the drawbacks of supervised and semi-supervised learning, we rely on unsupervised learning algorithms. Our ensemble uses three clustering algorithms: K-means, K-means++, and Fuzzy C-means. The model combines the results of the individual clustering algorithms by assigning each data point a probability of belonging to each cluster. Because we want the flexibility to combine hard and soft clustering algorithms, we propose a technique for assigning a membership value to a data point under a hard clustering algorithm. From the probabilities assigned by the ensemble model, we then identify the outliers in the dataset. After removing these data points, we obtain better values of the cluster validity indices, which indicates that removing the outliers yields more compact clusters. We use five cluster validity indices to measure the quality of the clusters formed, and we evaluate the proposed model on eight widely used datasets, three of which are large. We observe a significant improvement in the cluster validity indices after applying our outlier detection algorithm. The experimental results indicate that the proposed method is empirically sound.
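The pipeline described above can be illustrated with a minimal sketch. The abstract does not state the membership rule for hard clustering or the rule for combining the ensemble members, so both are assumptions here: memberships are computed from centroids with a Fuzzy-C-means-style inverse-distance formula, and a point's outlier score is the average over ensemble members of one minus its largest membership. All function names are hypothetical, and the toy data is for illustration only.

```python
import numpy as np

def soft_membership(X, C, m=2.0):
    """Fuzzy-C-means-style membership of each point to fixed centroids C.

    Assumed rule (not taken from the paper) for giving soft memberships
    to the output of a hard clustering; each row sums to 1.
    """
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def kmeans(X, init, n_iter=50):
    """Plain Lloyd iterations starting from the given initial centroids."""
    C = np.array(init, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(C)):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                C[j] = X[labels == j].mean(axis=0)
    return C

def kmeanspp_init(X, k, rng):
    """K-means++ seeding: each new seed is drawn with probability
    proportional to its squared distance from the nearest chosen seed."""
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in C], axis=0)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(C)

def fuzzy_cmeans(X, init, m=2.0, n_iter=50):
    """Standard Fuzzy C-means centroid updates with fuzzifier m."""
    C = np.array(init, dtype=float)
    for _ in range(n_iter):
        W = soft_membership(X, C, m) ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]
    return C

def ensemble_outlier_scores(X, centroid_sets):
    """Average of (1 - max membership) over all ensemble members (assumed rule).

    Using the maximum membership avoids having to match cluster labels
    between different clustering runs.
    """
    scores = [1.0 - soft_membership(X, C).max(axis=1) for C in centroid_sets]
    return np.mean(scores, axis=0)

# Toy data: two tight groups plus one stray point (index 8).
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [10, 0], [11, 0], [10, 1], [11, 1],
              [5.5, 8]], dtype=float)
rng = np.random.default_rng(0)
k = 2
centroid_sets = [
    kmeans(X, X[rng.choice(len(X), size=k, replace=False)]),  # K-means, random init
    kmeans(X, kmeanspp_init(X, k, rng)),                      # K-means++
    fuzzy_cmeans(X, kmeanspp_init(X, k, rng)),                # Fuzzy C-means
]
scores = ensemble_outlier_scores(X, centroid_sets)
```

Points with the highest scores are the outlier candidates. How to turn scores into a decision (a fixed threshold, a top percentage, or something else) is left open here, since the abstract does not specify it. Note that a score based on maximum membership also rates points lying midway between two clusters as suspicious, which may or may not be desirable.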






Acknowledgement
We would like to thank the Center for Microprocessor Application for Training Education and Research (CMATER) Research Laboratory of the Computer Science and Engineering Department, Jadavpur University, Kolkata, India, for providing the infrastructural support needed to carry out this research.
Cite this article
Saha, A., Chatterjee, A., Ghosh, S. et al. An ensemble approach to outlier detection using some conventional clustering algorithms. Multimed Tools Appl 80, 35145–35169 (2021). https://doi.org/10.1007/s11042-020-09628-5
DOI: https://doi.org/10.1007/s11042-020-09628-5