Abstract
Outlier detection is an important data mining task with many contemporary applications. Clustering-based methods for outlier detection try to identify the data objects that deviate from the normal data. However, the cluster membership of an outlier object is uncertain, and this uncertainty has to be handled appropriately during the clustering process. Additionally, clustering data described by categorical attributes is challenging because the requisite methods and measures are difficult to define for such data. To address these issues, a novel algorithm for clustering categorical data aimed at outlier detection is proposed here by modifying the standard \(k\)-modes algorithm. The uncertainty in the clustering process is handled through a soft computing approach based on rough sets. Accordingly, the modified clustering algorithm incorporates the lower and upper approximation properties of rough sets. The efficacy of the proposed rough \(k\)-modes clustering algorithm for outlier detection is demonstrated using various benchmark categorical data sets.
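To make the idea of lower and upper approximations concrete, the following is a minimal Python sketch of one rough assignment step and one mode update for categorical data. It is not the algorithm of the paper, whose details the abstract does not give. It assumes the simple matching dissimilarity used by standard \(k\)-modes, a ratio threshold in the style of rough k-means, and a weighted mode update. The names `rough_assign`, `update_mode`, `epsilon` and `w_lower` are hypothetical.

```python
from collections import defaultdict


def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b))


def rough_assign(objects, modes, epsilon=1.1):
    """Assign each object to lower/upper approximations of the clusters.

    Assumption: an object is 'close' to every cluster whose mode lies within
    epsilon times the distance to its nearest mode. If only one cluster is
    close, the object joins that cluster's lower approximation; otherwise it
    joins only the upper approximations of all close clusters.
    """
    k = len(modes)
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for i, obj in enumerate(objects):
        dists = [matching_distance(obj, m) for m in modes]
        d_min = min(dists)
        close = [j for j, d in enumerate(dists) if d <= epsilon * d_min]
        for j in close:
            upper[j].add(i)
        if len(close) == 1:
            lower[close[0]].add(i)
    return lower, upper


def update_mode(objects, lower, upper, w_lower=0.7):
    """Recompute one cluster mode from weighted attribute-value frequencies.

    Assumption: lower-approximation objects share a total weight of w_lower
    and boundary objects (upper minus lower) share 1 - w_lower. If either
    group is empty, the other group carries the full weight.
    """
    boundary = upper - lower
    if not lower and not boundary:
        raise ValueError("cannot update the mode of an empty cluster")
    wl = w_lower if boundary else 1.0
    wb = (1.0 - w_lower) if lower else 1.0
    mode = []
    for a in range(len(objects[0])):
        counts = defaultdict(float)
        for i in sorted(lower):
            counts[objects[i][a]] += wl / len(lower)
        for i in sorted(boundary):
            counts[objects[i][a]] += wb / len(boundary)
        mode.append(max(counts, key=counts.get))
    return tuple(mode)


# Toy data: the last object is equally far from both modes.
objects = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "r"),
    ("b", "y", "s"),
    ("a", "y", "q"),
]
modes = [("a", "x", "p"), ("b", "y", "r")]

lower, upper = rough_assign(objects, modes)
new_mode_0 = update_mode(objects, lower[0], upper[0])
```

In this toy run the last object falls in the upper approximation of both clusters and in neither lower approximation, which is the kind of membership uncertainty the rough set formulation is meant to represent. How such objects are scored as outliers is left open here, since the abstract does not specify it.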








Cite this article
Suri, N.N.R.R., Murty, M.N. & Athithan, G. Detecting outliers in categorical data through rough clustering. Nat Comput 15, 385–394 (2016). https://doi.org/10.1007/s11047-015-9489-2
DOI: https://doi.org/10.1007/s11047-015-9489-2