Detecting outliers in categorical data through rough clustering

Natural Computing

Abstract

Outlier detection is an important data mining task with many contemporary applications. Clustering-based methods for outlier detection try to identify the data objects that deviate from the normal data. However, the uncertainty regarding the cluster membership of an outlier object has to be handled appropriately during the clustering process. Additionally, clustering data described by categorical attributes is challenging, because suitable methods and measures for such data are difficult to define. To address these issues, a novel algorithm for clustering categorical data aimed at outlier detection is proposed here by modifying the standard \(k\)-modes algorithm. The uncertainty in the clustering process is addressed through a soft computing approach based on rough sets. Accordingly, the modified clustering algorithm incorporates the lower and upper approximation properties of rough sets. The efficacy of the proposed rough \(k\)-modes clustering algorithm for outlier detection is demonstrated on various benchmark categorical data sets.
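The abstract does not give the algorithm's details, so the following is only a minimal sketch of how a rough variant of \(k\)-modes might look, not the authors' method. It assumes the simple matching dissimilarity, an additive ambiguity threshold `eps`, and weights `w_lower` and `w_boundary` for the mode update, all of which are hypothetical choices modelled on rough \(k\)-means. Treating boundary objects as outlier candidates is likewise an illustrative assumption.

```python
from collections import defaultdict


def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def rough_assign(objects, modes, eps=1):
    """Place each object in one lower approximation or in several boundaries.

    An object is certain (lower approximation) when exactly one mode lies
    within `eps` of its nearest-mode distance; otherwise it is ambiguous and
    goes to the boundary region of every such cluster. `eps` is an assumed
    parameter, not taken from the paper.
    """
    lower = [[] for _ in modes]
    boundary = [[] for _ in modes]
    for idx, obj in enumerate(objects):
        dists = [mismatch(obj, m) for m in modes]
        d_min = min(dists)
        close = [j for j, d in enumerate(dists) if d - d_min <= eps]
        if len(close) == 1:
            lower[close[0]].append(idx)
        else:
            for j in close:
                boundary[j].append(idx)
    return lower, boundary


def update_mode(objects, lower_idx, boundary_idx, old_mode,
                w_lower=0.7, w_boundary=0.3):
    """Recompute a mode as the weighted most frequent value per attribute."""
    if not lower_idx and not boundary_idx:
        return old_mode  # empty cluster: keep the previous mode
    new_mode = []
    for a in range(len(old_mode)):
        votes = defaultdict(float)
        for i in lower_idx:
            votes[objects[i][a]] += w_lower
        for i in boundary_idx:
            votes[objects[i][a]] += w_boundary
        new_mode.append(max(votes, key=votes.get))
    return tuple(new_mode)


def rough_k_modes(objects, initial_modes, eps=1, max_iter=20):
    """Alternate rough assignment and mode update until the modes stop changing."""
    modes = [tuple(m) for m in initial_modes]
    for _ in range(max_iter):
        lower, boundary = rough_assign(objects, modes, eps)
        new_modes = [update_mode(objects, lower[j], boundary[j], modes[j])
                     for j in range(len(modes))]
        if new_modes == modes:
            break
        modes = new_modes
    # Final assignment consistent with the returned modes.
    lower, boundary = rough_assign(objects, modes, eps)
    return modes, lower, boundary


def boundary_objects(boundary):
    """Indices of objects with uncertain membership (assumed outlier candidates)."""
    return sorted({i for members in boundary for i in members})
```

With two well-separated groups of three objects each and one object that is equally far from both modes, the sketch keeps the six regular objects in the lower approximations and leaves the seventh in the boundary of both clusters. The upper approximation of a cluster is then the union of its lower approximation and its boundary.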

Author information

Corresponding author

Correspondence to N. N. R. Ranga Suri.

About this article

Cite this article

Suri, N.N.R.R., Murty, M.N. & Athithan, G. Detecting outliers in categorical data through rough clustering. Nat Comput 15, 385–394 (2016). https://doi.org/10.1007/s11047-015-9489-2

  • DOI: https://doi.org/10.1007/s11047-015-9489-2
