Abstract
Outlier detection is usually treated as a preprocessing step that locates the objects in a data set that do not conform to well-defined notions of expected behavior. It is a major data mining task for discovering novel or rare events, actions, and phenomena. We investigate outlier detection in categorical data sets. The problem is especially challenging because it is difficult to define a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and formulate outlier detection as an optimization problem. To solve the optimization problem, we design a practical, parameter-free method named ITB. Experimental results show that ITB is much more effective and efficient than existing mainstream methods.
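The abstract does not describe how ITB works, so the following is not the authors' method. It is only a minimal sketch of a generic frequency-based baseline, showing how categorical objects can be ranked without any pairwise similarity measure: each object is scored by the mean relative frequency of its attribute values, and the lowest-scoring objects are reported. The function names `frequency_scores` and `least_typical`, the toy data, and the cutoff `k` are all assumptions made for illustration.

```python
from collections import Counter


def frequency_scores(rows):
    """Score each row by the mean relative frequency of its attribute values.

    Hypothetical baseline for illustration only; this is NOT the ITB method.
    """
    n = len(rows)
    n_attrs = len(rows[0])
    # One value-count table per attribute (column).
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(counts[j][row[j]] / n for j in range(n_attrs)) / n_attrs
        for row in rows
    ]


def least_typical(rows, k):
    """Return the indices of the k rows with the lowest frequency score."""
    scores = frequency_scores(rows)
    return sorted(range(len(rows)), key=lambda i: scores[i])[:k]


# Toy categorical data set (assumed for illustration).
rows = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]

print(least_typical(rows, 1))
```

Note that this baseline needs a user-chosen `k`, which is exactly the kind of parameter the abstract says ITB avoids.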
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wu, S., Wang, S. (2011). Parameter-Free Anomaly Detection for Categorical Data. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2011. Lecture Notes in Computer Science, vol 6871. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23199-5_9
DOI: https://doi.org/10.1007/978-3-642-23199-5_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23198-8
Online ISBN: 978-3-642-23199-5
eBook Packages: Computer Science, Computer Science (R0)