Abstract
Anomaly detection is the process of identifying cases, or groups of cases, that are in some way unusual and do not fit the general patterns present in the dataset. Numerous algorithms use discretization of numerical data in their detection processes. This study investigates the effect of the employed discretization method on the unsupervised detection of each of the six anomaly types acknowledged in a recent typology of data anomalies. To this end, experiments are conducted with various datasets and SECODA, a general-purpose algorithm for unsupervised non-parametric anomaly detection in datasets with numerical and categorical attributes. This algorithm employs discretization of continuous attributes, exponentially increasing weights and discretization cut points, and a pruning heuristic to detect anomalies with an optimal number of iterations. The empirical results of experiments with synthetic and real-world data demonstrate that standard SECODA can detect all six types, but that different discretization methods favor the discovery of certain anomaly types. These main findings also hold for other detection techniques using discretization.
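The dependence on the discretization method can be illustrated with a toy example. The sketch below is not SECODA: it is a minimal single-attribute illustration that assumes a case is scored by the number of cases sharing its bin, so that a lower count means a rarer case. The function names (`equal_width_cuts`, `equal_frequency_cuts`, `bin_counts`) and the data are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter


def equal_width_cuts(values, bins):
    """Cut points that split the value range into intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + step * i for i in range(1, bins)]


def equal_frequency_cuts(values, bins):
    """Cut points that put roughly the same number of cases in each interval."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]


def bin_counts(values, cuts):
    """For each case, the number of cases that fall in the same bin.

    A low count is used here as a simple rarity score (an assumption of this
    sketch, not the scoring rule of any particular published algorithm).
    """
    labels = [bisect_right(cuts, v) for v in values]
    freq = Counter(labels)
    return [freq[label] for label in labels]


# Eleven ordinary cases and one extreme value (hypothetical data).
values = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 9.0]

ew = bin_counts(values, equal_width_cuts(values, 4))
ef = bin_counts(values, equal_frequency_cuts(values, 4))

print("equal width:    ", ew)
print("equal frequency:", ef)
```

With equal-width bins the extreme value is alone in its bin and stands out, whereas with equal-frequency bins it shares a bin with its nearest neighbours and receives the same count as every other case. This only shows that the choice of method can change which cases look rare; it says nothing about how the methods compare across the six anomaly types studied in the paper.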
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Foorthuis, R. (2019). All or In-cloud: How the Identification of Six Types of Anomalies Is Affected by the Discretization Method. In: Atzmueller, M., Duivesteijn, W. (eds) Artificial Intelligence. BNAIC 2018. Communications in Computer and Information Science, vol 1021. Springer, Cham. https://doi.org/10.1007/978-3-030-31978-6_3
DOI: https://doi.org/10.1007/978-3-030-31978-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31977-9
Online ISBN: 978-3-030-31978-6
eBook Packages: Computer Science, Computer Science (R0)