All or In-cloud: How the Identification of Six Types of Anomalies Is Affected by the Discretization Method

Foorthuis, Ralph

doi:10.1007/978-3-030-31978-6_3

Ralph Foorthuis ORCID: orcid.org/0000-0003-1132-4767

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1021)

Included in the following conference series:

Benelux Conference on Artificial Intelligence

Abstract

Anomaly detection is the process of identifying cases, or groups of cases, that are in some way unusual and do not fit the general patterns present in the dataset. Numerous algorithms use discretization of numerical data in their detection processes. This study investigates the effect of the employed discretization method on the unsupervised detection of each of the six anomaly types acknowledged in a recent typology of data anomalies. To this end, experiments are conducted with various datasets and SECODA, a general-purpose algorithm for unsupervised non-parametric anomaly detection in datasets with numerical and categorical attributes. This algorithm employs discretization of continuous attributes, exponentially increasing weights and discretization cut points, and a pruning heuristic to detect anomalies with an optimal number of iterations. The empirical results of experiments with synthetic and real-world data demonstrate that standard SECODA can detect all six types, but that different discretization methods favor the discovery of certain anomaly types. These main findings also hold for other detection techniques using discretization.
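The dependence on the discretization method can be illustrated with a minimal univariate sketch. It is not SECODA: it assumes two common unsupervised binning schemes (equal-width and equal-frequency), makes a single pass over one attribute, and scores each case by how many cases share its bin. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land on index k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (roughly) equal case counts."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def rarity_scores(bins):
    """Score each case by the number of cases in its bin (lower = more unusual)."""
    counts = Counter(bins)
    return [counts[b] for b in bins]


# Toy data (assumed): eight clustered values and one isolated extreme value.
values = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0]

width_scores = rarity_scores(equal_width_bins(values, 3))
freq_scores = rarity_scores(equal_frequency_bins(values, 3))

print("equal-width    :", width_scores)
print("equal-frequency:", freq_scores)
```

On this toy input, equal-width binning leaves the extreme value alone in its bin, whereas equal-frequency binning gives every bin the same count, so a bin-count score alone no longer separates it. This only shows that the choice of binning scheme changes which cases look rare; it makes no claim about which anomaly types the paper found each method to favor.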

Author information

Authors and Affiliations

UWV / Heineken, Amsterdam, The Netherlands
Ralph Foorthuis

Corresponding author

Correspondence to Ralph Foorthuis.

Editor information

Editors and Affiliations

Tilburg University, Tilburg, The Netherlands
Martin Atzmueller
Eindhoven University of Technology, Eindhoven, The Netherlands
Wouter Duivesteijn

About this paper

Cite this paper

Foorthuis, R. (2019). All or In-cloud: How the Identification of Six Types of Anomalies Is Affected by the Discretization Method. In: Atzmueller, M., Duivesteijn, W. (eds) Artificial Intelligence. BNAIC 2018. Communications in Computer and Information Science, vol 1021. Springer, Cham. https://doi.org/10.1007/978-3-030-31978-6_3

DOI: https://doi.org/10.1007/978-3-030-31978-6_3
Published: 25 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31977-9
Online ISBN: 978-3-030-31978-6
eBook Packages: Computer Science, Computer Science (R0)
