
Ensemble of Local Decision Trees for Anomaly Detection in Mixed Data

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Abstract

Anomaly Detection (AD) is used in many real-world applications such as cybersecurity, banking, and national intelligence. Though many AD algorithms have been proposed in the literature, their effectiveness in practical real-world problems is rather limited. This is mainly because most of them: (i) examine anomalies globally, w.r.t. the entire data, whereas some anomalies exhibit suspicious characteristics only w.r.t. their local neighbourhood (local context) and appear normal in the global context; and (ii) assume that all data features are numeric, whereas real-world data have both numeric/quantitative and categorical/qualitative features. In this paper, we propose a simple, robust solution to address the above-mentioned issues. The main idea is to partition the data space and build local models in different regions rather than building a single global model for the entire data space. To cover sufficient local context around a test data instance, multiple local models from different partitions (an ensemble of local models) are used. As local models, we use classical decision trees, which handle numeric and categorical features well. Our results show that an Ensemble of Local Decision Trees (ELDT) produces better and more consistent detection accuracy than popular state-of-the-art AD methods, particularly on datasets with mixed feature types.
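The partition-then-ensemble idea in the abstract can be sketched in a few lines. Note this is a minimal illustration under stated assumptions, not the paper's method: random subsamples stand in for the data-space partitions, and a per-feature value-frequency table stands in for the paper's decision-tree local models; all function names and parameters (`eldt_like_scores`, `n_models`, `sample_size`) are hypothetical.

```python
import random
from collections import Counter

def build_local_model(sample):
    # Local model: per-feature value-frequency tables over one partition.
    # (A stand-in for the paper's decision trees; like them, it handles
    # categorical values natively and numeric values if pre-binned.)
    tables = [Counter(row[j] for row in sample) for j in range(len(sample[0]))]
    return tables, len(sample)

def local_score(instance, model):
    tables, n = model
    # Feature values that are rare in this local context -> higher score.
    return sum(1.0 - tables[j][v] / n for j, v in enumerate(instance)) / len(instance)

def eldt_like_scores(data, n_models=10, sample_size=8, seed=0):
    rng = random.Random(seed)
    # Build an ensemble of local models, one per random subsample (partition).
    models = [build_local_model(rng.sample(data, sample_size)) for _ in range(n_models)]
    # Final anomaly score: average over the ensemble of local models.
    return [sum(local_score(x, m) for m in models) / n_models for x in data]

# Toy mixed/categorical data: mostly ('a', 'x'), one rare combination.
data = [('a', 'x')] * 20 + [('b', 'y')]
scores = eldt_like_scores(data, n_models=20, sample_size=10)
print(scores[-1] > max(scores[:-1]))  # prints True: the rare instance scores highest
```

Because every instance is scored against several partitions, an instance that is unusual in its local neighbourhood gets a high average score even if its individual feature values are not globally rare.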




Acknowledgement

This research was funded by the Department of Defence and the Office of National Intelligence under the AI for Decision Making Program, delivered in partnership with the Defence Science Institute in Victoria, Australia.

Author information

Correspondence to Sunil Aryal.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Aryal, S., Wells, J.R. (2021). Ensemble of Local Decision Trees for Anomaly Detection in Mixed Data. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science(), vol 12975. Springer, Cham. https://doi.org/10.1007/978-3-030-86486-6_42

  • DOI: https://doi.org/10.1007/978-3-030-86486-6_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86485-9

  • Online ISBN: 978-3-030-86486-6

  • eBook Packages: Computer Science, Computer Science (R0)
