Abstract
Anomaly detection is a key problem in many application domains, such as industrial failure detection, cybersecurity, and transport. Different research communities have explored several families of approaches, including statistical methods, machine learning, and sketching, to detect anomalies in an increasingly challenging context. Because data are now generated in huge volumes and at ever higher rates, the response time of an algorithm and its ability to be distributed have become decisive criteria, alongside its accuracy in detecting anomalies. In this paper we focus on Isolation Forest, an unsupervised anomaly detection algorithm based on binary trees. It combines excellent accuracy with a very low execution time, thanks to its linear complexity. In particular, we study the architecture of two distributed implementations of Isolation Forest built on the Apache Spark framework. We then compare the performance of these two solutions on four commonly used real-world datasets.
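To make the tree-based principle concrete, here is a minimal single-machine sketch of Isolation Forest in Python, following the general scheme of Liu et al. (2008). It is not either of the two Spark implementations studied in the paper. The function names, the dictionary-based tree layout, and the default parameters (100 trees, subsamples of 64 points, assumed to be at least 2) are illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Expected path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # The harmonic number H(n-1) is approximated by ln(n-1) + Euler's constant.
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}  # leaf
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split further on this dimension
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Depth at which x lands, plus a correction for the unsplit leaf."""
    if "size" in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Build each tree on a small random subsample of the data."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    forest = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    return forest, sample_size


def anomaly_score(x, forest, sample_size):
    """Score in (0, 1]; values closer to 1 indicate points that are easier to isolate."""
    mean_depth = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / average_path_length(sample_size))


# Usage: a Gaussian cluster plus one distant point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((10.0, 10.0))
forest, psi = fit_forest(data)
print(anomaly_score((10.0, 10.0), forest, psi))
print(anomaly_score((0.0, 0.0), forest, psi))
```

Because every tree is built independently on a fixed-size subsample, tree construction is a natural candidate for parallel execution, which is the property that distributed implementations can exploit.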
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Togbe, M.U., Chabchoub, Y., Boly, A., Chiky, R. (2022). Distributed Anomalies Detection Using Isolation Forest and Spark. In: Bădică, C., Treur, J., Benslimane, D., Hnatkowska, B., Krótkiewicz, M. (eds) Advances in Computational Collective Intelligence. ICCCI 2022. Communications in Computer and Information Science, vol 1653. Springer, Cham. https://doi.org/10.1007/978-3-031-16210-7_57
DOI: https://doi.org/10.1007/978-3-031-16210-7_57
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16209-1
Online ISBN: 978-3-031-16210-7
eBook Packages: Computer Science, Computer Science (R0)