dSalmon: High-Speed Anomaly Detection for Evolving Multivariate Data Streams

Hartl, Alexander; Iglesias, Félix; Zseby, Tanja

doi:10.1007/978-3-031-48885-6_10

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 539)

Included in the following conference series:

EAI International Conference on Performance Evaluation Methodologies and Tools

Abstract

We introduce dSalmon, a highly efficient framework for outlier detection on streaming data. dSalmon can be used from both Python and C++, meeting the requirements of modern data science research. It provides an intuitive interface and has almost no package dependencies. dSalmon implements the main stream outlier detection approaches from the literature. Because its core is written in pure C++ and makes the most of available parallelism, it analyzes data with superior processing speed.
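
To make the notion of outlier detection on streaming data concrete, the following is a minimal sketch that assumes a sliding-window k-nearest-neighbor scoring rule: each arriving sample is scored by its distance to the k-th nearest neighbor among the most recent samples. It is an illustration only, not dSalmon's code or API; the class name `SlidingWindowKNN`, the method `fit_predict_one`, the helper `score_stream`, and the parameters `window` and `k` are hypothetical names chosen for this example.

```python
from collections import deque
import math


class SlidingWindowKNN:
    """Toy streaming outlier scorer (hypothetical, not dSalmon's API).

    The score of a sample is its Euclidean distance to the k-th nearest
    neighbor among the samples currently held in the sliding window.
    """

    def __init__(self, window, k):
        if k < 1 or window < k:
            raise ValueError("need 1 <= k <= window")
        self.k = k
        # deque with maxlen drops the oldest sample automatically.
        self.buffer = deque(maxlen=window)

    def fit_predict_one(self, x):
        """Score one sample against the window, then add it to the window."""
        x = tuple(float(v) for v in x)
        dists = sorted(math.dist(x, y) for y in self.buffer)
        # Until k neighbors are available, the score is undefined (infinite).
        score = dists[self.k - 1] if len(dists) >= self.k else math.inf
        self.buffer.append(x)
        return score


def score_stream(samples, window=100, k=3):
    """Score an iterable of samples one at a time, in arrival order."""
    detector = SlidingWindowKNN(window, k)
    return [detector.fit_predict_one(x) for x in samples]


# A small cluster near the origin with one far-away point at index 4.
stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (0.05, 0.05)]
scores = score_stream(stream, window=4, k=2)
```

In this sketch the far-away point receives a much larger score than the samples in the cluster, and the window bound keeps memory constant as the stream grows. The per-sample cost is linear in the window size, which is the kind of overhead that a compiled, parallel implementation aims to reduce.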

We describe design decisions and outline the software architecture of dSalmon. Additionally, we perform thorough evaluations on benchmarking datasets to measure execution time, memory requirements and energy consumption when performing outlier detection. Experiments show that dSalmon requires substantially fewer resources and, in most cases, processes datasets between one and three orders of magnitude faster than established Python implementations.

This work was supported by the project MALware cOmmunication in cRitical Infrastructures (MALORI), funded by the Austrian security research program KIRAS of the Federal Ministry for Agriculture, Regions and Tourism (BMLRT) under grant no. 873511.

Notes

1. https://github.com/CN-TU/dSalmon
2. Missing results for RRCF using PySAD indicate experiment runs that failed due to reaching Python’s recursion limit.
3. xStream for KDD Cup’99 using PySAD with 50 random projections failed due to running out of memory.

Author information

Authors and Affiliations

TU Wien—Institute of Telecommunications, 1040, Wien, Austria
Alexander Hartl, Félix Iglesias & Tanja Zseby

Corresponding author

Correspondence to Alexander Hartl.

Editor information

Editors and Affiliations

University of Cambridge, Cambridge, UK
Evangelia Kalyvianaki
University of Southern California, Los Angeles, CA, USA
Marco Paolieri

Cite this paper

Hartl, A., Iglesias, F., Zseby, T. (2024). dSalmon: High-Speed Anomaly Detection for Evolving Multivariate Data Streams. In: Kalyvianaki, E., Paolieri, M. (eds) Performance Evaluation Methodologies and Tools. VALUETOOLS 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-031-48885-6_10

DOI: https://doi.org/10.1007/978-3-031-48885-6_10
Published: 03 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48884-9
Online ISBN: 978-3-031-48885-6
eBook Packages: Computer Science, Computer Science (R0)
