Abstract
We introduce dSalmon, a highly efficient framework for outlier detection on streaming data. dSalmon can be used with both Python and C++, meeting the requirements of modern data science research. It provides an intuitive interface and has almost no package dependencies. dSalmon implements the main stream outlier detection approaches from the literature. By using pure C++ in its core and exploiting available parallelism, dSalmon analyzes data with superior processing speed.
We describe design decisions and outline the software architecture of dSalmon. Additionally, we perform thorough evaluations on benchmarking datasets to measure execution time, memory requirements and energy consumption when performing outlier detection. Experiments show that dSalmon requires substantially fewer resources and in most cases processes datasets between one and three orders of magnitude faster than established Python implementations.
This work was supported by the project MALware cOmmunication in cRitical Infrastructures (MALORI), funded by the Austrian security research program KIRAS of the Federal Ministry for Agriculture, Regions and Tourism (BMLRT) under grant no. 873511.
Notes
- 1.
- 2. Missing results for RRCF using PySAD indicate experiment runs that failed due to reaching Python’s recursion limit.
- 3. xStream for KDD Cup’99 using PySAD with 50 random projections failed due to running out of memory.
References
KDD Cup 1999 data. https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (1999), accessed: 2023-07-04
Ahmadzadeh, A., Aydin, B.: Multivariate Timeseries Feature Extraction on SWAN Data Benchmark (SWAN_Features) (2020), GSU Data Mining Lab
Angiulli, F., Fassetti, F.: Detecting distance-based outliers in streams of data. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 811–820, CIKM’07, ACM, New York, NY, USA (2007)
Angryk, R.A., Martens, P.C., Aydin, B., Kempton, D., Mahajan, S.S., Basodi, S., Ahmadzadeh, A., Cai, X., Filali Boubrahimi, S., Hamdi, S.M., Schuh, M.A., Georgoulis, M.K.: Multivariate time series dataset for space weather data analytics. Sci. Data 7(227) (2020)
Bachl, M., Hartl, A., Fabini, J., Zseby, T.: Walling up backdoors in intrusion detection systems. In: Big-DAMA ’19, pp. 8–13. ACM, Orlando, FL, USA (2019)
Beazley, D.M.: SWIG: An easy to use tool for integrating scripting languages with C and C++. In: Proceedings of the 4th Conference on USENIX Tcl/Tk Workshop, 1996 - Volume 4, p. 15, TCLTK’96, USENIX Association, USA (1996)
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: massive online analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
Campos, G.O., Zimek, A., et al.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowl. Discovery 30(4), 891–927 (2016), ISSN 1573-756X
Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd International Conference on Very Large Data Bases, pp. 426–435, VLDB ’97, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997), ISBN 1558604707
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)
David, H., Gorbatov, E., Hanebutte, U.R., Khanna, R., Le, C.: RAPL: memory power estimation and capping. In: 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), pp. 189–194. IEEE (2010)
Guha, S., Mishra, N., Roy, G., Schrijvers, O.: Robust random cut forest based anomaly detection on streams. In: Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, pp. 2712–2721, PMLR, New York, New York, USA (2016)
Gurtovoy, A., Abrahams, D.: The boost C++ metaprogramming library, p. 22 (2002)
Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., del Río, J.F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., Oliphant, T.E.: Array programming with NumPy. Nature 585(7825), 357–362 (2020)
Hartl, A., Bachl, M., Fabini, J., Zseby, T.: Explainability and adversarial robustness for RNNs. In: 2020 IEEE Sixth International Conference on Big Data Computing Service and Applications (BigDataService), pp. 148–156. IEEE, New York, NY, USA (2020a)
Hartl, A., Iglesias, F., Zseby, T.: SDOstream: Low-density models for streaming outlier detection. In: ESANN 2020 proceedings, pp. 661–666 (2020b)
Iglesias, F., Hartl, A., Zseby, T., Zimek, A.: Are network attacks outliers? A study of space representations and unsupervised algorithms. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 159–175. Springer (2019)
Iglesias Vázquez, F., Hartl, A., Zseby, T., Zimek, A.: Anomaly detection in streaming data: A comparison and evaluation study. Expert Syst. with Appl. 233, 120994 (2023), ISSN 0957-4174
Kontaki, M., Gounaris, A., Papadopoulos, A.N., Tsichlas, K., Manolopoulos, Y.: Continuous monitoring of distance-based outliers over data streams. In: IEEE 27th International Conference on Data Engineering, pp. 135–146 (2011)
Lakkaraju, H., Rudin, C.: Learning cost-effective and interpretable treatment regimes. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 166–175, PMLR, Fort Lauderdale, FL, USA (2017)
Lundberg, H., Mowla, N.I., Abedin, S.F., Thar, K., Mahmood, A., Gidlund, M., Raza, S.: Experimental analysis of trustworthy in-vehicle intrusion detection system using explainable artificial intelligence (XAI). IEEE Access 10, 102831–102841 (2022)
Manzoor, E.A., Lamba, H., Akoglu, L.: xStream: outlier detection in feature-evolving data streams. In: 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2018)
Meghdouri, F.: Datasets Preprocessing (2021). https://github.com/CN-TU/Datasets-preprocessing, GitHub repository
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pevný, T.: Loda: Lightweight on-line detector of anomalies. Mach. Learn. 102(2), 275–304 (2016)
Sathe, S., Aggarwal, C.C.: Subspace outlier detection in linear time with randomized hashing. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 459–468 (2016)
Schubert, E., Zimek, A.: ELKI: A large open-source library for data analysis - ELKI release 0.7.5 "Heidelberg". arXiv preprint arXiv:1902.03616 (2019)
Sharafaldin, I., Habibi Lashkari, A., Ghorbani, A.A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP, pp. 108–116, SCITEPRESS, Funchal, Madeira, Portugal (2018)
Tan, S.C., Ting, K.M., Liu, T.F.: Fast anomaly detection for streaming data. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)
Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson, J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Carey, C.J., Polat, İ., Feng, Y., Moore, E.W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E.A., Harris, C.R., Archibald, A.M., Ribeiro, A.H., Pedregosa, F., van Mulbregt, P., SciPy 1.0 contributors: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17, 261–272 (2020)
Weng, S.F., Reps, J., Kai, J., Garibaldi, J.M., Qureshi, N.: Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE 12(4), 1–14 (2017)
Williams, N., Zander, S., Armitage, G.: A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification. SIGCOMM Comput. Commun. Rev. 36(5), 5–16 (2006)
Wouters, T.: Answer to "What is the maximum recursion depth in Python, and how to increase it?" (2010). https://stackoverflow.com/a/3323013, Stack Overflow discussion
Yang, D., Rundensteiner, E., Ward, M.O.: Neighbor-based pattern detection for windows over streaming data. In: Proceedings of the 12th International Conference on Extending Database Tech.: Advances in Database Tech., pp. 529–540, EDBT’09, ACM, New York, NY, USA (2009)
Yilmaz, S.F., Kozat, S.S.: PySAD: a streaming anomaly detection framework in Python (2020). arXiv preprint arXiv:2009.02572
Zhao, Y., Nasrullah, Z., Li, Z.: PyOD: a Python toolbox for scalable outlier detection. J. Mach. Learn. Res. 20(96), 1–7 (2019)
Copyright information
© 2024 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Hartl, A., Iglesias, F., Zseby, T. (2024). dSalmon: High-Speed Anomaly Detection for Evolving Multivariate Data Streams. In: Kalyvianaki, E., Paolieri, M. (eds) Performance Evaluation Methodologies and Tools. VALUETOOLS 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-031-48885-6_10
DOI: https://doi.org/10.1007/978-3-031-48885-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48884-9
Online ISBN: 978-3-031-48885-6
eBook Packages: Computer Science, Computer Science (R0)