dSalmon: High-Speed Anomaly Detection for Evolving Multivariate Data Streams

  • Conference paper
Performance Evaluation Methodologies and Tools (VALUETOOLS 2023)

Abstract

We introduce dSalmon, a highly efficient framework for outlier detection on streaming data. dSalmon can be used from both Python and C++, meeting the requirements of modern data science research. It provides an intuitive interface and has almost no package dependencies. dSalmon implements the main stream outlier detection approaches from the literature. Because its core is written in pure C++ and exploits available parallelism, it analyzes data at high processing speed.

We describe design decisions and outline the software architecture of dSalmon. Additionally, we perform thorough evaluations on benchmarking datasets to measure execution time, memory requirements and energy consumption during outlier detection. Experiments show that dSalmon requires substantially fewer resources and, in most cases, processes datasets between one and three orders of magnitude faster than established Python implementations.

This work was supported by the project MALware cOmmunication in cRitical Infrastructures (MALORI), funded by the Austrian security research program KIRAS of the Federal Ministry for Agriculture, Regions and Tourism (BMLRT) under grant no. 873511.

Notes

  1. https://github.com/CN-TU/dSalmon.

  2. Missing results for RRCF using PySAD indicate experiment runs that failed due to reaching Python’s recursion limit.

  3. xStream for KDD Cup’99 using PySAD with 50 random projections failed due to running out of memory.

Author information

Corresponding author

Correspondence to Alexander Hartl.

Copyright information

© 2024 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Hartl, A., Iglesias, F., Zseby, T. (2024). dSalmon: High-Speed Anomaly Detection for Evolving Multivariate Data Streams. In: Kalyvianaki, E., Paolieri, M. (eds) Performance Evaluation Methodologies and Tools. VALUETOOLS 2023. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-031-48885-6_10

  • DOI: https://doi.org/10.1007/978-3-031-48885-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48884-9

  • Online ISBN: 978-3-031-48885-6

  • eBook Packages: Computer Science, Computer Science (R0)
