Dataset Generation Methodology: Towards Application of Machine Learning in Industrial Water Treatment Security

  • Original Research
SN Computer Science

Abstract

Successful cyber attacks against industrial systems, such as water treatment systems, can lead to irreparable consequences for public health and the economy. Machine learning and deep learning could help detect and forecast previously unknown cyber attacks, but they require specific datasets. The number of publicly available datasets in this field is very limited, and the majority of the publicly available datasets used in cyber security tasks have severe flaws. In this paper, the authors introduce a unified methodology for generating a dataset for industrial water treatment security. A detailed specification of the stages of the methodology is given. The paper ends with a usage scenario describing the preparatory stages of dataset generation for cybersecurity research in water treatment systems, namely, specification of the technological process, testbed development, and development of the attack model for the considered technological process. The developed methodology will be used to generate a dataset, which, in turn, will be used to develop and test cyber attack detection methods based on machine learning and deep learning, and to strengthen the security of water treatment systems.

Data Availability

The data are available upon request from the corresponding author.

Code Availability

Not applicable.

Acknowledgements

The research is supported by Russian Science Foundation grant #23-11-20024, https://rscf.ru/project/23-11-20024/, and by the St. Petersburg Science Foundation.

Author information

Contributions

E. Fedorchenko: Conceptualization and Methodology, Formal analysis, Writing - original draft preparation, Writing - Review & Editing; E. Novikova: Conceptualization and Methodology, Formal analysis, Writing - original draft preparation, Writing - Review & Editing; A. Danilov: Formal analysis, Writing - Review & Editing, Visualization; I. Saenko: Validation, Funding acquisition, Project supervision.

Corresponding author

Correspondence to Elena Fedorchenko.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Emerging Applications of Data Science for Real-World Problems” guest edited by Satyasai Jagannath Nanda, Rajendra Prasad Yadav and Mukesh Saraswat.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Novikova, E., Fedorchenko, E., Danilov, A. et al. Dataset Generation Methodology: Towards Application of Machine Learning in Industrial Water Treatment Security. SN COMPUT. SCI. 5, 373 (2024). https://doi.org/10.1007/s42979-024-02704-9

  • DOI: https://doi.org/10.1007/s42979-024-02704-9
