Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning

Shih, Wen-Chung; Yang, Chao-Tung; Jiang, Cheng-Tian; Kristiani, Endah

doi:10.1007/s11227-022-04802-y

Published: 06 October 2022

Volume 79, pages 4983–5012, (2023)

The Journal of Supercomputing

Wen-Chung Shih^1,
Chao-Tung Yang^2,3 (ORCID: orcid.org/0000-0002-9579-4426),
Cheng-Tian Jiang^2,5 &
Endah Kristiani^2,4


Abstract

Big data and artificial intelligence (AI) technologies are complex systems that have continued to develop in recent years. This paper implements a data lake architecture to handle massive data and perform data analysis in a real-time system. Using a data lake and an AI model, we deployed a NetFlow storage monitoring system as a platform that covers the storage, query, analysis, and visualization of massive volumes of data. The big data platform was built on Cloudera and uses big data tools such as Kafka, Spark, HBase, Hive, and Impala. In addition, we used Spark to develop network threat recognition models with distributed deep learning, training a deep neural network (DNN) as the model. The evaluated model reached 94% accuracy while reducing training time by 48%, which demonstrates that deep learning model training time is significantly shortened. Additionally, the system is run under several configurations to assess the factors that influence accuracy and performance. The model is evaluated using a confusion matrix to demonstrate that it can accurately detect attack behavior in log data. Furthermore, we developed a real-time log data monitoring and analysis system to demonstrate the proposed architecture.
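The confusion-matrix evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a two-class setting (attack vs. normal), which may be simpler than the labeling used in the paper; the class name `BinaryConfusion`, the helper functions, and the sample labels are all hypothetical and are not taken from the authors' system or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryConfusion:
    """Counts for a two-class (attack vs. normal) confusion matrix."""
    tp: int  # attack flows correctly flagged as attack
    fp: int  # normal flows wrongly flagged as attack
    fn: int  # attack flows missed (predicted normal)
    tn: int  # normal flows correctly predicted normal


def confusion_from_labels(y_true, y_pred, positive="attack"):
    """Tally the four confusion-matrix cells from paired labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return BinaryConfusion(tp=tp, fp=fp, fn=fn, tn=tn)


def metrics(cm):
    """Derive accuracy, precision, recall, and F1 from the counts."""
    total = cm.tp + cm.fp + cm.fn + cm.tn
    accuracy = (cm.tp + cm.tn) / total if total else 0.0
    precision = cm.tp / (cm.tp + cm.fp) if (cm.tp + cm.fp) else 0.0
    recall = cm.tp / (cm.tp + cm.fn) if (cm.tp + cm.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only (not results from the paper).
y_true = ["attack", "attack", "normal", "normal", "attack"]
y_pred = ["attack", "normal", "normal", "attack", "attack"]
cm = confusion_from_labels(y_true, y_pred)
scores = metrics(cm)
```

Accuracy alone can be misleading when attack flows are rare, so reporting precision and recall alongside it gives a fuller picture of detection quality.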

Data Availability

None

Acknowledgements

This work was supported by the National Science and Technology Council (NSTC), Taiwan (R.O.C.), under grant numbers 111-2622-E-029-003-, 111-2811-E-029-001-, 111-2621-M-029-004-, and 110-2221-E-029-020-MY3.

Author information

Authors and Affiliations

Department of M-Commerce and Multimedia Applications, Asia University, Taichung City, 41354, Taiwan, R.O.C.
Wen-Chung Shih
Department of Computer Science, Tunghai University, Taichung City, 407224, Taiwan, R.O.C.
Chao-Tung Yang, Cheng-Tian Jiang & Endah Kristiani
Research Center for Smart Sustainable Circular Economy, Tunghai University, No. 1727, Sec.4, Taiwan Boulevard, Taichung City, 407224, Taiwan, R.O.C.
Chao-Tung Yang
Department of Informatics, Krida Wacana Christian University, Jakarta, 11470, Indonesia
Endah Kristiani
iAmbition Technology Inc., Taichung City, 412031, Taiwan, R.O.C.
Cheng-Tian Jiang

Contributions

W-CS and C-TY conceived the presented idea, developed the theory, and supervised the findings of this work. C-TJ and EK verified the analytical methods and performed the computations. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Chao-Tung Yang.

Ethics declarations

Conflict of interest

The authors declare that they have known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shih, WC., Yang, CT., Jiang, CT. et al. Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning. J Supercomput 79, 4983–5012 (2023). https://doi.org/10.1007/s11227-022-04802-y

Accepted: 27 August 2022
Published: 06 October 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s11227-022-04802-y
