Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning

The Journal of Supercomputing

Abstract

Big data and artificial intelligence (AI) technologies are complex and will continue to develop in the coming years. This paper implements a data lake architecture to handle massive data volumes and perform data analysis in real time. Using the data lake and an AI model, we deployed a NetFlow storage monitoring system that serves as a platform covering the storage, query, analysis, and visualization of massive volumes of data. The big data platform was built on Cloudera and uses big data tools such as Kafka, Spark, HBase, Hive, and Impala. In addition, we used Spark to develop network threat recognition models with distributed deep learning, training a deep neural network (DNN) as the model. In our evaluation, the model reached 94% accuracy while training time decreased by 48%. The results demonstrate that deep learning model training time is significantly shortened. Additionally, we tested several configurations to assess the factors that influence accuracy and performance. The model is evaluated using a confusion matrix to demonstrate that it can accurately detect attack behavior in log data. Furthermore, we developed a real-time log data monitoring and analysis system to demonstrate the proposed architecture.
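
The confusion-matrix evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the label names (`attack`, `normal`), the function names, and the toy predictions below are all assumptions made for illustration, and the numbers are unrelated to the results reported in the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="attack"):
    """Count TP, FP, FN and TN, treating `positive` as the attack class."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision and recall from the four counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical toy labels for eight flow records (not data from the paper).
y_true = ["attack", "attack", "normal", "normal",
          "attack", "normal", "normal", "attack"]
y_pred = ["attack", "normal", "normal", "attack",
          "attack", "normal", "normal", "attack"]

counts = confusion_counts(y_true, y_pred)
scores = summarize(counts)
print(dict(counts), scores)
```

Accuracy alone can hide missed attacks when normal traffic dominates the log data, which is one reason to report the confusion matrix alongside it.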

Data Availability

None

Acknowledgements

This work was supported by the National Science and Technology Council (NSTC), Taiwan (R.O.C.), under grant numbers 111-2622-E-029-003-, 111-2811-E-029-001-, 111-2621-M-029-004-, and 110-2221-E-029-020-MY3.

Author information

Contributions

W-CS and C-TY conceived the presented idea, developed the theory, and supervised the findings of this work. C-TJ and EK verified the analytical methods and performed the computations. All authors discussed the results and contributed to the final manuscript.

Corresponding author

Correspondence to Chao-Tung Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shih, WC., Yang, CT., Jiang, CT. et al. Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning. J Supercomput 79, 4983–5012 (2023). https://doi.org/10.1007/s11227-022-04802-y

  • DOI: https://doi.org/10.1007/s11227-022-04802-y
