Abstract
Big data and artificial intelligence (AI) technologies are complex systems that have continued to develop in recent years. This paper implements a data lake architecture to handle massive data and perform data analysis in real time. Using a data lake and an AI model, we deployed a NetFlow storage monitoring system as a platform that covers the storage, query, analysis, and visualization of massive volumes of data. The big data platform was built on Cloudera and uses big data tools such as Kafka, Spark, HBase, Hive, and Impala. In addition, we used Spark to develop network threat recognition models with distributed deep learning, training a deep neural network (DNN) as the model. The model reached 94% accuracy while training time decreased by 48%. These results demonstrate that distributed training significantly shortens deep learning model training time. Additionally, the system is run under several configurations to assess the factors influencing accuracy and performance. The model is evaluated using a confusion matrix to demonstrate that it can accurately detect attack behavior in log data. Furthermore, we developed a real-time log data monitoring and analysis system to demonstrate the proposed architecture.
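As an illustration of the confusion-matrix evaluation mentioned above, the following minimal sketch counts true and false positives and negatives for a binary attack/normal labelling and derives accuracy, precision, and recall. The function names, the label strings, and the eight sample labels are hypothetical and chosen only for this example; they are not taken from the paper's data or code.

```python
from collections import Counter


def binary_confusion(y_true, y_pred, positive="attack"):
    """Count TP, FP, FN, TN for a binary attack/normal labelling."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def summarize(tp, fp, fn, tn):
    """Derive accuracy, precision, and recall from the four counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical ground-truth and predicted labels, for illustration only.
y_true = ["attack", "attack", "normal", "normal",
          "attack", "normal", "normal", "attack"]
y_pred = ["attack", "normal", "normal", "normal",
          "attack", "attack", "normal", "attack"]

tp, fp, fn, tn = binary_confusion(y_true, y_pred)
metrics = summarize(tp, fp, fn, tn)
print(tp, fp, fn, tn)
print(metrics)
```

Accuracy alone can hide missed attacks when normal traffic dominates a log, which is why the sketch also reports precision and recall from the same four counts.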
Data Availability
None
Acknowledgements
This work was supported by the National Science and Technology Council (NSTC), Taiwan (R.O.C.), under grant numbers 111-2622-E-029-003-, 111-2811-E-029-001-, 111-2621-M-029-004-, and 110-2221-E-029-020-MY3.
Author information
Contributions
W-CS and C-TY conceived the presented idea, developed the theory, and supervised the findings of this work. C-TJ and EK verified the analytical methods and performed the computations. All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Shih, WC., Yang, CT., Jiang, CT. et al. Implementation and visualization of a netflow log data lake system for cyberattack detection using distributed deep learning. J Supercomput 79, 4983–5012 (2023). https://doi.org/10.1007/s11227-022-04802-y
DOI: https://doi.org/10.1007/s11227-022-04802-y