
A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms

Annals of Telecommunications

Abstract

Machine learning mechanisms for network intrusion detection systems lack accurate evaluation, comparison, and deployment due to the scarcity of well-constructed datasets. In this paper, we present a statistical analysis of the features contained in four widely used security datasets. We conclude that the analyzed datasets should not be used as benchmarks for creating novel anomaly-based mechanisms for intrusion detection systems. These datasets bias classification because their features are over-correlated and most of the features can completely distinguish normal flows from attack flows. Our proposed methodology analyzes the correlation among features instead of checking for redundant values or data imbalance. The results align with the performance of three machine learning techniques. We show that the biased classification stems from a significant difference between attack and normal data. The synthetically generated features are statistically different between the normal and attack classes, which implies overfitting in the machine learning approaches.
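The two checks described in the abstract, correlation among features and per-feature separation of normal and attack flows, can be illustrated with a minimal sketch. The abstract does not name the statistics used, so this example assumes Pearson correlation for feature pairs and a normalized Mann-Whitney U statistic for class separation. The feature names, the toy values, the 0.9 threshold, and all function names are hypothetical and do not come from the analyzed datasets.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant feature: r is undefined, treated as 0 here
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.9):
    """Feature-name pairs whose |r| reaches the (assumed) threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(features), 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]


def separation_score(normal, attack):
    """Mann-Whitney U of the attack sample divided by n1 * n2.

    Ties count 0.5. A score of 0.0 or 1.0 means the two classes do not
    overlap at all on this feature; 0.5 means no separation.
    """
    u = sum(
        1.0 if a > n else 0.5 if a == n else 0.0
        for a in attack
        for n in normal
    )
    return u / (len(attack) * len(normal))


def split_by_label(values, labels):
    """Split one feature column into (normal, attack) lists; label 1 = attack."""
    normal = [v for v, y in zip(values, labels) if y == 0]
    attack = [v for v, y in zip(values, labels) if y == 1]
    return normal, attack


# Hypothetical toy flows: three normal (label 0) and three attack (label 1).
flows = {
    "duration": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "bytes": [10.0, 20.0, 30.0, 100.0, 110.0, 120.0],
    "ttl": [64.0, 128.0, 64.0, 128.0, 64.0, 128.0],
}
labels = [0, 0, 0, 1, 1, 1]

pairs = correlated_pairs(flows)
scores = {
    name: separation_score(*split_by_label(col, labels))
    for name, col in flows.items()
}
```

In this toy data, `duration` and `bytes` are flagged as a correlated pair, and both separate the classes completely, so a classifier could reach perfect accuracy from either one alone. That is the kind of dataset-induced bias the abstract warns about. A real analysis would also need a significance test on U and a correction for testing many features.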

Notes

  1. Available at https://resources.stottandmay.com/hubfs/Research/Cyber%20Security%20in%20Focus%202020_web-2.pdf.

  2. Available at https://www.unb.ca/cic/datasets/nsl.html.

  3. Available at https://tcpreplay.appneta.com/.

  4. Available at https://github.com/DanielArndt/flowtbag.

Funding

This research was made possible by funding from CNPq, CAPES, FAPERJ, FAPESP (2018/23062-5), RNP, and the Niterói City Hall (PDPA PMN/UFF/FEC).

Author information

Corresponding author

Correspondence to Nicollas R. de Oliveira.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Silva, J.V.V., de Oliveira, N.R., Medeiros, D.S.V. et al. A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms. Ann. Telecommun. 77, 555–571 (2022). https://doi.org/10.1007/s12243-021-00904-5

  • DOI: https://doi.org/10.1007/s12243-021-00904-5