Feature selection using cloud-based parallel genetic algorithm for intrusion detection data classification

  • Original Article
Neural Computing and Applications

Abstract

With the exponential growth in the amount of data generated, stored and processed daily in machine learning, data analytics and decision-making systems, data preprocessing has established itself as a key factor in building reliable, high-performance machine learning models. One of the roles of preprocessing is variable reduction using feature selection methods; however, the processing time these methods require is a major drawback. This study aims to mitigate this problem by migrating the algorithm to a MapReduce implementation suitable for parallelization across a large number of commodity hardware units. The study focuses on genetic algorithm-based methods. Hadoop, an open-source MapReduce implementation, was used as the framework for implementing the parallel genetic algorithms. Representative machine learning methods, namely SVM (support vector machine), ANN (artificial neural network), RT (random tree), logistic regression and Naive Bayes, were embedded into the implementation for feature selection. The feature selection methods were applied to four NSL-KDD data sets, and the number of features was reduced from approximately 40 to approximately 10 with an accuracy of 90.45%. These results have both significant practical and theoretical impact. On the theoretical side, the genetic algorithm has been parallelized in the MapReduce manner, which had been considered unachievable in a strict sense. Furthermore, the genetic algorithm allows randomness-enhanced feature selection, and its parallelization reduces overall data preprocessing time and allows a larger population count, which in turn leads to better feature selection. On the practical side, it has been shown that this implementation outperforms the existing feature selection methods.
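The general idea described in the abstract, a genetic algorithm whose fitness evaluation is split off as an independent "map" step and whose selection, crossover and mutation form a "reduce" step, can be illustrated with a minimal Python sketch. This is not the authors' Hadoop implementation. The names `fitness`, `map_phase`, `reduce_phase` and `run_ga`, the `INFORMATIVE` feature set, and all parameter values are hypothetical, and the fitness function is a toy stand-in for the accuracy of an embedded classifier.

```python
import random

N_FEATURES = 40                   # roughly the feature count mentioned in the abstract
INFORMATIVE = {0, 3, 7, 12, 21}   # hypothetical "useful" features (toy assumption)

def fitness(mask):
    """Toy stand-in for classifier accuracy: reward informative features,
    lightly penalise every selected feature."""
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.05 * sum(mask)

def map_phase(population):
    """'Map': score each individual independently of the others."""
    return [(fitness(ind), ind) for ind in population]

def reduce_phase(scored, rng, mutation_rate=0.02):
    """'Reduce': tournament selection, one-point crossover, bit-flip mutation."""
    def pick():
        a, b = rng.sample(scored, 2)
        return max(a, b, key=lambda s: s[0])[1]

    best = max(scored, key=lambda s: s[0])[1]
    children = [best[:]]          # elitism: carry the best individual over unchanged
    while len(children) < len(scored):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, N_FEATURES)
        child = p1[:cut] + p2[cut:]
        # flip each bit with probability mutation_rate
        child = [bit ^ (rng.random() < mutation_rate) for bit in child]
        children.append(child)
    return children

def run_ga(generations=60, pop_size=30, seed=0):
    """Run the toy GA and return (best_score, best_mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_FEATURES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population = reduce_phase(map_phase(population), rng)
    return max(map_phase(population), key=lambda s: s[0])

if __name__ == "__main__":
    score, mask = run_ga()
    print("selected features:", [i for i, bit in enumerate(mask) if bit])
    print("fitness:", round(score, 2))
```

Because each individual is scored independently in `map_phase`, that step is the natural candidate for distribution across workers; in a real setting the toy `fitness` would be replaced by training and evaluating a classifier on the selected feature subset, which is where most of the processing time would be expected to go.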

Author information

Corresponding author

Correspondence to Dželila Mehanović.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Mehanović, D., Kečo, D., Kevrić, J. et al. Feature selection using cloud-based parallel genetic algorithm for intrusion detection data classification. Neural Comput & Applic 33, 11861–11873 (2021). https://doi.org/10.1007/s00521-021-05871-5

  • DOI: https://doi.org/10.1007/s00521-021-05871-5
