Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset

  • Conference paper
Intelligent Systems and Applications (IntelliSys 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1037)

Abstract

The application of machine learning techniques to network intrusion detection has become popular over the last decade. Owing to the nature of network intrusions, where some attack types occur frequently and others only rarely, the datasets available for training machine learning algorithms, i.e. classifiers, are imbalanced. For example, the most widely used network Intrusion Detection System (IDS) dataset, the KDD Cup 99 dataset, is known to be imbalanced: there is a considerable disparity in the number of attack occurrences (i.e. instances) across its classes. The resulting data complexity (e.g., irrelevant features, class imbalance) influences how effective a learning task will be when this dataset is used to train a machine learning classifier. A typical machine learning based IDS employs at least two pre-processing stages, data resampling and feature selection, within its data pre-processing pipeline. The impact of data resampling and of feature selection, each considered separately, on the classification accuracy of classifiers has been investigated in detail in the literature. However, the question of whether feature selection should be performed before or after resampling when tackling imbalanced datasets such as the KDD Cup dataset has not been investigated, nor has the impact of this ordering within the data pre-processing pipeline on the performance of different classifiers. This paper centres on the combined use of resampling techniques and feature selection approaches within the data pre-processing pipeline of an IDS, and explores which ordering achieves the better classification results for a given classifier. Seven feature selection methods are studied alongside a widely used resampling technique, and their impact on three widely used classification algorithms, Naïve Bayes, Random Forest and Stacking, is investigated. The performance of the classifiers is examined in detail to determine which should come first: resampling or feature selection.
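
The ordering question studied in the paper can be illustrated with a minimal sketch. The snippet below is not the authors' code: it assumes scikit-learn and imbalanced-learn, uses SMOTE as a stand-in for the resampling technique, a chi-squared SelectKBest filter as a stand-in for one of the feature selection methods, and a small synthetic imbalanced dataset in place of KDD Cup 99. It builds the two alternative pipelines, feature selection before resampling and resampling before feature selection, and compares them with cross-validated F1 scores.

```python
# Hedged sketch: compares the two pre-processing orders discussed in the abstract.
# Assumptions (not from the paper): SMOTE as the resampler, chi2-based SelectKBest
# as the feature selector, Random Forest as the classifier, synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies resampling to training folds only

# Synthetic imbalanced data standing in for the KDD Cup 99 dataset.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           weights=[0.95, 0.05], random_state=42)

scaler = ("scale", MinMaxScaler())          # chi2 requires non-negative features
clf = ("rf", RandomForestClassifier(n_estimators=100, random_state=42))

# Order 1: feature selection first, then resampling.
fs_then_smote = Pipeline([scaler,
                          ("select", SelectKBest(chi2, k=15)),
                          ("smote", SMOTE(random_state=42)),
                          clf])

# Order 2: resampling first, then feature selection.
smote_then_fs = Pipeline([scaler,
                          ("smote", SMOTE(random_state=42)),
                          ("select", SelectKBest(chi2, k=15)),
                          clf])

for name, pipe in [("FS -> SMOTE", fs_then_smote), ("SMOTE -> FS", smote_then_fs)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Using imbalanced-learn's Pipeline (rather than applying SMOTE to the whole dataset up front) keeps resampling inside the cross-validation loop, so the comparison of the two orderings is not biased by synthetic minority samples leaking into the test folds.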

Author information

Corresponding author

Correspondence to I. Al-Mandhari.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Al-Mandhari, I., Guan, L., Edirisinghe, E.A. (2020). Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset. In: Bi, Y., Bhatia, R., Kapoor, S. (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1037. Springer, Cham. https://doi.org/10.1007/978-3-030-29516-5_45
