Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset

  • Conference paper
Intelligent Systems and Applications (IntelliSys 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1037)

Abstract

The application of machine learning techniques to network intrusion detection has become popular over the last decade. Owing to the nature of network intrusions, where some attack types occur frequently and others only rarely, the datasets available for training machine learning algorithms, i.e. classifiers, are imbalanced. For example, the most widely used network Intrusion Detection System (IDS) dataset, the KDD Cup 99 dataset, is known to be imbalanced: there is a considerable disparity in the number of attack occurrences (i.e. instances) across its classes. The resulting data complexity (e.g., irrelevant features, class imbalance) influences how effective a learning task will be when this dataset is used to train a machine learning classifier. A typical machine learning based IDS employs at least two pre-processing stages, data resampling and feature selection, within its data pre-processing pipeline. The impact of data resampling and of feature selection, each considered separately, on the classification accuracy of classifiers has been investigated in detail in the literature. However, the question of whether feature selection should be performed before or after resampling when tackling imbalanced datasets such as the KDD Cup dataset has not been investigated, nor has the impact of this ordering within the data pre-processing pipeline on the performance of different classifiers. This paper centres on the combined use of resampling techniques and feature selection approaches within the data pre-processing pipeline of an IDS, and explores which ordering achieves the better classification results for a given classifier. Seven feature selection methods are studied alongside a widely used resampling technique, and their impact on three widely used classification algorithms, Naïve Bayes, Random Forest and Stacking, is investigated. The performance of the classifiers is examined in detail to determine which should come first: resampling or feature selection.
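
The ordering question studied in the paper can be illustrated with a minimal sketch. The snippet below is not the authors' code: it assumes scikit-learn and imbalanced-learn, uses SMOTE as a stand-in for the resampling technique, a chi-squared SelectKBest filter as a stand-in for one of the feature selection methods, and a small synthetic imbalanced dataset in place of KDD Cup 99. It builds the two alternative pipelines, feature selection before resampling and resampling before feature selection, and compares them with cross-validated F1 scores.

```python
# Hedged sketch: compares the two pre-processing orders discussed in the abstract.
# Assumptions (not from the paper): SMOTE as the resampler, chi2-based SelectKBest
# as the feature selector, Random Forest as the classifier, synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies resampling to training folds only

# Synthetic imbalanced data standing in for the KDD Cup 99 dataset.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           weights=[0.95, 0.05], random_state=42)

scaler = ("scale", MinMaxScaler())          # chi2 requires non-negative features
clf = ("rf", RandomForestClassifier(n_estimators=100, random_state=42))

# Order 1: feature selection first, then resampling.
fs_then_smote = Pipeline([scaler,
                          ("select", SelectKBest(chi2, k=15)),
                          ("smote", SMOTE(random_state=42)),
                          clf])

# Order 2: resampling first, then feature selection.
smote_then_fs = Pipeline([scaler,
                          ("smote", SMOTE(random_state=42)),
                          ("select", SelectKBest(chi2, k=15)),
                          clf])

for name, pipe in [("FS -> SMOTE", fs_then_smote), ("SMOTE -> FS", smote_then_fs)]:
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Using imbalanced-learn's Pipeline (rather than applying SMOTE to the whole dataset up front) keeps resampling inside the cross-validation loop, so the comparison of the two orderings is not biased by synthetic minority samples leaking into the test folds.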

Author information

Corresponding author

Correspondence to I. Al-Mandhari.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Al-Mandhari, I., Guan, L., Edirisinghe, E.A. (2020). Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset. In: Bi, Y., Bhatia, R., Kapoor, S. (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1037. Springer, Cham. https://doi.org/10.1007/978-3-030-29516-5_45
