Abstract
Machine learning techniques have become popular for network intrusion detection over the last decade. Because some attacks are rare and others are frequent, the datasets available for training machine learning classifiers are imbalanced in practice. For example, the most widely used network Intrusion Detection System (IDS) dataset, the KDD Cup 99 dataset, is known to be imbalanced: the number of instances (i.e. occurrences of attacks) differs considerably across its classes. The resulting data complexity (e.g., irrelevant features, class imbalance) influences how effective a learning task is when this dataset is used to train a machine learning classifier. A typical machine learning based IDS uses at least two pre-processing stages within its data pre-processing pipeline: data resampling and feature selection. The separate impact of data resampling and of feature selection on classifier accuracy has been investigated in detail in the literature. However, the question of whether feature selection should be performed before or after resampling when tackling imbalanced datasets such as the KDD Cup dataset has not been investigated. Further, the impact of this ordering within the data pre-processing pipeline on the performance of different classifiers has also not been studied. This paper centres on the combined utilisation of resampling techniques and feature selection approaches within the data pre-processing pipeline of an IDS, and explores which order of application achieves the superior classification results for a given classifier. Seven feature selection methods are studied alongside one of the most widely used resampling techniques. The impact on three widely used classification algorithms is investigated: Naïve Bayes, Random Forest and Stacking. The performance of the classifiers is examined in detail to determine which should come first, resampling or feature selection.
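The two pipeline orderings compared in the paper can be illustrated with a minimal sketch. The abstract does not name the resampling technique or the seven feature selection methods, so this example substitutes plain random oversampling and a simple class-mean separation score as stand-ins. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (stand-in resampler)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def select_top_k(X, y, k):
    """Score each feature by the spread of its per-class means divided
    by its overall standard deviation; return the k best indices
    (stand-in filter-style feature selector)."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        class_means = []
        for lab in labels:
            vals = [row[j] for row, l in zip(X, y) if l == lab]
            class_means.append(sum(vals) / len(vals))
        spread = max(class_means) - min(class_means)
        scores.append(spread / var ** 0.5 if var > 0 else 0.0)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, idx):
    """Keep only the selected feature columns."""
    return [[row[j] for j in idx] for row in X]


def resample_then_select(X, y, k, seed=0):
    """Ordering A: balance the classes first, then rank the features."""
    Xr, yr = random_oversample(X, y, seed)
    idx = select_top_k(Xr, yr, k)
    return project(Xr, idx), yr, idx


def select_then_resample(X, y, k, seed=0):
    """Ordering B: rank features on the imbalanced data, then balance."""
    idx = select_top_k(X, y, k)
    Xr, yr = random_oversample(project(X, idx), y, seed)
    return Xr, yr, idx


# Toy imbalanced data: six "normal" rows, two "attack" rows.
# The third feature is constant and therefore carries no information.
X = [
    [0.0, 1.0, 5.0], [0.1, 0.9, 5.0], [0.2, 1.1, 5.0],
    [0.1, 1.0, 5.0], [0.0, 0.9, 5.0], [0.2, 1.1, 5.0],
    [0.9, 1.2, 5.0], [1.0, 1.3, 5.0],
]
y = ["normal"] * 6 + ["attack"] * 2

Xa, ya, idx_a = resample_then_select(X, y, k=2)
Xb, yb, idx_b = select_then_resample(X, y, k=2)
print("resample -> select:", idx_a, Counter(ya))
print("select -> resample:", idx_b, Counter(yb))
```

On this toy data both orderings keep the same features, but in general the feature scores are computed on different class distributions in the two cases, so the selected subsets (and the downstream classifier performance) can differ. In an actual evaluation, resampling would normally be applied to the training split only.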
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Al-Mandhari, I., Guan, L., Edirisinghe, E.A. (2020). Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset. In: Bi, Y., Bhatia, R., Kapoor, S. (eds) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, vol 1037. Springer, Cham. https://doi.org/10.1007/978-3-030-29516-5_45
DOI: https://doi.org/10.1007/978-3-030-29516-5_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29515-8
Online ISBN: 978-3-030-29516-5
eBook Packages: Intelligent Technologies and Robotics (R0)