1 Introduction
Cybersecurity has emerged as a paramount concern across all sectors of the contemporary digital landscape [8, 10]. Accurate and early detection of malicious activity is essential for protecting confidential data and preserving the integrity of the system. Typically, a cyber-attack starts by breaching a restricted part of a network system or a server, bypassing its security mechanisms. Such breaches, collectively known as intrusions, compromise the Confidentiality, Integrity, and Availability (CIA) of a system [5].
To protect network systems against intrusion, various kinds of Intrusion Detection Systems (IDS) have been developed, as shown in Table 1. An IDS is a combination of hardware and software components that detects suspicious attempts on the network [32]. Broadly, intrusion detection can be categorized into three main types [21]:
• Signature-based Detection (SD).
• Anomaly-based Detection (AD).
• Stateful Protocol Analysis (SPA).
The signature-based approach, often referred to as the anti-virus method, works by matching incoming data to pre-existing patterns of known attacks. While effective at identifying previously recognized threats, it falls short in predicting novel or unknown attacks. Conversely, stateful protocol analysis seeks to identify unexpected sequences of commands, but its application is resource intensive. This is where anomaly-based detection gains prominence. By analysing behavioural patterns, it utilizes statistical methods and connection data to flag suspicious network activities. Although it may occasionally generate false positives, this approach excels in detecting previously unseen threats.
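The contrast between the signature-based and anomaly-based approaches can be illustrated with a minimal Python sketch. The signature patterns, the monitored feature (failed login attempts per connection), the baseline values, and the z-score threshold of 3 are all hypothetical choices made for illustration; they are not taken from any IDS discussed in this paper.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signatures of known attacks (illustrative byte patterns only).
KNOWN_SIGNATURES = [b"/etc/passwd", b"' OR 1=1 --"]

def signature_match(payload: bytes) -> bool:
    """Flag a payload only if it contains the pattern of a known attack."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

def is_anomalous(value: float, baseline: Sequence[float], z_threshold: float = 3.0) -> bool:
    """Flag a value that deviates strongly from the baseline of normal traffic."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Failed login attempts per connection during normal operation (assumed data).
normal_failed_logins = [0, 1, 0, 0, 2, 1, 0, 1, 0, 1]

novel_payload = b"GET /index.html"             # carries no known signature
print(signature_match(novel_payload))          # the signature check finds nothing
print(is_anomalous(25, normal_failed_logins))  # unusual behaviour is still flagged
```

The signature check can only recognize what is already in its pattern list, whereas the statistical check flags any sufficiently unusual value, at the cost of occasional false positives when legitimate traffic deviates from the baseline.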
An anomaly-based detection system requires a model built from data. Numerous datasets have been compiled from previous attacks, such as AATCT-IDS [23], LSPR23 [7], CSE-CIC-IDS-2018 [29], KDD CUP 99 [3], and NSL-KDD [42]. Among these, the NSL-KDD dataset stands out for its detailed composition, comprising 41 attributes such as connection time, network protocol, login status, the number of failed login attempts, and root shell usage. Various statistical approaches have been taken to build IDSs from this kind of dataset [14, 31, 40], and some approaches use traditional Machine Learning (ML). However, traditional ML models cannot fully exploit large datasets that involve a huge number of numerical and categorical variables [9, 16, 26, 30].
To address this issue, various Deep Learning (DL) approaches have been proposed [2, 12, 15, 22, 25, 38]. These models can detect intrusions with higher efficiency and higher accuracy. DL offers a more robust training technique than traditional ML algorithms and can learn almost any pattern, but it needs far more data than traditional ML models. However, transfer learning from DL models pre-trained on similar kinds of data can be used with a comparatively small amount of data [24]. The NSL-KDD dataset contains more than 125,000 rows [42], which is enough for transfer learning, so a DL approach is suitable for this dataset.
As one of the most descriptive datasets, NSL-KDD classifies attacks into 39 classes [4]. Several ML and DL models have been built to detect intrusions from the NSL-KDD dataset [28, 36, 41]. A study by S. Alrayes et al. used a Convolutional Neural Network (CNN) and achieved 99.728% accuracy [41]. That model merged 36 attack classes of the dataset into 4 broad classes and was therefore able to categorize intrusions only into those 4 categories. In the current study, by contrast, a model has been built that can detect and categorize intrusions into 12 finer-grained classes for a more precise decision.
Initially, each row of the NSL-KDD dataset was transformed into an image that represents the row. A Convolutional Neural Network (CNN) was then trained on these images.
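One possible way to turn a tabular record into an image is sketched below. This is only an illustration under stated assumptions, not the exact transformation used in this study: the min-max scaling, the blue-to-red color ramp, the square grid layout with black padding, and the function names `normalize`, `to_rgb`, and `row_to_image` are all hypothetical.

```python
import math

def normalize(row, mins, maxs):
    """Min-max scale each feature to [0, 1]; constant features map to 0."""
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(row, mins, maxs)
    ]

def to_rgb(x):
    """Map a value in [0, 1] to an RGB triple on a blue-to-red ramp (assumed colormap)."""
    x = min(max(x, 0.0), 1.0)
    return (round(255 * x), 0, round(255 * (1 - x)))

def row_to_image(row, mins, maxs):
    """Arrange one record as a square grid of RGB pixels, padding with black."""
    side = math.ceil(math.sqrt(len(row)))
    pixels = [to_rgb(x) for x in normalize(row, mins, maxs)]
    pixels += [(0, 0, 0)] * (side * side - len(pixels))
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# A made-up record with 41 numeric features, each assumed to range over [0, 40].
row = list(range(41))
image = row_to_image(row, mins=[0] * 41, maxs=[40] * 41)
print(len(image), len(image[0]))  # grid dimensions
```

With 41 attributes, this layout yields a 7 × 7 grid in which the last eight pixels are padding. Categorical attributes such as the network protocol would first have to be encoded numerically, and the resulting grid would typically be resized to the input resolution expected by the pre-trained network.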
4 Results and Discussion
This section presents the results of the analysis of the various models. It covers the performance of the different architectures and the hyperparameter tuning results of the best-performing architecture.
4.1 Performance of various architectures
In this study, the architectures deployed are ResNet18, ResNet50, ResNet101, ResNet152, and BEiTv2_base_patch16_224, all initialized with pre-trained weights. ResNet18 performed best among all the models. The training results of these models are shown in Table 4, and the details of each model are given below.
4.1.1 ResNet18.
ResNet18 displayed superior performance in comparison with the other models. Over the five epochs, both training and validation losses decreased continuously: the training loss went from 0.1967 to 0.0466, and the validation loss came down to 0.0527 (Figure 3). The corresponding accuracy increased and reached its maximum value of 98.75% after five epochs. Its average training time was also lower (3 sec) owing to the lower complexity of the model architecture. ResNet (Residual Network) is a deep convolutional neural network architecture designed to address the vanishing gradient problem, which can hinder training in deep networks [11]. ResNet18, in particular, has 18 layers and is built upon residual blocks. Each block includes shortcut connections that bypass one or more layers, enabling efficient gradient flow and feature extraction [27]. This architecture balances depth and simplicity, making it effective for datasets like NSL-KDD. ResNet18’s relatively shallow structure compared to deeper architectures like ResNet50 and ResNet101 helped it achieve superior performance on the transformed color-mapped images, with reduced risk of overfitting and lower computational requirements. This balance allowed it to generalize effectively across various intrusion classes while maintaining efficient training times.
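The role of the shortcut connection can be made explicit with the standard residual formulation. The symbols below (input x, output y, residual mapping F with weights W_i, loss L) are introduced here for illustration and do not appear elsewhere in this paper.

```latex
% Residual block: the stacked layers learn a residual F, and the input is added back.
y = F(x, \{W_i\}) + x

% Backpropagation through the block: the identity term I gives the gradient
% a direct path, so it does not vanish even when \partial F / \partial x is small.
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( \frac{\partial F}{\partial x} + I \right)
```

Because of the identity term, stacking more blocks does not by itself attenuate the gradient, which is why residual networks remain trainable at depths of 18 to 152 layers.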
4.1.2 ResNet50.
The ResNet50 architecture also showed quite good accuracy but more variation in its validation loss than ResNet18. Losses were recorded over a period of five epochs. The model achieved a peak accuracy of 98.13%, but its training and validation losses were higher than those of ResNet18 (Table 4). Its training time was also higher than that of the ResNet18 model, owing to the higher complexity of its architecture [17].
4.1.3 BEiTv2_base_patch16_224.
The accuracy of BEiTv2_base_patch16_224 was notably poor on this dataset. Although its performance improved over the epochs, its accuracy was only 74.61% even after the ninth epoch, which is very low compared to the ResNet models. Despite ongoing training, the losses remained large: 1.0111 for the training set and 0.9447 for the validation set. The training and validation losses over epochs are shown in Figure 4. Its training time of 25.8 seconds was notably higher than that of all other models, as shown in Table 4. It is therefore not a good choice for the current system.
4.1.4 ResNet101.
This model had higher accuracy than BEiTv2_base_patch16_224, but its accuracy remained lower than that of ResNet18 and ResNet50. Its accuracy was 97.20%, while ResNet18 reached 98.91%, as shown in Table 4. Its training time was also higher than that of ResNet18 and ResNet50, because ResNet101 has 101 layers whereas ResNet18 and ResNet50 have 18 and 50 layers, respectively, as shown in Table 3. This greater depth reflects the higher complexity of the ResNet101 model, which results in a higher training time. The training and validation losses are shown in Figure 4.
4.1.5 ResNet152.
ResNet152 is a more complex model than the previously discussed ResNet models. Hence, its training time is higher than that of the ResNet models discussed above. However, its accuracy was 95.64%, which is the worst of all the ResNet models evaluated here. Its training and validation losses were also higher than those of ResNet18. After evaluation of all 5 models, ResNet18 was therefore selected as the best-performing model. The training curve of ResNet152 is shown in Figure 5.
4.2 Fine-tuning of ResNet18
The best-performing model of Table 4, ResNet18, was further evaluated to find the best learning rate. The loss vs. learning rate plots for the training set and validation set are shown in Figure 6. The training and validation losses were lowest at a learning rate of 2.2 × 10⁻⁶. At lower learning rates, the loss decreases as the learning rate increases, showing that the model is learning well. Once a threshold of roughly 10⁻⁶ is crossed, however, the loss starts to increase, reflecting unsatisfactory learning behavior. This indicates the sensitivity of the model to the chosen learning rate and emphasizes the importance of selecting it carefully so that convergence is ensured.
After the learning rate was selected, the previously trained ResNet18 model was trained further for more than 50 epochs, as shown in Figure 5. After 4 epochs, however, the training loss continued to decrease gradually while the validation loss increased, which indicates overfitting. The weights from the 4th epoch are therefore the ones that give the best performance of the model.
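The checkpoint selection described above can be sketched in a few lines of Python. The helper names `best_epoch` and `overfitting_starts` and the loss values are hypothetical; the curves are made up for illustration and are not the values measured in this study.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def overfitting_starts(train_losses, val_losses):
    """Return the first 1-based epoch where training loss falls but validation loss rises."""
    for i in range(1, len(val_losses)):
        if train_losses[i] < train_losses[i - 1] and val_losses[i] > val_losses[i - 1]:
            return i + 1
    return None

# Made-up loss curves for illustration only.
train = [0.060, 0.050, 0.045, 0.041, 0.038, 0.035, 0.033]
val = [0.058, 0.054, 0.052, 0.050, 0.053, 0.055, 0.058]

print(best_epoch(val))                 # epoch whose weights would be kept
print(overfitting_starts(train, val))  # first epoch showing the overfitting pattern
```

In practice, the weights are saved whenever the validation loss reaches a new minimum, so the best checkpoint can be restored after training has run past the point where overfitting begins.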
4.3 Confusion matrix
For a thorough evaluation, a confusion matrix was generated for the ResNet18 model. The confusion matrix in Figure 7 summarizes the classification performance of the model for the various classes.
The diagonal elements represent correct predictions, whereas off-diagonal elements reflect misclassifications. The model predicts all the classes except ‘nmap’ and ‘normal’ with 100% accuracy. The misclassifications mainly concern the class ‘nmap’, which was wrongly classified as ‘normal’ five times, while ‘normal’ was wrongly classified once.
These misclassifications can primarily be attributed to overlapping characteristics between certain network traffic features in these classes. For instance, ‘nmap’ is often used in network mapping and reconnaissance, which may exhibit behavioral similarities to benign traffic when observed superficially. This resemblance might cause the model to misclassify ‘nmap’ activity as ‘normal’, particularly when the nuances distinguishing it from legitimate behavior are subtle. The color-mapping approach used to transform the NSL-KDD dataset into an image dataset may also contribute. While CNNs detect spatial patterns in image data with great efficiency, some network traffic patterns may not be sufficiently distinguishable in their color-mapped form. Subtle similarities between the RGB-encoded features of these classes may confuse the model and prevent it from telling them apart correctly.
4.4 Precision, Recall and F1 score
The precision, recall and F1 score of the model are given in Table 5. Six classes have an F1 score of 1, while the F1 score of nmap is very low (0.2857).
The precision of the normal class is 1, meaning that the number of false positives (FP) for the normal class is zero. In other words, no attacks were classified as normal connections by this model, which is essential for an IDS.
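For reference, per-class precision, recall and F1 score can be derived from a confusion matrix as in the following sketch. The function name `per_class_metrics`, the class label `attack_x`, and all counts in the matrix are hypothetical; the matrix is a made-up example and is not the one reported in Figure 7 or Table 5.

```python
def per_class_metrics(cm, labels):
    """Compute (precision, recall, F1) per class.

    cm[i][j] counts samples whose true class is labels[i] and whose
    predicted class is labels[j] (rows = actual, columns = predicted).
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(len(labels))) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                                  # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Hypothetical 3-class confusion matrix (illustrative counts only).
labels = ["normal", "nmap", "attack_x"]
cm = [
    [30, 0, 0],  # actual normal
    [5, 1, 0],   # actual nmap: 1 detected, 5 predicted as normal
    [0, 0, 20],  # actual attack_x
]
for label, (p, r, f1) in per_class_metrics(cm, labels).items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Since F1 = 2TP / (2TP + FP + FN), any class with one true positive and five errors in total has an F1 score of 2/7 ≈ 0.2857, whether those errors are false positives or false negatives. This illustrates how a handful of mistakes on a rare class can drive its F1 score down even when the overall accuracy stays high; the actual counts behind Table 5 may differ from this example.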
4.5 Comparison with some other models
A comparison with other models trained on the NSL-KDD dataset is given in Table 6. The accuracy of our model (98.91%) was better than that of these models, and only one model had a better F1 score than the model of this paper.
5 Conclusion
The research demonstrates the effectiveness of employing pre-trained CNN weights for the detection of intrusions, using the NSL-KDD dataset as a foundational benchmark. By converting the dataset rows into image representations and evaluating a variety of deep learning architectures, including ResNet18, ResNet50, and BEiTv2_base_patch16_224, this study identified ResNet18 as the most effective model, achieving an accuracy of 98.91%. The confusion matrix showed that the model detected every intrusion in the validation set as an intrusion, with no false negatives. This demonstrates the practical applicability of this model in IDSs. The thorough analysis of diverse architectures, complemented by fine-tuning, illustrates how readily transfer learning can be adapted to small datasets without compromising accuracy or efficiency.
The results highlight the substantial benefits of integrating deep learning models, especially in scenarios with a high-dimensional feature space and a relatively small dataset. This is supported by the excellent performance of ResNet18, which has a simple architecture, lower training and validation losses, and a shorter training time. In the validation dataset, only one normal connection was detected as an intrusion, giving a false positive rate (FPR) of 3.05%. However, the 0% false negative rate (FNR) demonstrates the potential of this model for anomaly detection in IDSs.
However, the model still has shortcomings in classifying certain types of attacks, such as ‘nmap’, so there is room for improvement. Future work can use more sophisticated and hybrid datasets, which may improve the attack classification performance of this model even for very rare and modern attack types.