1 Introduction

The Internet has become an important part of our society in many areas, such as science, economics, government, business, and personal daily life. Furthermore, an increasing amount of critical infrastructure, such as industrial control systems and power grids, is managed and controlled via the Internet. However, today's cyberspace is full of threats, such as distributed denial-of-service attacks, information phishing, and vulnerability scanning. Recent news also points out that attackers can build large organizations consisting of different types of devices and bypass existing defenses by changing communication patterns [6].

Despite these difficulties, network anomaly detection continues to make progress. Network anomaly detection systems identify network anomalies by monitoring and analyzing network traffic and searching for anomalous flows with malicious behaviors. Despite their high accuracy rates and low false alarm rates, these systems are easily evaded by recent collaborative anomalies, even anomalies whose packets are similar to legitimate traffic. Moreover, existing detection modules rarely consider communication relationships, such as the communication patterns between inner and outer networks, so it is difficult to characterize anomalies from a single flow alone.

Collective network anomalies are among the major kinds of network anomalies today [12]. According to the communication volume of the traffic flows, individual attack steps can be roughly classified into two types, namely high-intensity attacks and low-intensity attacks. High-intensity attacks such as Smurf [9] transmit packets at a high rate, causing a sudden surge in traffic volume, whereas low-intensity attacks such as Slowloris [19] are application-level attacks that do not rely on intensity and volume to break down network services. Other low-intensity attacks do not attempt to completely disrupt the service but rather degrade the quality of service over a long period of time to cause economic damage or reduce server performance, which means that not all packets involved in the attack are detected. In complex attacks, an attacker floods the target system with low-intensity, highly targeted, and application-specific traffic.

Recent research has revealed challenges that were not tackled in previous work. We briefly describe two of these modern challenges for network anomaly detection that we address in this work.

Challenge 1: Low-Intensity Volume of Network Anomaly Traffic Might Seem Innocuous. Most traditional methods analyze statistical changes in traffic volume. The statistical features of network flows include the number of flows, the number of packets per flow, the average packet length of the flow, the network protocol flags, and the duration of the flow. If a statistical mismatch in those values is detected, the event is reported as an anomaly. For example, if the number of distinct destination port numbers suddenly increases, it is assumed that some hosts are conducting port scanning. However, anomalous activities that do not increase the number of packets are not visible in a large network, so such volume-based anomaly detection techniques cannot detect low-intensity attacks.

Challenge 2: More Advanced Evasion Techniques Can Conceal Real Attacks. Even though some flow features can capture anomalous traffic that exhibits statistical variations in traffic volume, they cannot capture correlations in communications between hosts because they focus on the statistics of traffic sent by individual hosts. For example, command-and-control communications, the major source of advanced cyber anomalies, consist of attackers that originate attack commands. The attackers form a cluster in which they communicate with each other. With traffic-statistics-based detection methods, we can detect partial anomalies but cannot capture the overall anomaly, especially for concealed network attacks.

To address the above two challenges, we adopt a communication graph embedding method to represent the structural similarities of communication patterns by calculating the similarities between graphs generated from network flows. Because network anomalies may not occur as a single action on a single host, they may instead appear as many small, seemingly innocuous multi-host behaviors. Attackers might perform several small probe actions that are temporally spaced, or they might initiate a massive denial-of-service attack. Each kind of anomaly has a specific pattern over a particular feature set.

In this paper, we investigate the benefit and efficiency of applying the graph kernel embedding approach to improve deep neural networks. Our proposed approach aims to improve the accuracy of the network anomaly detection process by analyzing an attributed communication graph, which stores a representation of the status of the participating networks in a specific time interval. Our contribution in this paper is threefold:

  • To understand the new network threat, we propose a novel network flow analytics architecture for low-intensity network anomalies that can easily evade other detection approaches;

  • To detect anomalous flows and improve the accuracy of the network anomaly detection process, we develop the first work to systematically combine graph kernels with deep neural networks for network anomaly detection;

  • To evaluate the proposed method, we build a working system that holds a sliding-time-window view of the entire network based on structural similarity analysis of network flows, making it well suited to machine learning techniques. We thoroughly evaluate our system on two real-world datasets and demonstrate that the proposed method outperforms traditional machine learning solutions and the baseline deep neural network detection solutions in terms of accuracy.

The rest of the paper is organized as follows. Background and previous work related to this paper are discussed in Sect. 2. Section 3 describes the threat model and our approach. The proposed anomaly detection framework is presented in Sect. 4, followed by the empirical evaluation in Sect. 5. Finally, Sect. 6 concludes this paper.

2 Background and Related Work

We first lay out background information on network evasion attacks against anomaly detection models. Then we review related prior research.

2.1 Network Evasion Attack Against Anomaly Detection Models

Since attackers have become more organized, skilled, and professional, research in network anomaly detection must always take into account the presence of evasion attacks. While some detection papers discuss evasion resistance, there are rarely practical methods for increasing the evasion resilience of anomaly detection models. In fact, the purpose of an evasion attack is to move an anomalous flow toward a benign one, or to mimic a specific benign point. Therefore, the attacker attempts to modify the features of the anomalous flow so that the anomaly detection model labels the flow as benign.

Evasion attacks are limited in two ways. First, some features may not be modifiable to the extent required to perform the attack. There is also a limit on how much the anomalous activities can be changed, because the activities still need to serve their original, anomalous purposes. For example, the target IP addresses of network flows cannot be modified arbitrarily, so the communication patterns of anomalous activities are relatively stable. Second, an evasion attack requires knowledge of what the production detection model considers benign, which is usually not available.

2.2 Deep Neural Network Based Network Anomaly Detection

Deep neural networks have shown tremendous performance in a variety of application areas, such as image, speech, and video analysis tasks, although they take high-dimensional inputs and have high computational requirements. For network anomaly detection, deep neural networks increase the detection rate of known anomalies and reduce the false positive rate for unknown anomalies.

Alom et al. [1] train deep belief network models for identifying unknown anomalies and evaluate their performance on intrusion detection datasets. Their proposed system not only detects anomalies but also classifies them, achieving 97.5% accuracy with only fifty training iterations. Alrawashdeh et al. [2] explore the anomaly detection capabilities of the restricted Boltzmann machine and the deep belief network. They outperform earlier works in both detection speed and accuracy, achieving a detection rate of 97.9% while presenting machine learning approaches for predicting anomalies in an understandable way. The deep belief neural network of [11] achieves accuracy as high as 99% by combining neural networks with semi-supervised learning on a small percentage of labeled samples.

In the case of Convolutional Neural Network (CNN)-based and Long Short-Term Memory (LSTM)-based anomaly detection, Zhang et al. [20] propose a specially designed CNN to detect network anomalies. They find that only light preprocessing is needed, since the tedious feature extraction is done by the CNN itself, and the experimental results show that the designed CNN model performs well. Chawla et al. [4] employ a recurrent neural network (RNN) as the model for host-based intrusion detection systems, which determine legitimate behavior based on sequences of system calls. Kim et al. [7] propose a system-call language-modeling method for host-based intrusion detection systems. Their method learns the semantic meaning and interactions of each system call, which had not been considered before, and its validity and effectiveness are demonstrated on public benchmark datasets.

2.3 Graph Based Anomaly Detection

Several graph-based and machine learning techniques have been investigated over the last two decades for network anomaly detection. Our previous work [18] proposed an effective deep graph feature extraction method to distinguish network anomalies in network flows.

The graph kernel approach is a rapid feature extraction scheme whose runtime scales only linearly in the number of edges of the graphs and the length of the subgraph sequence; it converts a high-dimensional graph into a feature vector space, and the scalar product of two graphs in this space measures their similarity. Generally, graph kernels are used to define the similarity function of two graphs. One effective approach is the Weisfeiler-Lehman (WL) kernel [16]. However, the WL kernel only supports discrete features, and its memory cost grows linearly with the number of training examples. Other methods include the deep graph kernel [17] and graph invariant kernels [13], which compare graphs based on the existence or counts of small substructures, such as shortest paths [3], graphlets, subtrees, and other graph invariants [13]. We adopt the general shortest-path kernel as one possible labeling procedure to compute receptive fields.

3 Threat Model and Our Approach

To better understand the efficiency of the method proposed in this paper, we describe the threat model and the defense model. The threat we study as a use case is a kind of denial-of-service attack that exhausts resources using low-intensity traffic flows, as described in [8] and illustrated in Fig. 1. We consider an outside attacker that can send arbitrary network flows to a server or link that is vulnerable to network anomalies. The attacker can exploit the vulnerability by sending carefully crafted flows that consume a significant amount of resources. The attacker's goal is to occupy the available resources by sending multiple flows in parallel at low intensity.

Fig. 1. A use case for the threat model: a low-intensity version of the network anomalies.

In low-intensity attacks, the attacker can use controller servers to attack the targets. To achieve this goal, the attacker directs botnets controlled by these servers to send legitimate-looking, low-intensity network flows towards certain target servers. Since the anomalous flows are well-formed, they cannot easily be distinguished from legitimate flows through statistical features such as the average packet size. The attacker can also send legitimate flows to hide attack traces among legitimate network flows. The attacker does not send numerous attack flows within a very short time interval, because volumetric attacks with high intensity can easily be detected by network-based anomaly detection, and low-intensity attacks are already sufficient to affect the network.

3.1 Problem Definition

Given a set of network flows, we can detect several kinds of anomalous network flows, but some kinds of anomalies evade detection when several statistical flow features are modified. In fact, similar communication patterns show up on communication graphs, for example, when network flows come from similar sources and go towards similar destinations.

Communication graphs are used as an abstraction of network communication patterns. Hosts are represented by vertices, while communication between them is indicated by edges, weighted by the statistical features of the exchanged data. Given a directed multigraph \(G=(V,E)\), the purpose of anomaly detection is to relate a set of vertices to a set of edges. An anomaly at time interval \([t,t+1]\) attacks along a set of edges \(E_{A(t)} \subseteq E\), affecting the connectivity of a set of vertices \(V_{A(t)} \subseteq V\). As described in [8], the goal of anomaly detection is to find the malicious host set and detect the malicious communication flows up to \(t+1\).
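As a concrete illustration of this abstraction, the following minimal sketch builds a weighted communication graph from the flow records of one time window using networkx; the field names (src_ip, dst_ip, bytes, packets, duration) are placeholders rather than the exact schema used in the paper.

```python
# Minimal sketch: a communication graph for one time window [t, t+1].
# Vertices are hosts; each flow adds a directed edge weighted by its
# statistical features. Field names are illustrative placeholders.
import networkx as nx

def build_communication_graph(flows):
    """flows: iterable of dicts, one per network flow observed in the window."""
    g = nx.MultiDiGraph()
    for f in flows:
        g.add_edge(
            f["src_ip"], f["dst_ip"],
            bytes=f["bytes"],
            packets=f["packets"],
            duration=f["duration"],
        )
    return g

if __name__ == "__main__":
    flows = [
        {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": 1200, "packets": 10, "duration": 0.5},
        {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "bytes": 900, "packets": 8, "duration": 0.4},
    ]
    g = build_communication_graph(flows)
    print(g.number_of_nodes(), g.number_of_edges())
```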

Assume that \(p(e)\) expresses the probability that edge \(e\) carries malicious traffic at time interval \([t,t+1]\), \(p(v)\) is the probability that vertex \(v\) is a target at time interval \([t,t+1]\), and \(p(e,v)\) is the probability that \(e\) attacks \(v\) in this time interval. The goal is to minimize the total probability \(p(e)\) over all time intervals. In the neural network model, the system needs to extract information about \(p(v)\) and seeks to minimize \(p(v)\). However, attackers are able to make small changes to evade single-time-interval detection without much effort, while it is difficult to evade the overall communication graph pattern, which graph kernels are able to describe; that is, graph kernels can capture the relationship between \(p(v)\) and \(p(e)\), namely \(p(e,v)\).

To estimate the similarity between the communication patterns of a pair of communication graphs, we adopt graph kernel functions, which rely on graph comparison. Network flows that occur at different times have no direct link between them, so we construct a communication graph from the network flows within each specific time interval.

3.2 Communication Graph Embedding

Most existing machine learning and neural network algorithms are applicable only to vector-based data. To enhance the interpretability of deep neural networks and learn an accurate graph feature embedding, it is necessary to extract the graph similarity features in advance. Algorithm 1 presents the details of the communication graph feature representation.

First, we extract the statistical flow features. These features are extracted from each network flow, which is directed from a source IP address to a destination IP address and contains statistical features that are commonly used for anomaly detection.

Then we construct the communication graph for the network flows within each specific time interval. Given a set of graphs, many algorithms have been designed and widely used for similarity analysis on graphs. While determining all paths is NP-hard, finding special subsets of paths is not necessarily so; determining longest paths in a graph is also NP-hard. To overcome this problem, the shortest-path similarity measure [3] is selected in this paper, which captures diverse aspects of the network communication pattern and can be computed efficiently.

The shortest-path measure compares the similarity of two communication graphs according to the sorted hosts and the lengths of the shortest paths. The similarity metrics are constructed from the network flows of each pair of hosts, denoted i and j. The shortest path d(i, j) represents the shortest distance from host i to host j. Hosts separated by short distances are more likely to interact, which is exactly what our model wants to detect, recorded as p(e, v) (see Sect. 3.1).

At the same time, the shortest path also describes whether a pair of hosts communicate toward or away from each other in the network flow graph. As discussed in [5], given the small-world property of networks, we can limit the distance searched before stopping and assume no further paths exist.

Algorithm 1. Communication graph feature representation.
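The following sketch illustrates the spirit of Algorithm 1 under simplifying assumptions: each communication graph is summarized by a histogram of shortest-path lengths (capped by a small-world cutoff), and the kernel value for a pair of graphs is the inner product of their histograms. This is a simplified stand-in for the general shortest-path kernel, not the paper's exact procedure.

```python
# Hypothetical sketch of a shortest-path feature embedding for communication
# graphs; the length cap and histogram features are simplifying assumptions.
from collections import Counter

import networkx as nx
import numpy as np

def shortest_path_features(g, max_len=6):
    """Histogram of shortest-path lengths up to max_len (small-world cutoff)."""
    und = nx.Graph(g)  # treat communication as undirected for distances
    counts = Counter()
    for _, dists in nx.all_pairs_shortest_path_length(und, cutoff=max_len):
        counts.update(d for d in dists.values() if d > 0)
    return np.array([counts.get(k, 0) for k in range(1, max_len + 1)], dtype=float)

def shortest_path_kernel(g1, g2, max_len=6):
    """Similarity of two graphs as the inner product of their path histograms."""
    return float(np.dot(shortest_path_features(g1, max_len),
                        shortest_path_features(g2, max_len)))
```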

4 Combining Graph Kernels with Deep Neural Networks: The AnoNG Framework

We propose the AnoNG framework, which combines deep neural networks and graph kernels from two different angles. We adopt LSTM and CNN as our baseline deep neural network models because they are effective for network anomaly detection. There are two ways to introduce the kernel vectors into the DNNs: we can append them either to the input vectors or to the softmax classifier. To determine which strategy works best, we perform a pilot study. We found that the first method makes the feature dimension too large, so we apply a dimensionality reduction method before feeding the DNN. With the second method, the softmax classifier tends to rely heavily on the kernel vectors, and the result is similar to using kernel features alone, so we design a weight variable to calculate the final results. The architecture of the AnoNG framework is shown in Fig. 2.

Fig. 2. Overview of the network anomaly detection framework using deep neural networks and graph kernels (AnoNG).

4.1 Integrating Graph Kernels at Input Level

At the input level, we use the evaluation outcomes of the graph kernels as features fed to the DNN models, as shown in Fig. 3(a). The number of graph features used for network anomaly detection can differ by situation: we extract network flow features according to their effectiveness, while the number of graph kernel features can change depending on computational efficiency.

Because of the dimension mismatch illustrated above, we cannot directly add or multiply the kernel vectors with the flow features. To solve this issue, we extend the flow features with the kernel features to form aggregated features as the input of the DNNs, as sketched below.
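A minimal sketch of this aggregation, assuming the flow features and the kernel features are already computed as NumPy arrays; the use of PCA and the number of components are our own illustrative choices for the dimensionality reduction step.

```python
# Sketch of the input-level combination: concatenate statistical flow features
# with graph-kernel features, then reduce the dimension before the DNN.
# PCA and n_components are illustrative choices, not the paper's exact setup.
import numpy as np
from sklearn.decomposition import PCA

def aggregate_features(flow_feats, kernel_feats, n_components=32):
    """flow_feats: (n_samples, d_flow); kernel_feats: (n_samples, d_kernel)."""
    combined = np.hstack([flow_feats, kernel_feats])
    n_components = min(n_components, combined.shape[1])
    return PCA(n_components=n_components).fit_transform(combined)
```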

4.2 Integrating Graph Kernels at Output Level

At the output level, graph kernels are used to improve the output of the DNNs, as shown in Fig. 3(b). Since the graph kernels used at the input level only extend the statistical network flow features, we can also exploit the further expressiveness of the DNNs by combining the softmax classifier results of the two models into the final result.

For instance, the softmax classifier of the graph kernels outputs the probability distribution \(K_p=(K_{p1}, K_{p2})\) over the two classes, and the DNN outputs the probability distribution \(N_p=(N_{p1}, N_{p2})\) over the two classes. We define the final probability of the label value for a network flow as:

$$\begin{aligned} p=a_1K_p+a_2N_p \end{aligned}$$
(1)

where \(a_1+a_2=1\), \(K_p\) is the probability distribution produced by the graph kernels, and \(N_p\) is the probability distribution produced by the baseline deep neural network model. The weights \(a_1\) and \(a_2\) are trainable and indicate the overall confidence in the anomalous or benign labels. We initialize the weights to 0.1 and 0.9 and tune them with a step of 0.1; a finer step such as 0.01 has the same effect, so a step of 0.1 is precise enough to separate the two classes.
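A small sketch of this output-level fusion, assuming the two softmax distributions are available as NumPy arrays; the grid search over \(a_1\) with step 0.1 mirrors the procedure described above, while the validation-accuracy selection criterion is an assumption.

```python
# Sketch of Eq. (1): fuse the graph-kernel and DNN softmax outputs with a
# weight a_1 chosen from {0.1, ..., 0.9} by validation accuracy (assumed).
import numpy as np

def combine(k_p, n_p, a1):
    """k_p, n_p: (n_samples, 2) softmax outputs; returns fused probabilities."""
    return a1 * k_p + (1.0 - a1) * n_p

def search_a1(k_p, n_p, labels):
    best_a1, best_acc = 0.1, -1.0
    for a1 in np.arange(0.1, 1.0, 0.1):
        preds = combine(k_p, n_p, a1).argmax(axis=1)
        acc = float((preds == labels).mean())
        if acc > best_acc:
            best_a1, best_acc = a1, acc
    return best_a1, best_acc
```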

Fig. 3. Combination of graph kernels and deep neural networks.

5 Evaluations

In this section, we evaluate the proposed AnoNG framework. To evaluate the anomaly detection performance of the proposed AnoNG detection module, we adopt the UNSW-NB15 [10] dataset and the CIC-IDS-2017 [15] dataset. Both datasets label the anomalous network flows, which means we have ground truth for the traffic. We divide each dataset into training and test sets using a 70%/30% ratio.

5.1 Dataset Description

UNSW-NB15 Dataset: This dataset was created in a synthetic environment at the UNSW cybersecurity lab. The data presents a mixture of real modern normal traffic and contemporary synthesized attack activities. It has 47 features, including new low-intensity attack scenarios, and 9 significant categories of cyberattacks. We select only the following statistical network flow features from this dataset: dur, sbytes, dbytes, Sload, Dload, Spkts, Dpkts, Sintpkt, Dintpkt.

CIC-IDS-2017 Dataset: This is a newer dataset that contains more recent forms of anomalies, such as high-intensity application-layer attacks generated using botnets and low-intensity attacks generated using the Slowloris tool. We also select 8 statistical network flow features from this dataset, namely Flow Duration, Total Fwd Packets, Total Backward Packets, Total Length of Fwd Packets, Total Length of Bwd Packets, Flow Bytes/s, Flow Packets/s, and Average Packet Size.
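For illustration, a possible loading step that keeps only the selected statistical features and the label column is sketched below; the CSV path and the label column name are placeholders, not the datasets' actual file layout.

```python
# Hypothetical loading sketch: keep only the selected flow features and label.
# File names and the label column are placeholders.
import pandas as pd

UNSW_FEATURES = ["dur", "sbytes", "dbytes", "Sload", "Dload",
                 "Spkts", "Dpkts", "Sintpkt", "Dintpkt"]

def load_flows(csv_path, feature_cols, label_col="Label"):
    df = pd.read_csv(csv_path)
    X = df[feature_cols].astype(float).values
    y = df[label_col].values
    return X, y

# Example (path and label column assumed):
# X, y = load_flows("UNSW-NB15.csv", UNSW_FEATURES, label_col="label")
```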

5.2 Experimental Environment

We implemented a proof-of-concept version of our AnoNG framework (see Sects. 3 and 4). Python 3, Scikit-learn, and Keras with the TensorFlow backend are used as the software frameworks, running on the Ubuntu 16.04 64-bit OS. The server is a DELL R720 with 16 CPU cores, 128 GB of memory, and an NVIDIA Quadro GPU.

For evaluation, we report precision, recall, and F1-score, which depend on the four quantities true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), for the anomaly detection results achieved by all methods.
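These metrics follow the standard definitions:

$$\begin{aligned} \text {Precision}=\frac{TP}{TP+FP}, \quad \text {Recall}=\frac{TP}{TP+FN}, \quad \text {F1}=\frac{2\cdot \text {Precision}\cdot \text {Recall}}{\text {Precision}+\text {Recall}} \end{aligned}$$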

We conduct three experiments on the two real-world datasets: baseline DNNs, integrating graph kernels at the input level of the DNNs, and integrating graph kernels at the output level of the DNNs. We select t = 60 s as the time window for constructing communication graphs, as it has been shown to be a suitable classification point. All experiments were repeated 10 times and we report the means of the evaluation metrics for all algorithms. The best results are shown in bold. For the compared methods, both LSTM and CNN are considered as baselines. To provide a better overview of the performance of the baseline approaches on the statistical network flow features, the overall precision, recall, and F1-score are presented in Table 1.

We set the dimensions of the CNN layers to input-128-256-128-output for both datasets. The dimensions of the LSTM layers are also set to input-128-256-128-output, where the output dimension equals the statistical feature dimension, which varies between datasets. All layers are densely (fully) connected. To train the deep learning models, we set the batch size to 128, the learning rate to 0.001, the number of training epochs to 10, and the dropout rate to 0.5. The Adam optimizer is adopted as the gradient descent variant.
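The following Keras sketch shows one way to realize the dense input-128-256-128-output configuration with the stated hyperparameters; the two-class softmax head and the cross-entropy loss are our simplifications, and the actual CNN and LSTM baselines add convolutional or recurrent layers not shown here.

```python
# Minimal sketch of a dense baseline (input-128-256-128-output, dropout 0.5,
# Adam with learning rate 0.001). The two-class softmax head is an assumed
# simplification; the paper's CNN/LSTM baselines are more elaborate.
from tensorflow.keras import layers, models, optimizers

def build_baseline(input_dim):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_baseline(input_dim=8)
# model.fit(X_train, y_train, batch_size=128, epochs=10)
```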

Both datasets are divided into a training set and a test set, each a mixture of anomalous and benign flows. A deep neural network model is then trained on the training set, and its prediction errors on the training set are fit to a multivariate Gaussian using maximum likelihood estimation. The loss function we adopt is the sigmoid cross entropy. The threshold for discriminating between anomalous and benign values is then determined by maximizing the accuracy with respect to the threshold. Finally, the trained model is used to predict on the test set, and the results are recorded and compared.
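A simple sketch of the threshold selection step, assuming an anomaly score per training flow (for example, the Gaussian negative log-likelihood of the prediction error); the candidate grid and the score definition are illustrative assumptions.

```python
# Sketch: sweep a threshold over training anomaly scores and keep the one
# that maximizes accuracy. Score definition and grid are assumptions.
import numpy as np

def choose_threshold(scores, labels, n_candidates=100):
    """scores: anomaly scores on training flows; labels: 1 = anomalous, 0 = benign."""
    candidates = np.linspace(scores.min(), scores.max(), n_candidates)
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = float(((scores >= t).astype(int) == labels).mean())
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```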

5.3 Performance on DNN with Graph Kernels at Input Level (GKIL)

Table 1 details the results of the experiments. According to our experimental results, we believe that the AnoNG framework is successful because it consistently generates highly accurate results on both real-world datasets when identifying anomalous and benign flows. From the results, we can see that AnoNG significantly outperforms the baseline methods. We also compare AnoNG's accuracy on the network anomaly detection task against the random forest classifier proposed in [14, 15].

Notably, the communication graph features improve the recall of the CNN model on the UNSW-NB15 dataset from 0.90105 to 0.99757 and its F1-score from 0.93292 to 0.96990, as well as the precision of the LSTM model on the CIC-IDS-2017 dataset from 0.95420 to 0.98847 and its F1-score from 0.95719 to 0.98776.

Fig. 4. Precision, recall and F1-score with respect to parameter \(a_1\).

5.4 Performance on DNN with Graph Kernels at Output Level (GKOL)

To understand the necessity of graph kernels at the output level, we further measure the performance of the GKOL algorithm. The validation results for GKOL are equally promising.

After training and obtaining representations of the statistical network flow features and the graph kernels, for each DNN model we vary the weight \(a_1\) from 0.1 to 0.9. The precision, recall, and F1-score results are shown in Fig. 4, from which we can observe that GKOL is consistently above the baseline methods and achieves statistically significant improvements on all metrics. Taking \(a_1 = 0.6\) as an example, LSTM with GKOL achieves the best performance on the CIC-IDS-2017 dataset. Therefore, we conclude that GKOL maintains better performance in anomaly detection tasks than the other methods.

Table 1. Precision, recall and F1-score over the datasets of different methods. The best performance is shown in BOLD.

5.5 Discussions

AnoNG is designed based on learning the features of network communication graphs. On the one hand, the evaluation verifies its capability to effectively detect existing low-intensity attacks in large-scale networks, since their profiles involve communication patterns that can be extracted by constructing communication graphs. On the other hand, graph kernel embedding aims to find small variations in communication patterns between benign and anomalous flows. If the variance of these features is low, the performance of the technique will decrease; in that case, feature evaluation methods can be adopted to select the most highly varied features by testing the features and their principal components.

6 Conclusion

In recent years, graph-based and DNN-based network anomaly detection has attracted more and more research attention. In this paper, we propose a novel framework called AnoNG that addresses new network anomaly detection tasks by integrating graph kernels with deep neural networks; we apply it to the network anomaly detection problem and investigate different methods of combining deep neural networks with graph kernels. Experiments show that AnoNG outperforms existing deep neural network models on two real-world network anomaly detection datasets.