
1 Introduction

The study of previous research on network traffic analysis (NTA) at the network level [13] discloses three main claimed goals: (a) attack detection, (b) anomaly detection, and (c) traffic classification. These three topics are not the same, but they undoubtedly overlap. For instance, traffic classifications often include classes that are attacks. An anomaly might be an attack, but an attack does not necessarily show itself as an anomaly. The traffic features selected for the analysis obviously play a determining role in whether a network attack expresses itself as an anomaly, but the analysis perspective is also relevant [33]. For example, Distributed Denial of Service (DDoS) attacks usually appear as anomalous peaks in network monitors that observe traffic as time series [12]; however, they are hardly anomalies from a spatial perspective, in which they can account for a significant portion of the total captured traffic—note that DDoS attacks try to harm targets by bombarding them with false connection requests. In fact, DDoS and other types of illegitimate traffic (e.g., scanning activities) have become so common that they can rarely be considered anomalies in most networks anymore [11, 22].

When the term “outlier” comes into play, things become even more confusing. “Anomaly” and “outlier” are not strict synonyms, and even the definition of outlier can be ambiguous in practical implementations [39]. For instance, it is common to find small groups of close traffic instances that are distant from the data bulk. Together, they form an outlying cluster; individually, each instance may or may not be deemed an outlier. Despite such ambiguities, related research commonly assumes the meaning of anomaly without discussion. Carefully reviewing such works (and excluding time series analysis), the empirical meaning of anomaly inferred from experiments usually corresponds to network attacks that show outlierness. Some authors identify attacks with anomalies and perform their detection with outlier-based techniques [6, 18, 38]. Many outlier-based detection proposals also appear in surveys of related fields [5, 8, 25].

But do network attacks actually show themselves as outliers or outlying clusters? This is the crux that determines whether unsupervised methods are effective for attack detection. Related works take it for granted, but the question must be answered analytically, especially since most attacks are designed to pass unnoticed. As a starting point, we highly recommend that research on anomaly detection in NTA clearly establish its definition of anomaly. Otherwise, whenever theoretical proposals are implemented in real scenarios—far from lab conditions—such methods are prone to trigger detection alarms in response to many harmless, meaningless, noisy instances. This discussion is critical because unacceptably high false positive rates are precisely what slows down the adoption of machine learning in real-world network attack detectors [15, 17]. If this is true for supervised machine learning, it is even more severe for unsupervised methods, which are also commonly evaluated with the same Intrusion Detection System (IDS) datasets (e.g., [6, 18, 38]). Note that IDS datasets are usually not designed to match realistic ratios between normal and attack traffic, but to offer a variety of attack classes with sufficient representation in the dataset [16]. This is not ideal for unsupervised methods, which learn from the sample placement and space geometries drawn by the analyzed data. Setting aside the irrelevant, easy-to-detect illegitimate traffic that has become commonplace, it naturally follows that the real-world attack/non-attack ratio is considerably lower than in IDS datasets. Therefore, the probability that a detected anomaly is an actual attack drops dramatically. How the base-rate fallacy affects IDS was already pointed out by Axelsson in [4].
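As a purely illustrative example of this effect (the figures below are assumptions chosen for the sake of the argument, not measurements): consider a detector with a 99% true positive rate and a 1% false positive rate monitoring traffic in which only 1 out of every 1,000 flows is an attack. By Bayes' theorem,

$$\begin{aligned} P(\mathrm {attack} \mid \mathrm {alarm}) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.01 \cdot 0.999} \approx 0.09, \end{aligned}$$

i.e., roughly nine out of ten alarms would be false positives in spite of a seemingly excellent detector.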

The previous observations do not imply that unsupervised methods are not valid for attack detection; instead, they introduce the necessity of evaluating the outlierness of network attacks and of investigating whether unsupervised methods suffice by themselves for actual detection in real implementations. Note that signature-based detection and supervised approaches are limited in detecting novel threats and zero-day attacks; therefore, the contribution of unsupervised approaches is deemed highly valuable. A final challenge that unsupervised methods must additionally face is their traditionally high computational complexity. The most popular outlier detection algorithms are naturally instance-based and show considerable time and memory overheads [9, 30]. Network traffic analysis for attack detection must be fast and lightweight, since it must deal with ever-growing volumes of traffic (big data, streaming data) and is expected to react promptly when malicious instances are discovered.

The main contribution of this paper is answering the following questions:

  • Are network attacks outliers? We study five popular and recent space representations used in NTA security applications and experiment with five popular and recent unsupervised outlier detection algorithms in order to elucidate if network traffic attacks show a distinguishable outlier nature.

  • What are the most suitable feature representations for attack detection? We investigate which existing feature vectors perform best in conjunction with outlier detection.

  • Is the observed outlierness a sufficient indicator for implementing real-world attack detection? We discuss whether the detected outlierness suffices for implementing effective detectors in real environments. Additionally, we propose a new vector that maximizes attack/non-attack separation.

Unlike most papers that apply outlier detection in NTA, we do not use the KDD-Cup98, KDD-Cup99, or NSL-KDD datasets, which have long ceased to be representative. Moreover, such datasets use a set of ad-hoc features whose extraction is obscure, costly, and infeasible for modern lightweight detectors. Instead, our experiments are conducted on the CICIDS2017 dataset [34], which is one of the most complete, reliable IDS evaluation datasets to date. As for the selected features, we study five vector spaces created by the CAIA [36], Cisco-Joy [3], Consensus [14], TA [22], and AGM [21] formats. Outlierness ranks are obtained with five different algorithms: k-nearest neighbors [32], LOF [7], HBOS [19], Isolation Forest [27], and SDO [23]. Scripts and experiments are openly available for replication and reuse in [10].

2 Problem Spaces in NTA

When considering traffic at the network level, the possibilities for extracting features are immense. Irrespective of the specific features, NTA is mostly approached by constructing homogeneous vectors from different perspectives, leading to problem spaces where instances correspond to:

  • Packets, meaning the contents of every datagram exchanged between two network devices. This type of analysis allows DPI (deep packet inspection), which has become obsolete due to high data rates, privacy concerns, and encryption. Hence, packet-based analysis is computationally too demanding and unable to explore modern network traffic at reasonable cost.

  • Flows. The definition of a traffic flow given by IPFIX [1] is extremely flexible. A flow is defined “as a set of packets or frames passing an Observation Point in the network during a certain time interval. All packets belonging to a particular Flow have a set of common properties”, which can vary depending on the use case. We underline two special cases (a minimal sketch of both flow keys is given after this list): (a) Application-based flows. For the last three decades flows have been principally defined by the 5-tuple key [IP source, IP destination, source Port, destination Port, Protocol], which identifies the communication of a specific application between endpoints, e.g., a TCP connection. However, the use of the 5-tuple is not justified in terms of security; it is simply a remnant of network policies implemented in the 1990s that has become a de facto standard. A myriad of works in NTA for security assume the 5-tuple key (e.g., [24, 26, 35,36,37]). (b) Endpoint-based flows. Though still rare in the literature, a flow key can also be a 1-tuple, i.e., the endpoint address (either IP source or IP destination). In such cases, a flow summarizes the behavior of a single device (in its role of source or destination) for a defined observation window.

  • Aggregated flows. In this scenario, a set of features is aggregated over time to reveal the current status of the network as a whole (e.g., number of sent packets per second). Such an approach usually analyzes time series and is effective for quickly detecting attacks and events that have a strong impact on the network. However, this top-level perspective fails to capture subtle, more selective attacks or threats that aim at only one or a few destinations, or that show a slow propagation. Also, tracking back attack sources is a challenging task.
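To make the two flow perspectives concrete, the following Python sketch groups packets by a 5-tuple key (application-based flows) and by a 1-tuple source key (endpoint-based flows). The packet fields and values are illustrative assumptions and do not correspond to any specific extractor or dataset.

from collections import defaultdict

# Toy packets; the field names below are illustrative assumptions.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 51512,
     "dst_port": 443, "proto": "TCP", "bytes": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 51512,
     "dst_port": 443, "proto": "TCP", "bytes": 60},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.7", "src_port": 40000,
     "dst_port": 53, "proto": "UDP", "bytes": 80},
]

five_tuple_flows = defaultdict(list)   # application-based flows (5-tuple key)
one_tuple_flows = defaultdict(list)    # endpoint-based flows (source role)

for pkt in packets:
    key5 = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
            pkt["dst_port"], pkt["proto"])
    five_tuple_flows[key5].append(pkt)
    one_tuple_flows[pkt["src_ip"]].append(pkt)

# Simple per-flow aggregations: packet and byte counts per key.
for key, pkts in five_tuple_flows.items():
    print(key, len(pkts), sum(p["bytes"] for p in pkts))

The same aggregation logic, applied per endpoint over an observation window, yields the device-behavior profiles used by 1-tuple formats such as AGM, described below.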

Table 1. Studied NTA representations (feature vectors).

1: Removed from the analysis.    2: The Cisco-Joy tool can extract more features. We removed features that did not contain usable information in the CICIDS2017 dataset.

We discard the analysis of packets and aggregated flows for the reasons given above and focus on flows to capture application or device behavior (5-tuple and 1-tuple flow keys, respectively). We select a set of feature vectors that are popular in the NTA literature or have been recently proposed. They are:

  • CAIA vector. As coined in [28], we use CAIA to refer to the feature vector originally proposed by Williams et al. [36]. The same vector has been commonly applied (as defined or with minimal variations) in the context of NTA, specifically when using machine learning-based solutions, e.g., [26, 35, 37]. The original CAIA vector stores bidirectional information and consists of 22 features. We extended it to 30 features as in [26].

  • Consensus vector. In [14], a set of features for NTA is selected based on a meta-study of 71 of the most relevant and most cited papers in NTA. The study concludes with 12 relevant features. We extend them based on the considerations discussed in [14] and [28], obtaining a final 20-feature vector.

  • Cisco-Joy vector. Anderson et al. recently proposed this feature vector, which is able to discriminate attacks in supervised learning and is suitable for encrypted traffic [2, 3]. It contains 650 features and can be easily extracted by using the Cisco/Joy open tool, https://github.com/cisco/joy.

  • Time-Activity vector (TA). The Time-Activity vector [22] uses a 3/5-tuple key and is unidirectional. It was devised to profile flows from a time-behavioral perspective, allowing lightweight characterization of traffic by means of clustering methods. The final vector is formed by 13 features.

  • AGM vector. Designed for the discovery of patterns in the Internet Background Radiation [21], this vector allows profiling traffic sources or destinations. The basic AGM vector contains 22 features; we extend it by removing nominal features or transforming them into dummy variables when their distributions are concentrated on a few values (e.g., more than 90% of traffic uses TCP, UDP, or ICMP). The extended AGM vector is purely numerical.

The CAIA, Consensus, TA, and AGM vectors are compared in [28] for supervised attack detection with the UNSW-NB15 dataset [29]. Table 1 shows vector features in the format used here. We apply the nomenclature described in [28]. We refer the interested reader to the original papers for detailed descriptions.

Fig. 1. A quick overview of how the studied algorithms estimate the outlierness (\(o_a\)) of a random point a.

3 Outlier Detection Algorithms

In this section we briefly introduce the outlier detection algorithms used in the study (a visual overview of the different approaches is shown in Fig. 1):

kNN. The k-nearest neighbor distance (kNN) has been used for measuring object isolation in [32], where the outlierness of each instance is ranked based on the distance to its \(k^{th}\) nearest neighbor. kNN is an instance-based method where estimations are locally approximated. It does not require training, and the computational effort appears every time a new instance must be evaluated and compared with the previous ones. kNN requires setting the parameter k.
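As a minimal sketch of kNN-based outlierness scoring (synthetic data and an arbitrary k, not the parameterization used in our experiments):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),   # data bulk
               rng.normal(8, 1, size=(5, 4))])    # a few distant points

k = 10
# k+1 neighbors because each point is its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)

# Outlierness = distance to the k-th nearest neighbor (excluding the point itself).
outlierness = dist[:, k]
print(np.argsort(outlierness)[-5:])   # indices of the top-5 ranked outliers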

LOF. The Local Outlier Factor algorithm (LOF) entailed a considerable enhancement in the task of measuring instance outlierness within data [7], generating a varied family of derived algorithms [33]. LOF compares the density estimate (\(D_i\)) based on the k nearest neighbors with the density estimates of each of those neighbors, thus adapting to different local densities. LOF is also an instance-based method and does not require training. In a recent comparison, LOF was shown to be a good benchmark solution that, in general, has not been significantly outperformed by more recent methods in terms of accuracy [9]. LOF uses the \( MinPts \) parameter, which is equivalent to k in kNN.
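A comparable sketch with scikit-learn's LocalOutlierFactor, where n_neighbors plays the role of \( MinPts \) (again, the values are arbitrary and for illustration only):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.normal(8, 1, size=(5, 4))])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# negative_outlier_factor_ stores -LOF; negate so that larger means more outlying.
outlierness = -lof.negative_outlier_factor_
print(np.argsort(outlierness)[-5:])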

HBOS. Histogram-Based Outlier Score (HBOS) [19] is a simple, straightforward algorithm based on evaluating the empirical feature distributions of the analyzed dataset (i.e., histograms for continuous features and frequency tables for nominal features). Since it assumes feature independence, it sacrifices precision to achieve fast, linear-time performance. Outlierness is calculated from the relative position of the instance's feature values with regard to the obtained empirical distributions (hist). HBOS requires no parameterization other than the histogram binning, which allows static bin widths (k equal-width bins) or dynamic bin widths (every bin receives N/k instances, N being the total number of instances). In our experiments, bin widths are “static”.
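Because HBOS only needs per-feature histograms, the idea can be sketched in a few lines of NumPy. The version below uses static (equal-width) bins and is a simplified illustration, not the reference implementation of [19]:

import numpy as np

def hbos_scores(X, n_bins=10, eps=1e-9):
    """Simplified HBOS: sum of negative log histogram heights per feature."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        heights = counts / counts.max()              # normalize heights to [0, 1]
        # Assign each value to its bin via the interior bin edges.
        bins = np.digitize(X[:, j], edges[1:-1])
        scores += -np.log(heights[bins] + eps)       # rare bins -> high outlierness
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.normal(8, 1, size=(5, 4))])
print(np.argsort(hbos_scores(X))[-5:])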

iForest. Isolation Forest (iForest) [27] is a model-based outlier ranking method that shows linear time complexity with low memory requirements, even for large datasets. The operating principle is as follows: for a given instance, features and splits are randomly selected in a procedure that progressively reduces the range of feature values until the instance is isolated (i.e., it is the only instance in the remaining subspace). The number of splits defines the outlierness value of the instance, since outliers are expected to be easier to isolate (fewer splits) than inliers (more splits). The partitioning procedure can be abstracted as a tree (an iTree); an iForest therefore provides the weighted evaluation of a set of iTrees. During training, iTrees are built using the training dataset; in application phases, instances pass through the iTrees to obtain outlierness scores. The iForest parameters are: t, the number of estimators or iTrees; \(\psi \), the sample size used to train every iTree; and f, the number of features passed to each iTree.
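A minimal sketch with scikit-learn's IsolationForest; the parameter names map to the notation above (n_estimators ~ t, max_samples ~ \(\psi \), max_features ~ f), and the values are arbitrary, not those of Table 2:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.normal(8, 1, size=(5, 4))])

iforest = IsolationForest(n_estimators=100, max_samples=256,
                          max_features=4, random_state=0).fit(X)

# score_samples is higher for inliers; negate to obtain an outlierness rank.
outlierness = -iforest.score_samples(X)
print(np.argsort(outlierness)[-5:])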

SDO. The Sparse Data Observers (SDO) algorithm is a model-based unsupervised outlier ranking method designed to provide fast evaluations and to be embedded in autonomous frameworks [23]. SDO is conceived to avoid the bottlenecks typical of traditional instance-based outlier detection when a continuous evaluation of incoming instances is demanded. SDO creates a low-density data model by sampling a training population. During training, model instances—called observers (O)—are evaluated so that low-active observers are removed. Thus, the low-density model becomes free of potential outliers. In application phases, observers provide instance outlierness based on joint distance estimations. SDO is light, easy to tune, and makes the most of pre-knowledge. SDO parameters are intuitive and stable, and rule-of-thumb parameterization works well in most applications. The parameters are: k, the number of observers; x, the number of closest observers that evaluate every instance; and q (or qv), which establishes the threshold for the removal of low-active observers.
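The following NumPy sketch is our simplified reading of the procedure just described (sample k observers, drop low-active ones according to a quantile threshold, score instances by the median distance to their x closest observers); it is not the reference implementation of [23]:

import numpy as np
from scipy.spatial.distance import cdist

def sdo_fit(X_train, k=200, x=5, q=0.3, rng=None):
    """Train a simplified SDO model and return the retained observers."""
    if rng is None:
        rng = np.random.default_rng(0)
    observers = X_train[rng.choice(len(X_train), size=k, replace=False)]
    # Each training instance "activates" its x closest observers.
    idx = np.argsort(cdist(X_train, observers), axis=1)[:, :x]
    activity = np.bincount(idx.ravel(), minlength=k)
    # Remove low-active observers (likely outliers themselves).
    return observers[activity >= np.quantile(activity, q)]

def sdo_score(X, observers, x=5):
    """Outlierness = median distance to the x closest retained observers."""
    d = np.sort(cdist(X, observers), axis=1)[:, :x]
    return np.median(d, axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),
               rng.normal(8, 1, size=(5, 4))])
obs = sdo_fit(X, k=100, x=5, q=0.3, rng=rng)
print(np.argsort(sdo_score(X, obs, x=5))[-5:])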

4 Dataset

The CICIDS2017 dataset [34] was recently published by the Canadian Institute for Cybersecurity (CIC). The CIC has developed some of the most widely used IDS and IPS (Intrusion Prevention System) datasets in research over the last decades, including the NSL-KDD and ISCX series. The CICIDS2017 dataset fulfills the quality criteria that IDS/IPS datasets must meet in order to provide suitable evaluation tests. These criteria [16] require that the data be complete, realistic, representative, diverse, and heterogeneous in terms of protocols, attacks, and legitimate uses, as well as in formats and supporting metadata. In CICIDS2017, attack families are implemented according to the most common security threats reported by McAfee in 2016, including: Web-based, Brute force, DoS, DDoS, Infiltration, Heartbleed, Bot, and Scan. The CICIDS2017 dataset is available at: https://www.unb.ca/cic/datasets/ids-2017.html.

Table 2. Used parameters in the experiments.

5 Experiments

This section describes the conducted experiments. Henceforth, we refer to the set of feature formats as F and to the set of used algorithms as A:

$$\begin{aligned} F = \{\text {CAIA, Consensus, TA, Cisco-Joy, AGM}\} \end{aligned}$$
(1)
$$\begin{aligned} A = \{\text {kNN, LOF, HBOS, iForest, SDO}\} \end{aligned}$$
(2)

We describe the experiments step-by-step:

  • 1. Flow extraction

    From the CICIDS2017 pcaps, we extracted features to match the studied representations. Thus, for each vector format we obtained a structured dataset \(D_i\) of size \(M_i \times (N_i+2)\), where \(i \in F\), \(M_i\) is the respective number of instances (flows), and \(N_i+2\) is the respective number of features plus a binary label (attack, non-attack) and a multiclass label (attack family). Feature vectors were extracted with a Golang-based feature extractor (Footnote 1).

  • 2. Cleaning and normalization

    We removed nominal features from the preprocessed datasets (see Table 1), except for “Protocol”, which was transformed into the dummy variables “TCP”, “UDP”, “ICMP”, and “others”. Datasets were min-max normalized:

    $$\begin{aligned} Z_i= normalize \big (remove\_nominal(D_i) \big ) \end{aligned}$$
    (3)
  • 3. Stratified sampling

    Datasets were sampled and a 5% subset was drawn for hyperparameter search and tuning: \(Z_i'=strat\_sample_{.05}(Z_i)\), where \(i \in F\). The sampling process was stratified with respect to the multiclass labels to keep balanced distributions.

  • 4. Hyperparameter search

    For each vector format (\(i \in F\)) and algorithm (\(j \in A\)), hyperparameter search was conducted by means of evolutionary algorithms (Footnote 2):

    $$\begin{aligned} param_{i,j}= hyperparam\_search(Z_i',j) \end{aligned}$$
    (4)

    Obtained hyperparameters are shown in Table 2.

  • 5. Univariate analysis of outlierness ranks

    We split each \(Z_i\) dataset into a non-attack subset (\(Z_{in}\)) and an attack subset (\(Z_{ia}\)). Then, measures of central tendency and histograms over \(Z_{in}\) and \(Z_{ia}\) were extracted with each algorithm j.

  • 6. Analysis with outlier ranking metrics

    For each dataset \(Z_i\), the performance of each algorithm j was evaluated with the metrics defined in Sect. 6.

  • 7. Feature selection for maximizing outlierness

    Finally, the CAIA, Consensus, and AGM formats (i.e., the best ones in the previous experiments) were joined, vectors were extracted from the pcaps, and a 5% stratified sample was drawn, obtaining the final \(Z_F'\) dataset. By means of a forward wrapper with SDO as the nested algorithm (a schematic sketch of the wrapper is given after this list), features were gradually selected to find a set that maximizes the separation between attack and non-attack outlierness. ROC-AUC (Sect. 6) was selected as the optimization criterion. The obtained vector, named “OptOut” (from Optimized Outlierness), is shown in Table 4. Steps 4, 5, and 6 were repeated for the OptOut vector.
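As an illustration of the forward wrapper used in step 7, the sketch below greedily adds the feature that most improves the ROC-AUC of an outlier detector's scores. The nested score function is pluggable (in our experiments it is SDO; here a trivial distance-to-mean score stands in), and the code is a schematic reading of the procedure rather than the exact implementation available in [10]:

import numpy as np
from sklearn.metrics import roc_auc_score

def forward_selection(X, y, score_fn, max_features=10):
    """Greedy forward wrapper maximizing ROC-AUC of outlierness scores.

    X: (n, d) feature matrix; y: binary labels (1 = attack);
    score_fn: callable returning one outlierness score per instance.
    """
    selected, remaining, best_auc = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        aucs = {j: roc_auc_score(y, score_fn(X[:, selected + [j]]))
                for j in remaining}
        j_best = max(aucs, key=aucs.get)
        if aucs[j_best] <= best_auc:
            break                        # no candidate improves the criterion
        best_auc = aucs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_auc

# Toy usage with synthetic data and a placeholder score function.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = np.zeros(300, dtype=int)
y[:15] = 1
X[:15] += 4                              # offset the "attacks" from the bulk

def score(Z):
    return np.linalg.norm(Z - Z.mean(axis=0), axis=1)

print(forward_selection(X, y, score, max_features=3))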

6 Outlier Detection Evaluation Indices

For evaluating the algorithms, we use the same metrics applied by Campos et al. in their recent comparison of outlier detection algorithms [9], and refer the reader to that paper for further explanations of the performance indices. They are: P@n, precision at the top n ranks; AdjP@n, P@n adjusted for chance; AP, average precision; AdjAP, AP adjusted for chance; MaxF1, maximum F1 score [31]; AdjMF1, MaxF1 adjusted for chance; and ROC-AUC, the area under the ROC curve. The adjusted indices are based on the recommendations given by Hubert et al. [20]. Following Campos et al. [9], in our experiments P@n and AdjP@n define n as the number of labeled outliers in the dataset.
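For reference, P@n (with n set to the number of labeled outliers), a generic adjustment for chance, ROC-AUC, and AP can be computed as sketched below. The adjustment shown follows the common form (observed minus expected, divided by maximum minus expected); [9] and [20] give the exact definitions behind the values reported in our tables:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def precision_at_n(y, scores):
    """P@n with n = number of labeled outliers; y uses 1 for outliers/attacks."""
    n = int(np.sum(y))
    top_n = np.argsort(scores)[::-1][:n]        # indices of the n highest scores
    return float(np.mean(y[top_n]))

def adjusted_for_chance(value, expected, maximum=1.0):
    """Generic adjustment: (value - E[value]) / (max - E[value])."""
    return (value - expected) / (maximum - expected)

# Toy example: 5 outliers out of 100 instances, scored higher on average.
y = np.array([0] * 95 + [1] * 5)
scores = np.random.default_rng(0).normal(size=100)
scores[95:] += 3

p_at_n = precision_at_n(y, scores)
print(p_at_n,
      adjusted_for_chance(p_at_n, expected=y.mean()),
      roc_auc_score(y, scores),
      average_precision_score(y, scores))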

7 Results and Discussion

We proceed to show results and discuss the questions raised in the Introduction.

7.1 Are Network Attacks Outliers?

Figure 2 shows box plots obtained in the univariate analysis of outlierness ranks (step 5). For the sake of visibility, extreme values (top outliers) have been removed and outlierness ranks have been normalized. Upper and lower box boundaries correspond to the 75th and 25th percentiles respectively, whereas upper and lower whiskers correspond to the 95th and 5th percentiles. Additionally, we show some histograms in Fig. 3 (attack and non-attack empirical densities are equalized by normalizing the histograms). Four immediate observations stand out from the statistics:

Fig. 2. Box plots for outlierness ranks.

  (a) The differences between attack and non-attack instances in terms of outlierness for the Cisco-Joy vector are useless for discriminating attacks; note that the boxplots and distributions overlap or non-attacks show higher values.

  (b) Regardless of the used algorithm, as a general rule attacks show higher outlierness than non-attack instances when using the CAIA, TA, or AGM vectors, with AGM being the format that shows the largest differences.

  (c) Attack and non-attack outlierness ranges overlap significantly.

  (d) SDO shows the best performance, followed by HBOS.

The inability of the Cisco-Joy format to discriminate attacks based on outlierness (a) was expected, since this vector uses a considerably high-dimensional space with a majority of binary features, drawing an input space highly unsuitable for methods based on Euclidean metrics. On the other hand, the preponderance of SDO and HBOS (d), considered together with observations (b) and (c), suggests that network attacks tend to be global but clustered outliers, not local outliers. The spaces drawn by the feature vectors are highly noisy and rich in density variations, and such noise and multiple densities are mainly generated by legitimate traffic. Network attacks tend to form small clusters relatively far from the data bulk. Such conditions favor non-local, distance-based methods like HBOS and SDO. In any case, the significant range overlap (c) makes detection based solely on outlier ranking algorithms hardly suitable for real applications, in which high false positive rates would be unacceptable.

Fig. 3. Normalized histograms (top 5% outliers removed for better visualization).

7.2 What Are the Best Feature Vectors for the Task?

Table 3 shows the performance of the algorithms for each feature vector with the indices defined in Sect. 6. As for the algorithms, the evaluation measures corroborate the findings discussed in Sect. 7.1, confirming the prevalence of HBOS and SDO. On the other hand, it is noteworthy that the AGM vector shows high ROC-AUC and low values for the other indices, whereas CAIA and Consensus show low ROC-AUC but higher values for the other indices when compared with AGM. This fact suggests that, in the AGM case, most attacks show higher outlierness than most non-attack instances, but the top outlierness values still correspond to legitimate traffic. Conversely, in the CAIA and Consensus cases most attacks and most non-attacks show similar outlierness, but the top outlier positions are considerably taken by attacks (note that attacks in the dataset are negligible compared to normal instances). Such circumstances favor using the AGM vector to build a general-purpose detector, with CAIA or Consensus as a support detector for evaluating only extreme outlierness cases; more interestingly, they suggest that the vector formats are complementary and that a new feature vector maximizing attack outlierness can be built from them.

Table 3. Algorithm performances.

7.3 Can We Improve Vectors and Use Them in Real Detection?

Results in Table 3 show that the studied vectors would generate many false positives in real-world applications. As described in Sect. 5, we constructed a feature vector, OptOut, that maximizes the separation between attack and non-attack outlierness. OptOut uses the 5-tuple key, but enriched with features that describe the behavior of the network device as an information source; instances therefore profile application-based and endpoint-based behavior at the same time. Table 4 shows the features included in the OptOut vector and Fig. 4 the forward selection process. We performed hyperparameter search also for this vector and obtained the following values: kNN, \(k=15\); LOF, \( MinPts =50\); HBOS, \(k=22\); iForest, \(t=50, f=4, \psi =456\); SDO, \(k=241, x=25\), \(qv=0.35\). Some histograms are shown in Fig. 5.

Fig. 4. OptOut forward selection process.

The resulting outlierness box plots are shown in Fig. 2, histograms in Fig. 5, and performance indices in Table 5. The results disclose that the OptOut vector considerably increases performance and, therefore, the capability of algorithms to discriminate attacks based on outlierness (particularly when using SDO). However, real-world detection demands high accuracy to minimize the proliferation of false positives. Attack detection based on unsupervised algorithms can hardly solve the problem alone, but its combination with supervised methods and techniques that leverage pre-knowledge is expected to yield detection frameworks with highly effective performance.

Fig. 5. OptOut vector: normalized histograms (top 5% outliers removed for better visualization).

Table 4. OptOut feature vector after forward selection (SDO nested).
Table 5. Algorithm performances for the OptOut feature vector.

8 Conclusions

In this work we have addressed three relevant questions about network attacks, namely: (a) whether they actually are outliers, (b) what the most suitable algorithms and feature vectors for implementing outlierness-based detectors are, and (c) whether attack outlierness is enough for implementing real-world detection. We have studied these questions from an analytical perspective by evaluating five different feature vectors used in the literature with five different outlier ranking algorithms. For our experiments we have used a dataset for intrusion detection evaluation that reflects modern attacks as well as legitimate behavior profiles.

The conducted experiments reveal that, as a general rule, network attacks show higher global distance-based outlierness averages than normal traffic. Given the characteristics of network feature spaces—noisy, highly varied, with normal instances covering a broad spectrum and drawing subspaces with many density differences—local algorithms show low performance for attack detection. Algorithms with a more global interpretation of the space—like SDO or HBOS—tend to perform better, especially when the representation spaces capture the behavior of network devices and hosts (e.g., the AGM format). We have proposed a feature space that maximizes the separation of attacks and non-attacks in terms of outlierness; however, the risk of high false positive rates still prevails due to the base-rate fallacy inherent to network security spaces. Outlier detection algorithms can be a powerful tool for detecting known and novel attacks, but leveraging pre-knowledge with supervised methods should not be omitted, since supervised and unsupervised methods are complementary and, together, can build highly refined solutions.