
1 Introduction

Botnets are among the most significant threats to Internet security. Modern botnets keep evolving and are composed not only of compromised computers but also of a large variety of IoT devices, including smartphones, IP cameras, routers, printers, DVRs, and so on. With enormous cumulative bandwidth and computing capability, botnets have become the most important and powerful tool for cheaper and faster large-scale network attacks on the Internet [1].

According to an AV-Test report [2], over 390,000 new malware samples are detected every day on average. This enormous volume of new malware variants renders manual analysis inefficient and time-consuming. Machine learning has therefore been widely deployed as a core component of botnet detection systems [3,4,5,6] and has achieved good detection results.

However, with financial motivation, attackers keep evolving their evasion techniques to bypass machine learning detection. More and more well-crafted botnets exploit concept drift, a vulnerability of machine learning, to accelerate the decay of detection models. Machine learning algorithms assume that the underlying botnet data distribution is stable across the training and testing datasets. Well-crafted concept drift attacks gradually and stealthily introduce changes into the malware data distribution to mislead machine learning models, for example through new communication channels [7,8,9,10,11], mimicry attacks [12, 13], gradient descent attacks [12, 13], and poisoning attacks [14]. Building change-resistant and self-renewing learning models against such advanced evasion techniques is therefore very important for botnet detection systems.

Existing solutions use passive, periodic model retraining to mitigate concept drift attacks. However, the interval between two retrainings is hard to choose: frequent retraining is inefficient, while infrequent retraining leads to untrusted predictions in some periods. Supervised retraining also requires manual labelling of all new samples, and this labelling is based on traditional coarse-grained, fixed thresholds, which are not sensitive to hidden and gradual changes in the underlying data distribution. The confidential values of a stationary learning model, such as its parameters and thresholds, are critical for detection performance. If the learning model's algorithm and parameters are stolen by adversaries [15], retraining is no longer useful against sudden concept drift attacks, which can be crafted quickly and easily with full knowledge of the detection model.

In this paper, we present an active and dynamic botnet detection approach that enhances the traditional horizontal correlation detection model. Compared to traditional models, our model can actively detect hidden concept drift attacks and dynamically evolve to track the trend of the latest botnet concept.

In particular, this paper makes the following contributions:

  • To the best of our knowledge, we are the first to present an active and dynamic learning approach for botnet detection that actively tracks the trend of hidden botnet concept drift and accordingly evolves the learning model dynamically to mitigate model aging.

  • We extend the traditional passive decision method, which relies on a coarse-grained threshold to check whether a bottom line is crossed. In contrast, we introduce fine-grained p-values as an indicator to actively identify hidden concept drift before detection performance starts to degrade.

  • The confidential values of traditional detection models, such as model parameters, are fixed, so retraining is the only way to combat concept drift attacks. We introduce DRIFT assessment and feature reweighting to dynamically tune model parameters to follow the trend of the current botnet concept.

The remainder of this paper is organized as follows: In Sect. 2, we review related work. Section 3 presents the architecture of our active and dynamic botnet detection approach and describes each component. Section 4 presents the experiments performed to assess the recognition of concept drift in the underlying data distribution and model self-renewal. In Sect. 5, we discuss limitations and future work, and in Sect. 6 we summarize our results.

2 Related Works

Machine learning (ML) is now widely used as a core component of botnet detection systems. ML assumes that the underlying data distribution is stable for both the training dataset and the testing dataset. By exploiting this assumption, many well-crafted evasion approaches, known as concept drift attacks, have been proposed to evade or mislead ML models [16]. As shown in Fig. 1, every step of the ML process is a potential part of the concept drift attack surface. With different levels of knowledge of the target ML system, attackers can launch various concept drift attacks [17]. Arce [18] pointed out that machine learning itself can be the weakest link in the security chain.

Fig. 1. Machine learning process and corresponding concept drift attack surface

Botnet attackers have begun to exploit many stealthy communication channels that are beyond the scope of ML data collection, such as social networks [10, 11], email protocols [19], SMS [7], and Bluetooth [8]. Erhan et al. [11] proposed a social-network-based botnet that abuses trusted popular websites, such as twitter.com, as C&C servers. Kapil et al. [19] evaluated the viability of using harmless-looking emails to deliver botnet C&C messages. These new stealthy channels always involve highly trusted and very popular websites or heavily used email servers, which have excellent reputations and are whitelisted by most detection systems. The exploited websites or email servers also carry very large volumes of normal traffic, so light-weight, occasional botnet traffic is unlikely to be noticed.

Mimicry attacks refer to techniques that mimic benign behaviors to reduce the differentiation between malicious and benign events. Wagner and Soto [20] demonstrated a mimicry attack against a host-based IDS that mimicked legitimate sequences of system calls. Šrndic and Laskov [17] presented a mimicry attack against PDFRate [21], a system that detects malicious PDF files based on a random forest classifier.

Gradient descent is an optimization process that iteratively minimizes the distance between malicious points and benign points. Šrndic and Laskov [12] applied a gradient descent-kernel density estimation attack against the PDFRate system, which uses SVM and random forest classifiers. Biggio et al. [13] demonstrated a gradient descent attack against an SVM classifier and a neural network.

Poisoning attacks work by introducing carefully crafted noise into the training data. Biggio et al. [14] proposed poisoning attacks that merge the benign and malicious clusters, making the learning model unusable.

Botnet problems are therefore not stable but change over time. Machine learning based botnet detectors are designed under the assumption that the training and testing data follow the same distribution, which makes them vulnerable to the concept drift problem, in which the underlying data distribution changes with time. One concept drift mitigation approach is to recognize and react to recent concept changes before the model ages. Demontis et al. [14] proposed an adversary-aware approach that proactively anticipates the attacker. Deo et al. [22] presented a probabilistic predictor to assess the underlying classifier and retrain the model when it recognizes concept drift. Transcend [23] is a framework to identify model aging in vivo during deployment, before performance starts to degrade. In this paper we present an active and dynamic botnet detection approach that can actively detect the trend of hidden concept drift attacks and dynamically evolve the learning model to mitigate model aging.

3 Active and Dynamic Botnet Learning Approach

Driven by financial motivation, malware authors keep evolving malware perpetually, using various advanced evasion techniques to evade detection, especially to bypass widely deployed learning-based models. Many learning-based detection models calculate a score for a newly arriving sample that describes its relationship to the known malware samples. Detectors then compare this score with a fixed, empirical threshold to decide whether the sample is malicious. The threshold usually fits the old training dataset very well, and may even overfit it, but performance degrades over time on new, ever-changing malicious data. In this paper, we propose an active and dynamic learning approach that tracks the trend of the underlying botnet concept drift and renews the learning model to mitigate model aging.

Figure 2 depicts the framework of the active and dynamic botnet detection approach, which includes five components: the non-conformity measure (NCM), conformal learning, concept drift detection, model assessment, and self-renewal. The non-conformity measure is the core of the botnet detection system; it quantifies how different a given sample is from the known botnet samples. In this paper, we select the horizontal correlation classifier BotFinder [24] as the NCM. The conformal learning component uses p-values to carry out statistical analysis based on NCM scores. P-values are more fine-grained than a threshold and can be used to observe the gradual decay of the detection model. The concept drift recognition component uses the average p-value (APV) algorithm to detect concept drift in the malware data distribution between two different time windows. The model assessment component applies the DRIFT algorithm to locate the features affected by the identified concept drift. The self-renewal component dynamically adjusts the weights of predictive features to track the current botnet concept.

Fig. 2. The framework of active and dynamic botnet learning approach

3.1 Horizontal Correlation Classifier

In general, horizontal correlation techniques focus on common behaviors among a set of hosts, and use clustering and classification algorithms to build detection models that recognize infected machines or suspicious behaviors. In this section, we introduce BotFinder into our approach as the underlying NCM. BotFinder includes four parts: training dataset selection, preprocessing, feature extraction, and modeling with a machine learning algorithm.

Training Dataset. The training dataset selection directly affects the quality of the detection model. The CTU botnet capture dataset is stored in files using the binetflow format, in which each row represents a network behavior and each column is a behavior feature. Depending on the granularity, network behaviors can be abstracted to different levels, such as packets, netflows, traces, and hosts. In this work, the netflow is the basic data unit of the training datasets, and we abstract netflows into traffic traces by grouping together the netflows that share the same source IP address, destination IP address, destination port, and protocol, as sketched below.
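As a concrete illustration of this grouping step, the following is a minimal sketch assuming the usual binetflow column names (StartTime, SrcAddr, DstAddr, Dport, Proto); it is not the project's actual preprocessing code, and the column names may need adjusting for a given capture.

```python
import pandas as pd

def netflows_to_traces(binetflow_path):
    """Group netflows into traffic traces keyed by
    (source IP, destination IP, destination port, protocol)."""
    flows = pd.read_csv(binetflow_path)
    trace_key = ["SrcAddr", "DstAddr", "Dport", "Proto"]
    return {
        key: group.sort_values("StartTime")  # keep netflows in chronological order
        for key, group in flows.groupby(trace_key)
    }
```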

Preprocessing. Before the training phase starts, we preprocess the data by filtering out noise and transforming the features, scaling each feature to a given range. In this work, the range of each feature on the training dataset is set to between 0 and 1 at initialization time. To make the data cleaner and more usable, we filter the datasets by whitelisting common Internet services, such as Microsoft Update and Google, as well as known online movie and music traffic identified by its communication pattern.
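A minimal sketch of this preprocessing step is shown below, using scikit-learn's MinMaxScaler for the [0, 1] scaling; the whitelist entries are illustrative placeholders, not the actual whitelist used in our system.

```python
from sklearn.preprocessing import MinMaxScaler

# Illustrative whitelist; the real system whitelists common services such as
# Microsoft Update and Google by their communication patterns.
WHITELIST = {"update.microsoft.com", "www.google.com"}

def preprocess(feature_vectors, destinations):
    """Drop whitelisted traffic and scale every feature into [0, 1]."""
    kept = [f for f, dst in zip(feature_vectors, destinations)
            if dst not in WHITELIST]
    scaler = MinMaxScaler(feature_range=(0, 1))  # range fixed at initialization
    return scaler.fit_transform(kept), scaler    # keep the scaler for test data
```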

Feature Extraction. After trace generation, we perform a statistical analysis of the traces consisting of netflows. All extracted features are represented as floats in the range 0 to 1 and are grouped into two feature sets, as listed in Table 1.

Table 1. Overview of feature sets

Modeling with Machine Learning Algorithm. The horizontal correlation classifier BotFinder is a detection method that does not require deep packet inspection. BotFinder uses the CLUES algorithm to cluster similar traces of a botnet family and builds a detection model for each class of the family. This method can effectively identify the similarity of botnet network traffic across different malware variants and gives a prediction based on an optimal threshold fitted to the training dataset.

3.2 Non-conformity Measures

Many machine learning algorithms are in fact scoring classifiers: when trained on a set of observations and fed a test object x, they calculate a prediction score s(x) via a scoring function. The input of the NCM is a set of known samples and an unknown sample, and the output is a score that describes the similarity or dissimilarity of the unknown sample to the known sample set. Therefore, any scoring classifier that uses a fixed, empirical threshold can be introduced into our approach as an underlying NCM. In this work, we select BotFinder as the NCM. BotFinder selects time-related features and traffic volume features to build a detection model from the horizontal perspective. BotFinder is itself a scoring classifier: when trained on a set of observations and fed a test object x, it calculates a prediction score botfinder(x), and botfinder() is used as the underlying NCM in our approach.
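The NCM interface can therefore be sketched as a thin wrapper around any scoring function; in the sketch below, botfinder_score is a hypothetical stand-in for the BotFinder scoring function rather than its real API.

```python
def make_ncm(scoring_fn):
    """Wrap a scoring classifier (known samples, new sample) -> score
    as a non-conformity measure: larger output = more estranged."""
    def ncm(known_samples, new_sample):
        # If the scorer returns higher values for samples closer to the known
        # set, negate it so the NCM grows with dissimilarity.
        return -scoring_fn(known_samples, new_sample)
    return ncm

# usage (botfinder_score is assumed, not part of BotFinder's public API):
# ncm = make_ncm(botfinder_score)
```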

3.3 P-Value

Once a non-conformity measure is selected, the conformal predictor computes a p-value \(p_{z^*}\) for a new object \(z^*\), which in essence represents the fraction of objects in \(\left\{ x \in C, \forall C \in \mathbb {D}\right\} \) (i.e., the whole dataset) that are equally or more estranged from C than \(z^*\); the result is a number between 0 and 1. The algorithm is shown in Algorithm 1.

Algorithm 1.

The p-value measures the fraction of objects within \(\mathbb {D}\) that are at least as different from a class C as the new object \(z^*\). For instance, if C represents the set of botnet traces, a high p-value \(p_{z^*}\) means that a significant portion of the objects in this set are more different from C than \(z^*\) is. In other words, \(z^*\) is more similar to these botnet traces than many objects that are already labelled as botnet. Therefore, a prediction based on a high p-value has high credibility. P-values are directly involved in our discussion of concept drift recognition.
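Since Algorithm 1 is not reproduced here, the following is a minimal sketch of the p-value computation described above; `ncm` is any non-conformity measure, such as the negated BotFinder score.

```python
def p_value(dataset, class_C, new_object, ncm):
    """Fraction of objects in the whole dataset that are at least as
    non-conforming to class C as the new object z*."""
    alpha_star = ncm(class_C, new_object)           # non-conformity of z*
    alphas = [ncm(class_C, z) for z in dataset]     # non-conformity of every z in D
    stranger = sum(1 for a in alphas if a >= alpha_star)
    return stranger / len(alphas)                   # a value in (0, 1]
```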

3.4 Concept Drift Identification

We use the average p-value (APV) algorithm and a drift rating function to calculate a concept drift score (CDS) for each time window, which is used to recognize concept drift attacks, as shown in Fig. 3.

Fig. 3. The conformal learning component calculates the APV for each time window

Firstly, to visualize the underlying botnet data distribution, we select the tSNE [25] algorithm for dimensionality reduction. tSNE visualizes high-dimensional datasets by mapping high-dimensional points into two or three dimensions while preserving the distance structure, so that points close to each other in the high-dimensional space remain close in the low-dimensional space. Secondly, we calculate the p-value of each botnet trace to see the significance level of this trace within its family traces. Because most botnet families have a large number of traces, for efficiency we split the tSNE space into small \(n \times n\) grids. For each grid, we calculate its APV, which is the average p-value of all botnet traces belonging to this grid, as shown in Fig. 4. We group the botnet traces into different time windows according to their timestamps. After calculating the APV of each grid, we judge the concept drift in each time window using a drift rating function. In the tSNE space, there are common grids shared by botnet traces in different time windows, and exclusive grids that are occupied by botnet traces of only one time window. We use the fraction of the sum of the APVs of exclusive grids over the sum of the APVs of all grids in one time window to represent the drift rating in this time window, as shown in Eq. 1.

$$\begin{aligned} \text {Concept Drift Score} = \frac{\sum {APV_{exclusive}}}{\sum {APV_{common}}+\sum {APV_{exclusive}}} \end{aligned}$$
(1)

The change in concept drift score (CDS) between different time windows reflects the change in the underlying botnet data distribution over time, and can identify gradual, moderate drift.
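Equation 1 can be computed directly from the grid APVs; a minimal sketch (assuming grid cells are represented as hashable coordinates and their APVs are precomputed) is:

```python
def concept_drift_score(grid_apv, grids_this_window, grids_other_window):
    """CDS of one time window per Eq. 1: APV mass of grids exclusive to this
    window divided by the APV mass of all grids the window occupies."""
    exclusive = grids_this_window - grids_other_window   # only in this window
    common = grids_this_window & grids_other_window      # shared by both windows
    apv_exclusive = sum(grid_apv[g] for g in exclusive)
    apv_common = sum(grid_apv[g] for g in common)
    return apv_exclusive / (apv_common + apv_exclusive)
```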

If the CDS in the latest time window increases, the current concept of the underlying botnet data distribution differs from the old concept learnt in the previous time window, indicating that the detection model is suffering from a concept drift attack. However, the decay of threshold-based detection performance may not be observed immediately when concept drift is found: only when the variation of the underlying data distribution exceeds the boundary of the threshold does the detection model start to make poor decisions. If the CDS does not increase in the new time window, the distribution of botnet traces in the current time window has no significant concept drift.

3.5 Model Assessment

When concept drift is found in the latest time window, we use the DRIFT algorithm to evaluate the contribution of each feature in the current window and identify the features affected by the concept drift, as shown in Algorithm 2. DRIFT[i] represents the effect of the \(i^{th}\) feature on the average distance between two botnet traces in different time windows. If DRIFT[i] increases, the concept drift affects the \(i^{th}\) feature.

Algorithm 2.
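Because Algorithm 2 is not reproduced here, the sketch below shows only one plausible realization of the DRIFT score: each feature's share of the average per-feature distance between the traces of the two time windows. This is an assumption made for illustration, not the authors' exact algorithm.

```python
import numpy as np

def drift_scores(tw1_features, tw2_features):
    """tw1_features, tw2_features: arrays of shape (n_traces, n_features),
    with features already scaled to [0, 1]."""
    tw1 = np.asarray(tw1_features)
    tw2 = np.asarray(tw2_features)
    # mean absolute per-feature distance over all cross-window trace pairs
    per_feature = np.abs(tw1[:, None, :] - tw2[None, :, :]).mean(axis=(0, 1))
    return per_feature / per_feature.sum()   # normalize so the scores sum to 1
```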

3.6 Model Self Renewal

When concept drift is recognized, we reweight the affected features according to their DRIFT scores to dynamically update the model before the drift accumulates into radical drift. The formula for calculating a new weight from a DRIFT score is:

$$\begin{aligned} W_i= {\left\{ \begin{array}{ll} 3 \times (1-\sqrt{DRIFT[i]}) &{} \quad \text {if} \quad DRIFT[i] \le 0.05 \\ 2 \times (1-\sqrt{DRIFT[i]}) &{} \quad \text {if} \quad 0.05 < DRIFT[i] < 0.1 \\ (1-\sqrt{DRIFT[i]}) &{} \quad \text {if} \quad DRIFT[i] \ge 0.1 \end{array}\right. } \end{aligned}$$
(2)

By updating the weights, we reduce the weight of features that are significantly influenced by the concept drift attack and increase the weight of features that reflect the current botnet concept well.
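Equation 2 translates directly into a small helper; a minimal sketch:

```python
import math

def reweight(drift_scores):
    """Compute new feature weights from DRIFT scores per Eq. 2."""
    weights = []
    for d in drift_scores:
        base = 1.0 - math.sqrt(d)
        if d <= 0.05:
            weights.append(3 * base)   # barely drifted feature: boost strongly
        elif d < 0.1:
            weights.append(2 * base)   # moderately drifted feature
        else:
            weights.append(base)       # heavily drifted feature: base weight only
    return weights
```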

4 Experiment

4.1 Botnet Dataset

In this paper, we use the public CTU botnet datasets, provided by the Malware Capture Facility project (Footnote 1), for our experiments. The project captures long-lived real botnet traffic and generates labeled netflow files that are public for malware research. The traffic dataset spans from 2011 to the present. We aim to recognize the concept drift between different variants of the same family, and select 6 botnet families that have more than 2 variants for this experiment, as shown in Table 2. All file names of CTU botnet captures share the prefix “CTU-Malware-Capture-Botnet”, and each capture has a unique suffix; only the suffixes are listed in Table 2 to save space. Each family has multiple variants, and the capture times of the variants and the time span of each family differ.

Table 2. The selected botnet families in CTU malware dataset

4.2 Active Concept Drift Recognition

We cut the time span of each family into 2 time windows, \(tw_1\) and \(tw_2\), so there are 12 time windows in total, with each botnet family having 2 disjoint time windows. Following the time order of the variants, the \(tw_2\) of each family contains only its latest variant, while all other variants are grouped into \(tw_1\). To recognize the hidden concept drift between the time windows, we take three steps: first, visualize the botnet data distribution in a two-dimensional figure; second, split the two-dimensional space into small grids and compute the significance levels of all grids; third, calculate the concept drift score using the botnet data distribution and significance levels.

The data distribution of each family is shown in Fig. 4. As in Sect. 3.4, we use the tSNE [25] algorithm for dimensionality reduction, mapping the high-dimensional trace features into two dimensions while preserving the distance structure.
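A minimal sketch of this projection step using scikit-learn's TSNE is shown below; the hyperparameters are illustrative defaults, not the values used in our experiments.

```python
from sklearn.manifold import TSNE

def project_2d(trace_features, random_state=0):
    """Project scaled trace features of one family into two dimensions
    for the grid analysis shown in Fig. 4."""
    tsne = TSNE(n_components=2, random_state=random_state)
    return tsne.fit_transform(trace_features)   # shape: (n_traces, 2)
```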

We split the 2-dimensional tSNE space into \(30 \times 30\) grids, so there are 900 grids in total for each family, as shown in Fig. 4, and we calculate the average p-value (APV) of every grid. The APV represents the significance of each grid: if a grid has a high APV, it is important for describing the characteristics of this botnet family. In Fig. 4, squares denote grids common to both time windows, triangles represent grids occupied only in \(tw_1\), and circles denote grids that belong only to \(tw_2\).

Fig. 4. The drift of data distribution and significance levels of each family.

According to the differences in data distribution and significance levels between the traces in the two time windows, we calculate the concept drift score (CDS) for the traces in \(tw_2\). The CDS ranges between 0 and 1: 0 means there is no concept drift between the 2 time windows, while 1 represents sudden drift in which no grid is shared between the two time windows.

From Fig. 5, we can see that the CDS of the OpenCany family is 0.018, which means there is almost no concept drift in the latest time window. The CDSs of the Dynamer, Taobao, and Cridex families are 0.121, 0.216, and 0.238, which indicate gradual, moderate concept drift in the latest time windows. For the Yakes and Dridex families, the CDSs are 0.994 and 0.998, which indicate radical concept drift in the latest time windows.

4.3 Dynamic Model Evolution

After recognizing concept drift, we assess the effect of this concept drift on each predictive feature and then dynamically calculate new weights for all features to track the trend of the underlying concept drift. In this paper, we use the DRIFT algorithm to assess the effect of concept drift on each predictive feature. The DRIFT score represents the distance between the observed traces in \(tw_2\) and the traces in \(tw_1\). Using the DRIFT scores from Algorithm 2 and the reweighting function in Eq. 2, we update the weights of all features, as shown in Tables 3 and 4.

Table 3. The feature DRIFT scores and new weights of Dridex
Table 4. The feature DRIFT scores and new weights of Yakes

Figure 6 shows the changes of the time window APVs. Note that the time window APV is different from the grid APV: the time window APV is the average p-value of all traces captured in a time window, while the grid APV is the average p-value of the traces in a small grid. After feature reweighting, the latest time window APVs of the Yakes and Dridex families increase dramatically, which means the latest concept becomes more consistent with the previous concept. Note that the real underlying botnet data distribution does not change; only the model's observation perspective changes. From the new perspective, the new botnet variant looks more similar to the known variants.

Fig. 5. The concept drift scores of each family.

Fig. 6. The APV of time windows before and after feature reweighting.

5 Discussion

Machine learning is widely used as a core component in advanced botnet detection systems. However, machine learning is not a panacea; it suffers from advanced concept drift attacks. Concept drift attacks exploit the vulnerable assumption of machine learning that the underlying data distribution is stable for both the training and testing datasets. There are various advanced concept drift attacks, such as mimicry attacks, gradient descent attacks, and poisoning attacks. Such attacks change the underlying data distribution to make the botnet concept appear different from the machine learning observation perspective: the new botnet concept merely looks different, but the essential botnet behaviors are still the same.

To mitigate botnet concept drift attacks and build a sustainable botnet detection model, we can make efforts in the following directions: dynamic feature selection, dynamic reweighting, and ensemble learning. The goal of dynamic feature selection is to dynamically choose features relevant to the current botnet concept. Before changing the feature selection, we should gain deep insight into the trend of hidden botnet concept drift and assess the contribution of each feature to this new trend. In this paper, we propose the concept drift score to identify hidden drift, and the DRIFT function to assess the contribution of each feature to the new trend of concept drift. Dynamic reweighting handles botnet concept drift by changing the feature weights in the learning model so that it dynamically fits the current botnet concept. Note that dynamic reweighting may cause overfitting; in this paper, we use a piecewise reweighting function that calculates new weights using different sub-functions according to the DRIFT scores. Ensemble learning maintains a set of learning models that observe the botnet concept from diverse perspectives. Because ensemble learning has multiple botnet concept descriptions, it is robust to hidden concept drift and increases the complexity of botnet evasion.

In this work, we use concept drift scores and the DRIFT function to identify hidden concept drift and to reweight features dynamically. However, we only have one botnet description, based on the horizontal correlation of network traces. The concept drift scores and DRIFT function are agnostic to the underlying learning algorithm, making our approach versatile and compatible with multiple ML algorithms. Our approach can not only use the horizontal correlation BotFinder classifier, but can also be applied on top of any other botnet classification or clustering algorithm that uses a numeric score for prediction. In the future, we intend to introduce more botnet concept descriptions to build an ensemble botnet learning model.

6 Conclusions and Future Work

The botnet threat is fundamentally different from domains such as optical character recognition, speech recognition, and bioinformatics, whose concept descriptions can remain stable for many years. Driven by financial motivation, botnets keep evolving perpetually and introduce well-crafted concept drift to evade detection. To build a sustainable and secure learning model, we need to quickly recognize and react to the concept drift of the underlying botnet data distribution. In this paper, we proposed a novel botnet detection approach: based on the matching scores provided by the BotFinder classifier, we use concept drift scores and the DRIFT function to identify hidden concept drift and reweight features dynamically. To the best of our knowledge, we are the first to implement an active and dynamic learning approach for botnet detection that actively tracks the trend of hidden botnet concept drift and accordingly evolves the learning model dynamically to fit the latest botnet concept.

Our approach is agnostic to the underlying algorithm, making it compatible with most botnet classification or clustering algorithms that use a score for prediction. In the future, we will integrate more diverse scoring botnet classifiers into this approach, such as vertical correlation classifiers, to enrich the botnet concept description and learn about botnets from more diverse perspectives. We also plan to improve the efficiency of this approach by introducing a sliding window to learn the latest concepts online and remove aging data dynamically.