1 Introduction

Starting with the use of naive Bayes classifiers for spam detection [1], machine learning has been increasingly applied to solve core security problems. For instance, anomaly detection creates a model of expected behavior in order to detect network intrusions or other instances of malicious activities [35]. Classification with machine learning is also applied to automate the detection of unwanted software like malware [29], or to automate source code analysis [33].

Deep neural networks (DNNs) are now also used in security-critical applications such as malware detection [6, 31]. While the benefits of applying DNNs are undisputed, previous work has shown that, as is the case for many machine learning models, they lack robustness to adversarially crafted inputs known as adversarial examples. These inputs are derived from legitimate inputs by adding carefully chosen perturbations that force models to output erroneous predictions [9, 25, 36].

To evaluate the applicability of adversarial examples to a core security problem, we chose the setting of malware detection. In contrast to the task of image classification, the span of acceptable perturbations is greatly reduced: the model input is now a set of features taking discrete values, so acceptable perturbations must correspond exactly to one of these discrete values. Furthermore, the similarity criterion defined by human perception is replaced by the more challenging requirement that perturbations do not jeopardize the malicious functionality pursued by the adversary.

In this paper, we show that Android malware detection using neural networks, with performance comparable to the state of the art, is easy to deceive with adversarial examples. Furthermore, we find that hardening the model to increase its robustness to these attacks is a very difficult task. Our attack elaborates on an adversarial example crafting algorithm previously introduced in [25] and thus generalizes to any malware detection system using a differentiable classification function.

Contributions. We expand the method originally proposed by Papernot et al. [25] to attack Android malware detection. We adapt it to handle binary features while preserving the app's malicious functionality.

Applying the attack, we are able to mislead our best-performing malware detector (trained on the DREBIN data set [2]) at rates higher than 63%.

As a second contribution, we investigate potential defense mechanisms for hardening malware detection models trained using DNNs.

We consider defensive distillation [27] and adversarial training [9, 36]. The findings of our experimental evaluation of these mechanisms are twofold. Applying defensive distillation reduces the rates at which adversarial examples are misclassified, but the observed improvement is often negligible. In comparison, intentionally training the model with adversarially crafted malware applications improves its robustness, as long as the perturbations introduced during adversarial training are carefully chosen.

2 Background

In this section, we explain the general concepts used in this paper. We first give a short introduction to malware detection. Afterwards, we move to the machine learning algorithm we apply, neural networks. Subsequently, we discuss adversarial machine learning with a focus on neural networks. We end the section by briefly reviewing defenses that have been proposed so far.

2.1 Malware Detection

Due to the increasing number of published programs and applications, malware detection has become an important application of machine learning. The quality of detection, however, depends heavily on the provided features. The literature generally differentiates two types of such features: static and dynamic. Static features can be collected directly from the application's code and include, for example, n-gram frequencies in the code, opcode usage, or control flow graph properties. Dynamic features, nowadays the more popular category, are sampled from the application at runtime by observing general behavior, access, and communication patterns.

As an example of an approach combining static and dynamic analysis, we mention Marvin [20], which extracts features from an application while running it in an analysis sandbox, observing data flow, network behavior, and other operations. This approach detects malicious applications with an accuracy of 98.24% at less than 0.04% false positives.

In malware detection, not only accuracy but also the false positive and false negative rates matter: classifying malware as benign (a false negative) might lead to a loss of trust by the users, whereas false positives might lead to great financial loss for companies whose benign applications get classified as malware.

2.2 Neural Networks

We now take a detailed look at neural networks and introduce the required notation and definitions. Neural networks consist of elementary computing units—named neurons—organized in interconnected layers. Each neuron applies an activation function to its input to produce an output. Figure 1 illustrates the general structure of the network used throughout this paper and also introduces the notation used here.

Fig. 1. The structure of a deep feed-forward neural network as used in our setting.

Starting with the model input, each network layer produces an output used as input by the next layer. Networks with a single intermediate—hidden—layer are qualified as shallow neural networks whereas models with multiple hidden layers are deep neural networks. Using multiple hidden layers is interpreted as hierarchically extracting representations from the input [8], eventually producing a representation relevant to solve the machine learning task and output a prediction.

A neural network model \(\mathbf {F} \) can be formalized as the composition of multi-dimensional and parametrized functions \(f_i\), each corresponding to a layer of the network architecture, applied to a representation of the input:

$$\begin{aligned} \mathbf {F}:\varvec{x} \mapsto f_n(...f_2(f_1(\varvec{x},\theta _1), \theta _2)..., \theta _n) \end{aligned}$$
(1)

where each vector \(\theta _i\) parametrizes layer i of the network \(\mathbf {F} \) and includes weights for the links connecting layer i to layer \(i-1\). The set of model parameters \(\theta =\{\theta _i\}\) is learned during training. For instance, in supervised settings, parameter values are fixed by computing prediction errors \(\mathbf {F} (\varvec{x})-\varvec{y}\) on a collection of known input-output pairs \((\varvec{x},\varvec{y})\).
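
For illustration, the following minimal NumPy sketch evaluates such a layered composition with rectifier hidden layers and a softmax output (the architecture used later in this paper); the layer sizes and the random placeholder weights are purely illustrative, not trained parameters.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, params):
    """Evaluate F(x) = f_n(... f_2(f_1(x, theta_1), theta_2) ..., theta_n).
    params is a list of (W, b) pairs, one per layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)        # hidden layers f_1, ..., f_{n-1}
    W, b = params[-1]
    return softmax(W @ h + b)      # output layer f_n

# illustrative two-class network with two hidden layers of four neurons
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 8)), np.zeros(4)),
          (rng.normal(size=(4, 4)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
print(forward(rng.normal(size=8), params))   # two probabilities summing to 1
```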

2.3 Adversarial Machine Learning

DNNs, like numerous machine learning models, have been shown to be vulnerable to adversarial manipulations of their inputs [36]. Adversarial goals thereby vary from simple misclassification of the input into a class different from the legitimate source class, to source-target misclassification, where samples from any source class are misclassified into a chosen target class. The space of adversaries was formalized for multi-class deep learning classifiers in a taxonomy [25]. Adversaries can also be taxonomized by the knowledge of the targeted model they must possess to perform their attacks.

Crafting an adversarial example \(\varvec{x^*}\)—misclassified by model \(\mathbf {F} \)—from a legitimate sample \(\varvec{x}\) can be formalized as the following problem [36]:

$$\begin{aligned} \varvec{x^*}=\varvec{x}+\delta _{\varvec{x}}=\varvec{x}+\min \Vert \varvec{z}\Vert \ \mathtt { s.t. }\ \mathbf {F} (\varvec{x}+\varvec{z}) \ne \mathbf {F} (\varvec{x}) \end{aligned}$$
(2)

where \(\delta _{\varvec{x}}\) is the minimal perturbation \(\varvec{z}\) yielding misclassification, according to a norm \(\Vert \cdot \Vert \) appropriate for the input domain.

Due to the non-linearity and non-convexity of models learned by DNNs, a closed-form solution to this problem is hard to find. Thus, algorithms were proposed to select perturbations that approximately minimize the optimization problem stated in Eq. 2. The fast gradient sign method introduced by Goodfellow et al. [9] linearizes the model's cost function around the input to be perturbed and selects a perturbation by differentiating this cost function with respect to the input itself, rather than the network parameters as is traditionally done during training. The forward derivative based approach introduced by Papernot et al. [25] evaluates the model's output sensitivity to each input component using its Jacobian matrix. From this, a saliency map is derived that ranks the individual features by their influence on a particular class.
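
To make the fast gradient sign method concrete, the following sketch applies it to a toy softmax-regression model, for which the gradient of the cross-entropy loss with respect to the input has a simple closed form; the toy model, its parameters W and b, and the step size eps are illustrative assumptions and not part of the attacks in [9, 25].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, y_onehot, W, b, eps=0.1):
    """Fast gradient sign method on a toy softmax-regression model.
    W has shape (d, K), b shape (K,); for this linear model the cross-entropy
    gradient with respect to the input is W @ (p - y), so no autodiff is needed."""
    p = softmax(W.T @ x + b)
    grad_x = W @ (p - y_onehot)          # d(loss)/dx
    return x + eps * np.sign(grad_x)     # step in the direction that increases the loss

# toy usage with random parameters
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 2)), np.zeros(2)
x, y = rng.normal(size=5), np.array([1.0, 0.0])
x_adv = fgsm(x, y, W, b)
```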

All previously described attacks are white-box attacks, since they require access to the differentiable model. Additionally, a black-box attack leveraging both of the previous approaches to target unknown, remotely hosted DNNs was proposed in [24]. The attack first approximates the targeted model by querying it for output labels to train a substitute model, which is then used to craft adversarial examples that are also misclassified by the originally targeted model.

Several approaches have also been presented in the literature to harden classifiers against such crafted inputs. Goodfellow et al. [9] employed explicit training with adversarial examples. Papernot et al. [27] proposed distillation as another potential defense, of which a simpler alternative—label smoothing—was investigated by Warde-Farley et al. [38]. Since both adversarial training and distillation have only been investigated in the image classification setting, we will evaluate their performance for malware detection in Sect. 5.

3 Methodology

This section describes our approach to adversarial crafting for malware detection. We start by describing the data and how we train and configure the DNNs. Thereafter, we describe in detail how we craft adversarial examples and how the perturbation search during adversarial example crafting needs to be adapted to our setting of malware detection.

3.1 Application Model

In the following, we describe the representation of applications that we use as input to our malware detector. In this work, we focus on statically determined features of applications. By a feature, we understand some property that the statically evaluated code of the application exhibits, for example whether the application uses a specific system call, a specific hardware component, or access to the Internet.

A natural way to represent such features is using binary indicator vectors: given features \(1,\ldots ,M\), we represent an application using the binary vector \(\mathbf {X} \in \{0,1\}^M\), where \(\mathbf {X} _i\) indicates whether the application exhibits feature i (\(\mathbf {X} _i=1\)) or not (\(\mathbf {X} _i=0\)). Due to the varied nature of the available applications, M will typically be very large, while each single application exhibits only very few features relative to the entire feature set. This leads to very sparse feature vectors and, overall, a very sparsely populated space of applications in which we try to separate malicious from benign applications.
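
As a small illustration, an application's extracted feature set maps to the indicator vector \(\mathbf {X} \) as sketched below; the feature vocabulary shown is hypothetical, and in practice, with M = 545,333 in our case, a sparse representation would be preferable.

```python
import numpy as np

# hypothetical feature vocabulary mapping each feature to its index; M = len(vocabulary)
vocabulary = {"permission.SEND_SMS": 0,
              "permission.INTERNET": 1,
              "hardware.camera": 2,
              "api_call.getDeviceId": 3}

def to_indicator_vector(app_features, vocabulary):
    """Binary indicator vector X in {0,1}^M for one application."""
    x = np.zeros(len(vocabulary), dtype=np.uint8)
    for feature in app_features:
        if feature in vocabulary:
            x[vocabulary[feature]] = 1
    return x

print(to_indicator_vector({"permission.INTERNET", "hardware.camera"}, vocabulary))  # [0 1 1 0]
```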

3.2 Training the Malware Classifier

In this section, we describe how we train a malware detector using DNNs.

While Dahl et al. [6] use a neural network to classify malware, their approach relies on random projections and dynamic data. Since perturbing dynamically gathered features is far more challenging than modifying static ones, we consider the simpler, static case in this work and leave the dynamic case for future work. Saxe and Berlin [31] also proposed a well-performing detection system based on a neural network, which is, to the best of our knowledge, not publicly accessible.

We therefore train our own neural network malware detection system. This also enables us to consider a worst-case attacker having full knowledge of the model and training data.

Since the binary indicator vector \(\mathbf {X} \) we use to represent an application does not possess any particular structural properties or interdependencies, as, for example, images do, we apply a regular feed-forward neural network as described in Sect. 2 to solve our malware classification task.

We use a rectifier as the activation function for each hidden neuron in our network. As output, we employ a softmax layer for normalization of the output probabilities. The output is thus computed as

$$\begin{aligned} \mathbf {F} _i(\mathbf {X}) = \frac{e^{x_i}}{e^{x_0}+e^{x_1}}~,~ x_i = \sum _{j=1}^{m_{n}}w_{j,i}\cdot x_j+b_{j,i} \end{aligned}$$
(3)

To train our network, we use standard gradient descent and standard dropout.
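
A possible setup of such a network is sketched below; the framework (Keras), the dropout rate and the learning rate are our own illustrative choices, while the rectifier activations, softmax output, two hidden layers of 200 neurons (cf. Sect. 4.1), gradient descent and dropout follow the text.

```python
import tensorflow as tf

M = 545333   # number of binary input features (DREBIN, cf. Sect. 4)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(M,)),
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dropout(0.5),                     # dropout rate is an assumption
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(2, activation='softmax'),   # [F_0 = benign, F_1 = malicious]
])

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_batch, y_batch, ...)   # trained on batches mixed at a fixed malware ratio (Sect. 4.1)
```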

Algorithm 1. Pseudo-code of the adversarial example crafting procedure (described in Sect. 3.3).

3.3 Crafting Adversarial Malware Examples

We next describe the algorithm that we use to craft adversarial examples against the malware detector we trained in the previous section. The goal of adversarial example crafting in malware detection is to mislead the detection system, causing the output of the classifier for a particular application to change according to the attacker’s goal.

More formally, we start with a binary indicator vector \(\mathbf {X} \in \{0,1\}^m\) that indicates which features are present in an application. Given \(\mathbf {X} \), the classifier \(\mathbf {F} \) returns a two-dimensional vector \(\mathbf {F} (\mathbf {X})= \left[ \mathbf {F} _0(\mathbf {X}),\mathbf {F} _1(\mathbf {X})\right] \) with \(\mathbf {F} _0(\mathbf {X})+\mathbf {F} _1(\mathbf {X}) = 1\) that encodes the classifier's belief that \(\mathbf {X} \) is either benign (\(\mathbf {F} _0(\mathbf {X})\)) or malicious (\(\mathbf {F} _1(\mathbf {X})\)). We take as the classification result y the option with the higher probability, i.e. \(y = {\mathop {\mathop {\mathsf {arg~max}}}\nolimits _{i}} \mathbf {F} _i(\mathbf {X})\). The goal of adversarial example crafting is now to find a small perturbation \(\delta \) such that the classification result \(y'\) of \(\mathbf {F} (\mathbf {X} +\delta )\) differs from the original result, i.e. \(y'\ne y\). We denote \(y'\) as our target class in the adversarial example crafting process.

Our goal is to have a malicious application classified as benign, i.e. given a malicious input \(\mathbf {X} \), the classification result should be \(y'=0\). Note that our approach naturally extends to the symmetric case of misclassifying a benign application.

We adopt the adversarial example crafting algorithm based on the Jacobian matrix

$$ \mathbf {J} _\mathbf {F} = \frac{\partial \mathbf {F} (\mathbf {X})}{\partial \mathbf {X}} = \left[ \frac{\partial \mathbf {F} _i(\mathbf {X})}{\partial \mathbf {X} _j}\right] _{i\in \{0,1\}, j\in [1,m]} $$

of the neural network \(\mathbf {F} \), put forward by Papernot et al. [25]. Although it was originally defined for images, we show that a careful adaptation to a different domain is possible. Note, in particular, that this approach is not restricted to the specific DNN we described in the previous section, but applies to any differentiable classification function \(\mathbf {F} \).

To craft an adversarial example, we take two main steps. In the first, we compute the gradient of \(\mathbf {F} \) with respect to \(\mathbf {X} \) to estimate the direction in which a perturbation of \(\mathbf {X} \) would change \(\mathbf {F} \)'s output. In the second step, we choose a perturbation \(\delta \) of \(\mathbf {X} \) with maximal positive gradient into our target class \(y'\). For malware misclassification, this means that we choose the index \(i = {\mathop {\mathop {\mathsf {arg~max}}}\nolimits _{j\in [1,m], \mathbf {X} _j=0}} \frac{\partial \mathbf {F} _0(\mathbf {X})}{\partial \mathbf {X} _j}\), i.e. the feature whose addition maximizes the change towards our target class 0. We repeat this process until either (a) we reach the limit on the maximum number of allowed changes, or (b) we successfully cause a misclassification. A pseudo-code implementation of the algorithm is given in Algorithm 1.

Ideally, we keep each change small to make sure that we do not cause a negative change of \(\mathbf {F} \) due to intermediate changes of the gradient. For computer vision, this is not an issue, since pixel values are continuous and can be changed in steps as small as the encoding of the image permits. In the malware detection case, however, we do not have continuous but discrete input values: since \(\mathbf {X} \in \{0,1\}^m\) is a binary indicator vector, our only option is to increase one component of \(\mathbf {X} \) by exactly 1 to retain a valid input to \(\mathbf {F} \). This motivates our changes to the original algorithm in [25].

Note finally that we only consider positive changes for positions j at which \(\mathbf {X} _j=0\), which corresponds to adding features to the application represented by \(\mathbf {X} \) (since \(\mathbf {X} \) is a binary indicator vector). We discuss this choice in the next subsection.
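
A minimal sketch of this crafting loop is given below, assuming a Keras model as sketched in Sect. 3.2 and using TensorFlow's automatic differentiation to obtain the gradient of \(\mathbf {F} _0\); the helper name, the allowed_mask argument (anticipating the manifest-feature restriction of Sect. 3.4) and all implementation details are ours, not the original code.

```python
import numpy as np
import tensorflow as tf

def craft_adversarial(model, x, k=20, allowed_mask=None):
    """JSMA-style crafting for binary features: repeatedly add (0 -> 1) the
    editable feature with the largest positive gradient towards the benign
    class, until misclassification or k changes."""
    x_adv = x.astype(np.float32).copy()
    if allowed_mask is None:
        allowed_mask = np.ones_like(x_adv, dtype=bool)
    for _ in range(k):
        inp = tf.convert_to_tensor(x_adv[None, :])
        with tf.GradientTape() as tape:
            tape.watch(inp)
            p_benign = model(inp)[0, 0]               # F_0(X): probability of class 0 (benign)
        grad = tape.gradient(p_benign, inp).numpy()[0]
        candidates = (x_adv == 0) & allowed_mask      # only absent, editable features may be added
        if not candidates.any():
            break
        grad[~candidates] = -np.inf
        i = int(np.argmax(grad))
        if grad[i] <= 0:                              # no positive gradient towards class 0 left
            break
        x_adv[i] = 1.0                                # add feature i (one line of code in the app)
        if np.argmax(model(x_adv[None, :]).numpy()[0]) == 0:
            break                                     # classified as benign: success
    return x_adv
```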

3.4 Restrictions on Adversarial Examples

To make sure that the modifications caused by the above algorithm do not change the application too much, we bound the maximum distortion \(\delta \) applied to the original sample. As in the computer vision case, we only allow distortions \(\delta \) with \(||\delta ||\le k\). We differ, however, in the norm that we apply: in computer vision, the \(L_\infty \) norm is often used to bound the maximum change. In our case, each modification to an entry always changes its value by exactly 1, and we thus use the \(L_1\) norm to bound the overall number of features modified. We further bound the number of modified features by \(k=20\) (see Appendix B for details).

While the main goal of adversarial example crafting is to achieve misclassification, in malware detection this cannot happen at the cost of the application's functionality: feature changes determined by Algorithm 1 can cause the application in question to lose its malware functionality in parts or completely. Additionally, interdependencies between features can cause a single line of code added to a malware sample to change several features at the same time. We discuss this issue in more detail in Appendix A.

To maintain the functionality of the adversarial example, we restrict the adversarial crafting algorithm as follows: first, we only change features that require adding a single line of code to the real application. Second, we only modify manifest features, i.e. features that relate to the AndroidManifest.xml file contained in every Android application. Together, these restrictions ensure that the original functionality of the application is preserved. Note that this approach only makes crafting adversarial examples harder: instead of using the features with the highest impact on misclassification, we skip those that are not manifest features.

4 Experimental Evaluation

We evaluate the training of the neural network based malware detector and adversarial example-induced misclassification of inputs on it. Through our evaluation, we want to validate the following two hypotheses.

First, that the neural network based malware classifier achieves performance comparable to state-of-the-art malware classifiers (on static features) presented in the literature.

Second, that the adversarial example crafting algorithm discussed in Sect. 3.3 allows us to successfully mislead the neural network we trained. As a measure of success, we consider the misclassification rate achieved by this algorithm, defined as the percentage of malware samples that are correctly classified before being altered but classified as benign afterwards.
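
For reference, the misclassification rate can be computed as in the following sketch; the function and argument names are illustrative, and class 1 denotes malware as in Sect. 3.3.

```python
import numpy as np

def misclassification_rate(model, X_malware, X_crafted):
    """Share of malware samples that were correctly detected (class 1) before
    crafting but are classified as benign (class 0) afterwards."""
    before = np.argmax(model.predict(X_malware), axis=1)
    after = np.argmax(model.predict(X_crafted), axis=1)
    detected = before == 1
    return float(np.mean(after[detected] == 0))
```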

We base our evaluation on the DREBIN data set, originally introduced by Arp et al. [2]: DREBIN contains 129,013 Android applications, of which 123,453 are benign and 5,560 are malicious. There are 8 feature classes, containing 545,333 static features in total, each of which is represented by a binary value that indicates whether the feature is present in an application or not. This directly translates to the binary indicator vector \(\mathbf {X} \in \{0,1\}^M\) used to represent applications, with \(M=545,333\). A more detailed breakdown of the DREBIN data set can be found in Appendix B.

4.1 DNN Model

We train numerous neural network architecture variants according to the training procedure described in Sect. 3. Since the DREBIN data set has a fairly unbalanced ratio of malware to benign applications, we experiment with different ratios of malware in each training batch and compare the achieved performance. The number of training iterations is then set such that every malware sample is used at least once. We evaluate the classification performance of each of these networks using accuracy, false negative rate and false positive rate as performance measures. We decided on an architecture consisting of two hidden layers of 200 neurons each and provide more details about the performance of other architectures in a longer version of this paper. Table 1 displays the accuracy as well as the false positive and false negative rates.
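
One simple way to realize a fixed malware ratio per training batch is sketched below; the helper and its default values are illustrative, with labels following the convention 0 = benign and 1 = malicious.

```python
import numpy as np

def make_batch(X_benign, X_malware, batch_size=128, malware_ratio=0.3,
               rng=np.random.default_rng()):
    """Sample one training batch containing a fixed fraction of malware."""
    n_mal = int(round(batch_size * malware_ratio))
    n_ben = batch_size - n_mal
    ben_idx = rng.choice(len(X_benign), size=n_ben, replace=False)
    mal_idx = rng.choice(len(X_malware), size=n_mal, replace=False)
    X = np.concatenate([X_benign[ben_idx], X_malware[mal_idx]])
    y = np.concatenate([np.zeros(n_ben, dtype=int), np.ones(n_mal, dtype=int)])
    perm = rng.permutation(batch_size)
    return X[perm], y[perm]
```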

In comparison, Arp et al. [2] achieve a \(6.1\%\) false negative rate at a \(1\%\) false positive rate. Sayfullina et al. [32] even achieve a \(0.1\%\) false negative rate, however at the cost of \(17.9\%\) false positives. Saxe and Berlin [31] report 95.2% accuracy at a 0.1% false positive rate; the false negative rate is not reported. Zhu et al. [39], finally, applied feature selection and decision trees and achieved 1% false positives and 7.5% false negatives. As we can see, our networks come close to these trade-offs and can thus be considered comparable to the state of the art.

4.2 Adversarial Malware Crafting

Next, we apply the adversarial example crafting algorithm described in Sect. 3 and observe how often the adversarial inputs successfully mislead our neural network based classifiers. As mentioned previously, we quantify the performance of our algorithm through the achieved misclassification rate, which measures the share of previously correctly classified malware that is misclassified after adversarial example crafting. In addition, we also measure the average number of modifications required to achieve misclassification, to assess which architecture is harder to mislead. As discussed above, we allow at most 20 modifications to any malware application.

The performance results are listed in Table 1. As we can see, we achieve misclassification rates from roughly 63% up to 69%. We observe that the malware ratio used in the training batches is correlated with the misclassification rate: a higher malware ratio generally results in a lower misclassification rate.

Table 1. Performance of the classifiers. Given are the malware ratio used in training batches (MWR), accuracy, false negative rate (FNR) and false positive rate (FPR). The misclassification rate (MR) and required average distortion (Dist., in number of added features) with a threshold of 20 modifications are given as well. The last five approaches use the DREBIN data set.

While the set of frequently modified features differs slightly across malware samples, we observe common trends across all networks. For all malware ratios, the most frequently modified features are permissions, which are modified in roughly 30–45% of the cases. Intents and activities come second, modified in 10–20% of the cases.

More specifically, for the network with ratio 0.3, the feature intent.category.DEFAULT was added to 86.4% of the malware samples. In the networks with the other malware ratios, the most modified feature was permission.MODIFY_AUDIO_SETTINGS (82.7% for malware ratio 0.4 and 87% for malware ratio 0.5).

Other frequently modified features include, for example, activity.SplashScreen, android.appwidget.provider and the GPS feature. And while the service_receiver feature was added to many malware samples for all networks, others are specific to each network: for malware ratio 0.3 it is the BootReceiver, for 0.4 the AlarmReceiver, and for 0.5 the Monitor.

Overall, of all features that we decided to modify (i.e. the features in the manifest), only 0.0004%, or 89 features, are used to mislead the classifier. Of this very small set of features, roughly a quarter occurs in more than 1,000 adversarially crafted examples. A more detailed breakdown can be found in Table 2.

Table 2. Feature classes from the manifest and how they were used to provoke misclassification. Values in brackets denote the number of features used in more than 1,000 apps.

Since our algorithm is able to successfully mislead most networks for a large majority of malware samples, we validate the hypothesis that our adversarial example crafting algorithm for malware can be used to mislead neural network based malware detection systems.

5 Defenses

In this section, we investigate the applicability of two previously introduced defense mechanisms—defensive distillation (Papernot et al. [27]) and adversarial training (Szegedy et al. [36])—in the setting of malware classification. We also investigated feature selection as a defense, but leave its description to a longer version of this paper, since it did not yield conclusive results.

To measure the effectiveness of a defense mechanism against adversarial examples, we monitor the misclassification rate, defined as the percentage of malware samples that are misclassified after the application of the adversarial example crafting algorithm but were correctly classified before. We compare these rates between the original network and the network to which the mechanism was applied.

5.1 Distillation

We now investigate distillation, a defense introduced in the context of computer vision, and its applicability to binary, discrete domains such as malware detection. We first introduce the concept of distillation as used by Papernot et al. [27] and then present our evaluation.

While distillation was originally proposed by Hinton et al. [12] as a way to transfer knowledge from large neural networks to smaller ones, Papernot et al. [27] recently proposed using it as a defensive mechanism against adversarial example crafting. They motivate this through its capability to improve the second network's generalization performance (i.e. classification performance on test samples) and its smoothing effect on the decision boundary.

The idea is, in a nutshell, to use an already existing classifier \(\mathbf {F} (\mathbf {X})\) that produces a probability distribution over the classes \(\mathbf {\mathcal {Y}}\). This output is then used as labels to train a second model \(\mathbf {F} '\). Since the new labels contain more information about the data \(\mathbf {X} \) than the simple class labels, the new network will perform similarly to or better than the original network \(\mathbf {F} \). In the original proposal, the second network is smaller than the first one, whereas in Papernot et al.'s approach, both networks are of the same size.

An important detail in the distillation process is the slight modification of the final softmax layer (cf. Eq. 3) in the original network \(\mathbf {F} \): instead of the regular softmax normalization, we use

$$\begin{aligned} \mathbf {F} _i(X) = \left( \frac{e^{z_i(x)/T}}{\sum ^{|\mathbf {\mathcal {Y}}|}_{l=1}e^{z_l(x)/T}}\right) , \end{aligned}$$
(4)

where T is a distillation parameter called temperature. For \(T=1\), we obtain the regular softmax normalization commonly used in training. If T is large, the output probabilities approach a uniform distribution, whereas for small T, the output of \(\mathbf {F} \) becomes more extreme. To achieve a good distillation result, we take the output of the original network \(\mathbf {F} \) produced at a high temperature T and use it to train the new network \(\mathbf {F} '\).
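
A direct NumPy rendering of Eq. 4 is given below; the function name and the stability shift are our additions.

```python
import numpy as np

def softmax_with_temperature(logits, T=10.0):
    """Softmax at temperature T (Eq. 4); T = 1 recovers the regular softmax."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()            # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

print(softmax_with_temperature([2.0, -1.0], T=1.0))    # fairly peaked
print(softmax_with_temperature([2.0, -1.0], T=10.0))   # close to uniform
```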

The overall procedure for hardening our classifier against adversarial examples can thus be summarized in the following three steps.

  1. 1.

    Given the original classifier \(\mathbf {F} \) and the samples \(\mathbf {\mathcal {X}}\), construct a new training data set \(D=\{(\mathbf {X},\mathbf {F} (X)) ~\vert ~ \mathbf {X} \in \mathbf {\mathcal {X}}\}\) that is labeled with \(\mathbf {F} \)’s output at high temperature.

  2. 2.

    Construct a new neural network \(\mathbf {F} '\) with the same architecture as \(\mathbf {F} \).

  3. 3.

    Train \(\mathbf {F} '\) on D.

Note that both step two and step three are performed under the same high temperature T to achieve a good distillation performance.
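
These three steps could be realized as in the following sketch; teacher_logits and build_student_logits are assumed helpers returning Keras models that output pre-softmax logits, and the optimizer, batch size and epoch count are placeholders rather than the settings used in our experiments.

```python
import tensorflow as tf

def distill(teacher_logits, build_student_logits, X_train, T=10.0):
    """Defensive distillation sketch: label the data with the teacher's
    softened output at temperature T, then train a same-architecture student
    whose softmax also runs at temperature T."""
    # Step 1: soft labels from the original network F at high temperature
    soft_labels = tf.nn.softmax(teacher_logits.predict(X_train) / T, axis=-1).numpy()
    # Step 2: a new network F' with the same architecture, softmax at the same temperature
    student_logits = build_student_logits()
    inputs = tf.keras.Input(shape=X_train.shape[1:])
    probs = tf.keras.layers.Lambda(lambda z: tf.nn.softmax(z / T))(student_logits(inputs))
    student = tf.keras.Model(inputs, probs)
    student.compile(optimizer='sgd', loss='categorical_crossentropy')
    # Step 3: train F' on the soft labels
    student.fit(X_train, soft_labels, batch_size=128, epochs=10)
    return student_logits   # at test time, apply a plain softmax (T = 1) to these logits
```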

Fig. 2. False negative rates, misclassification rates and average required distortions after applying distillation; the original networks are the baseline. For FNR and misclassification rate, higher is better. Average distortion should be negative.

Fig. 3. Misclassification rates, false negative rates and average required distortion achieved on adversarially trained networks. The regular networks' performance is given as a baseline, indicated by horizontal lines.

Evaluation. We now apply the above procedure to our originally trained classifiers and examine the impact of distillation as a defensive mechanism against adversarial examples in the domain of malware detection. Figure 2 shows the effects of distillation on misclassification compared to the original models. We use a rather low temperature of 10, since we observe a strong decrease in accuracy when distilling at higher temperatures. In general, we observe a strong increase of the false negative rate and a slight increase of the false positive rate; for ratio 0.5, the latter rises from 4 to 6.4, whereas it is unchanged for ratio 0.3. The accuracy varies between 93% and 95%.

We further observe that the misclassification rate drops, in some cases significantly, e.g. to 38.5% for ratio 0.4. The difference in the average number of perturbed features, however, is rather small: it ranges from 14 for ratio 0.3 to 16 for the other two ratios.

Using distillation, we can thus strengthen the neural network against adversarial examples. However, the misclassification rates are still around \(40\%\), and we pay for this robustness with a weaker classifier. The effect is furthermore not as strong as on computer vision data: Papernot et al. [27] reported rates around \(5\%\) after distillation for images. We also observed that higher temperatures (\({>}25\)), as used in computer vision settings, strongly harm accuracy.

5.2 Adversarial Training

We now apply adversarial training and investigate its influence on the robustness on the resulting classifier. As before, we first introduce the technique of adversarial training and then report the results we observed.

Adversarial training means additionally training our classifier with adversarially crafted samples. This method was originally proposed by Szegedy et al. [36] and involves the following steps:

  1. 1.

    Train the classifier \(\mathbf {F} \) on the original data set \(D = B \cup M\), where B is the set of benign and M the set of malicious applications.

  2. 2.

    Craft adversarial examples A for \(\mathbf {F} \) using the forward gradient method described in Sect. 3.3

  3. 3.

    Iterate additional training epochs on \(\mathbf {F} \) with the adversarial examples from the last step as additional, malicious samples.

By applying adversarial training, we aim to improve the model's generalization, i.e. its predictions for samples outside the training set. Good generalization generally makes a classifier less sensitive to small perturbations, and therefore also more resilient to adversarial examples.
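
A sketch of this procedure is given below, assuming a compiled Keras model trained with integer labels (0 = benign, 1 = malicious) and the crafting routine from Sect. 3.3 passed in as craft_fn; the batch mixing at the network's malware ratio used in our evaluation is simplified to a plain additional epoch here.

```python
import numpy as np

def adversarial_training(model, X_benign, X_malware, craft_fn, n_adv=100):
    """Continue training the classifier with n_adv adversarially crafted
    malware samples as additional malicious training data."""
    idx = np.random.choice(len(X_malware), size=n_adv, replace=False)
    X_adv = np.stack([craft_fn(model, X_malware[i]) for i in idx])   # step 2
    X_train = np.concatenate([X_benign, X_malware, X_adv])           # step 3: augmented data
    y_train = np.concatenate([np.zeros(len(X_benign), dtype=int),
                              np.ones(len(X_malware) + n_adv, dtype=int)])
    model.fit(X_train, y_train, batch_size=128, epochs=1)            # one additional epoch
    return model
```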

Evaluation. We now present the results of applying adversarial training to our networks. We continued training each network with \(n_1 = 20\), \(n_2=100\), or \(n_3=250\) additional adversarial examples, which we mixed with benign samples into training batches at each network's malware ratio. We then trained each network for one more epoch on such training batches and re-evaluated its susceptibility to adversarial examples.

Figure 3 illustrates the performance (false negative rate) of the adversarially trained networks and the success Algorithm 1 achieved against them (misclassification rate and average required distortion). We group the networks by the malware ratio used during training.

For the networks trained with malware ratios 0.3 and 0.4, we observe a reduction of the misclassification rate and an increase of the required average distortion for \(n_1\) and \(n_2\) additional training samples. For instance, we achieve a misclassification rate of \(67\%\) for the network trained with 100 additional samples at 0.3 malware ratio, down from \(73\%\) for the original network. A further increase in the number of adversarial examples used for training, however, causes the misclassification rate to increase again, to \(79\%\) for both malware ratios.

For the networks trained with malware ratio 0.5, the misclassification rate only decreases if we use 250 adversarial training samples. Here, we reach a \(68\%\) misclassification rate, down from \(69\%\) for the original network. For smaller numbers of adversarial examples, the misclassification rate remains very similar to the original case. It seems that the network trained with 0.5 malware ratio fits very closely to the malware samples it was trained on and therefore requires more adversarial examples to generalize and improve its robustness against adversarial example crafting.

Overall, we can conclude that simple adversarial training does improve the neural network's robustness against adversarial examples. The number of adversarial examples required to improve robustness depends heavily on the parameters chosen for training the original networks. However, choosing too many may further degrade the network's robustness against adversarial examples. This is likely explained by the fact that when too many adversarial examples are used for training, the neural network overfits to the particular perturbation style used to craft them.

5.3 Summary and Discussion of Evaluation

We evaluated two potential defensive mechanisms, adversarial training and distillation.

Adversarial training achieved a consistent reduction of misclassification rates across different models. The number of adversarial training samples has a significant impact on this reduction. Iteratively applying adversarial training to a network may further improve its robustness. Unfortunately, this defense is only effective against the perturbation styles that are fed to the model during training.

Distillation does have a positive effect, but does not perform as well as in the computer vision setting. It remains unclear whether this is due to the binary nature of the data or to its unbalanced classes; we leave this question as future work.

Finally, we note that these defenses are non-adaptive: an adversary may exploit knowledge of the defense deployed to evade it.

6 Related Work

The following discussion of related work complements the references included in Sect. 2. The security of machine learning is an active research area [26]. Barreno et al. [3] give a broad overview of attacks against machine learning systems. Previous work showed that adversarial examples can be constructed for different algorithms and also generalize between machine learning techniques in many cases [4, 9, 21, 24, 30, 36].

Many more defenses have been developed than those considered here; we therefore focus on the main ideas relevant to neural networks and malware. There are many more variants of adversarial training [11, 15, 23], all slightly differing in their objectives from the original version introduced by Goodfellow et al. [9].

Other approaches include blocking the gradient flow [37], changing the activation function [17], or directly classifying adversarial examples as out of distribution [7, 10, 13, 22]. Finally, the use of statistical methods has also been investigated [19, 28]. An exhaustive study of defenses in a single article is not feasible due to the variety and number of approaches; we thus focused on the two most promising ones.

Regarding adversarial examples for malware, Hu and Tan [14] propose another approach to generate adversarial examples, which is, however, based on generative adversarial networks.

Further, Biggio et al. [4] propose a method that is based on gradient descent. They evaluate their adversarial examples similarly to Laskov [18], who show the viability of adversarially crafted inputs against a PDF malware detection system based on random forests. Their adversarial example crafting algorithm, however, focuses on features in the semantic gap between the specific classifier they study and PDF renderers, i.e. this gap includes features that are only considered by the classifier, but not by the renderer. While this allows them to generate unobservable adversarial perturbations, their approach does not generalize to arbitrary classifiers.

In contrast, our approach considers all editable features and identifies those that need to be perturbed in order to achieve misclassification. Our technique is applicable to any differentiable machine learning classifier. While this still requires the identification of suitable application perturbations that correspond to feature perturbations, as we discussed in Sect. 3, this is mostly an orthogonal problem that needs to be solved independently.

7 Conclusion and Future Work

In this paper, we investigated the viability of adversarial example crafting against neural networks in a domain different from computer vision and relevant to core security problems. On the DREBIN data set, we achieved misclassification rates of up to \(69\%\) against models whose classification performance is comparable to state-of-the-art models from the literature, while our adversarial examples have no impact on the malware's functionality. Threat vectors like adversarial examples thus need to be taken into account by defenders.

As a second contribution, we examined two potential defensive mechanisms for hardening our neural networks against adversarial examples. Our evaluation showed the following: first, distillation does reduce misclassification rates, but not as strongly as observed in computer vision settings; second, adversarial training achieves a consistent reduction of misclassification rates across architectures.