1 Introduction
Machine learning (ML) has been widely proposed as a promising technique to address the proliferation of Android malware through rapid, systematic, and large-scale identification of malicious samples and their variants. Several approaches [7, 8, 16, 28, 30, 33, 52] have thus been presented in the literature, where authors explore static, dynamic, or hybrid analysis methods to extract features, as well as a variety of classification algorithms. Although the malicious character of a sample is derived from the app’s behaviour, many approaches rely only on static analysis to produce features that are then assumed to be a proxy representation of the app’s entire code (and hence its behaviour). Such features unfortunately capture only a small predefined portion of information about the app behaviour: they select only a finite set of code attributes, may leave out native code, and so forth. Extracted features for ML-based malware detection therefore offer an imprecise representation of an incomplete view of the app, which is further used as a proxy of application behaviour. Consequently, there is an assumption that this proxy representation still contains enough of the relevant information to discriminate malware. Feature engineering is thus the essential step in malware detection that implements the intelligence of the detection approach, beyond the off-the-shelf classification algorithms that are employed.
In the malware detection literature, state-of-the-art approaches mainly differ by the feature sets they propose. For example, DREBIN [7] is a key reference that builds on a standard SVM algorithm but was novel in proposing a large, domain-specific feature set manually engineered for the problem of Android malware detection. Interestingly, the contributions that stand out in most ML-based malware detection approaches are associated with the feature set. These key contributions, however, are rarely evaluated through in-depth ablation studies in which the value of the feature set is assessed thoroughly. Instead, the literature provides assessment results by reporting detection performance measurements of the overall approach. The state of the practice thus considers that good performance indicators offer sufficient validation of the value of the feature set.
The evaluation of ML-based malware detection approaches has been scrutinised in recent studies. Allix et al. [3] have demonstrated that the performance of a malware detector may drop drastically when it is evaluated in the wild. Spatial and temporal biases have also been pointed out in two independent research works [4, 35]. Although these works attempt initial analyses of whether models can fail to generalise, they do not provide in-depth investigations into how or why. We propose to fill this gap by conducting a tentative analysis of the feature set that is fed to the learners.
Analysing a classifier is an open problem [27]. Our work engages in building a roadmap in this direction for ML-based malware detection. We propose a dissection study that examines the performance of a classifier from different points of view. Our work builds on a reproduction study [11] that considered Android malware detectors from 16 major venues in software engineering, security, and ML. Specifically, we focus on the state-of-the-art malware detector DREBIN, which is the most cited approach that has been successfully replicated. Our investigation attempts to go beyond the quantitative precision and recall detection metrics to highlight other qualitative dimensions of the approach. In particular, we seek answers to the following questions:
To support the claim that its classification decisions are dependable beyond high performance scores, DREBIN outputs explanations intended to show that the model has reasonably captured information relevant to the decision. We propose to dissect these explanations further through various analyses:
Our study explores DREBIN along with its initial dataset but also considers three other datasets that were carefully collected to assess the generalisability of the findings. The insights from this study will help the community better understand the strengths and limitations of the DREBIN contribution, and thus offer a systematisation of knowledge around DREBIN. We expect this study to better drive the exploitation of DREBIN as a key contribution in the literature, towards opening novel research directions and producing reliable and effective models for malware detection.
Among other findings, the study yields that:
❶ The feature set of DREBIN is sufficiently generic to capture enough concepts that are relevant to a diverse set of malware samples across time.
❷ The feature set of DREBIN contains a huge number of id-features (e.g., features such as component names). However, the relevance and importance of each feature remains challenging to quantify w.r.t. the overall performance of DREBIN.
❸ Most features in DREBIN are at best redundant and at worst useless. Indeed, a subset of DREBIN features, smaller than the whole set by three orders of magnitude, is enough to provide performance similar to that of the whole feature set.
❹ DREBIN includes some features which, individually, can offer surprisingly high detection rates on some datasets.
❺ DREBIN explanations do not reflect how much the features contribute to the prediction.
❻ DREBIN explanations are often inconsistent across samples from a specific malware family.
1.1 Background on DREBIN
DREBIN is an Android malware detector that makes use of static analysis and ML techniques to decide if a given Android application is likely to be malware or goodware. This approach has been developed and validated using 5,560 malware samples and 123,453 goodware apps whose compilation dates are all within the period from August 2010 to October 2012.
Leveraging the Manifest file, which is included in each Android app package, DREBIN extracts four sets of string features from this file using the Android Asset Packaging Tool: Hardware components, Requested permissions, App components, and Filtered intents. DREBIN further considers the information contained in the disassembled code of the apps to extract four additional sets of string features: Restricted API calls, Used permissions, Suspicious API calls, and Network addresses.
To feed the extracted sets of features to the classifier, a multi-dimensional vector space is created from the combination of all the features, across the eight categories, that are extracted from the training apps. DREBIN feeds the multi-dimensional vectors of the training apps to a Linear SVM classifier so that it learns the relationship between the inputs (feature vectors) and the outputs (samples are either malware or goodware). The trained classifier is then used to predict the class of new, unseen Android apps (i.e., the test set). In DREBIN’s paper, the authors state that this classifier detects malware apps with a recall of 0.94. The ML algorithm used by DREBIN, SVM, had already been used by many prior works on Android malware detection. Therefore, we assume that the defining contribution behind DREBIN’s performance is its selection of features rather than its choice of algorithm.
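To make this pipeline concrete, the following is a minimal sketch (not the original DREBIN implementation) of how such a binary feature space can be built and fed to a linear SVM with Scikit-learn; the feature dictionaries and labels shown here are hypothetical placeholders.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Each app is represented by the set of string features extracted from it,
# e.g. "requestedpermissionlist_android.permission.send_sms" -> 1 (present).
# The apps and labels below are hypothetical placeholders.
train_apps = [
    {"requestedpermissionlist_android.permission.send_sms": 1,
     "activitylist_com.example.mainactivity": 1},
    {"usedpermissionslist_android.permission.wake_lock": 1},
]
train_labels = [1, 0]  # 1 = malware, 0 = goodware

# Joint binary feature space over all features seen in the training apps.
vectorizer = DictVectorizer(sparse=True)
X_train = vectorizer.fit_transform(train_apps)

# Linear SVM, as in the original DREBIN paper.
clf = LinearSVC(C=1.0)
clf.fit(X_train, train_labels)

# Unseen apps are embedded in the same space; unknown features are ignored.
test_apps = [{"requestedpermissionlist_android.permission.send_sms": 1}]
print(clf.predict(vectorizer.transform(test_apps)))
```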
Besides its effectiveness in detecting malware apps, DREBIN has made a breakthrough in the field by providing explanations of its decisions. DREBIN relies on the weights of the Linear SVM classifier to determine the features that contributed the most to the prediction. The notion of explainability is extended to malware families, where DREBIN explains a family based on the explanations given to the malware samples of that family.
2 Dataset and Statistical Analysis of DREBIN Features
2.1 Dataset
To collect our dataset, we have mainly relied on AndroZoo, a regularly growing collection of more than 16 million Android apps [5]. Our dataset is built by considering malware and goodware samples that span several years (2017, 2018, and 2019), allowing us to ensure that the insights we draw relate to the properties of DREBIN, not to the properties of a given dataset.
For each app it stores, AndroZoo makes available a number, \({vt\_detection}\), that represents the number of VirusTotal-hosted antivirus engines that have detected this app as malware. Our goodware samples are defined as the apps that have \({vt\_detection = 0}\), to ensure that they have not been flagged by any antivirus engine. Following past studies, we have set the \({vt\_detection}\) threshold to 6, which means that our malware apps have been detected by at least six antivirus engines from VirusTotal.
At the time of the experiments, we were able to collect 15,892 malware apps for the 2019 dataset. The same number of malware apps (i.e., 15,892) was collected for the 2018 and 2017 datasets to have comparable settings. As for the goodware, we have considered all the apps that meet our criteria (i.e., the apps that have \({vt\_detection = 0}\)). Consequently, we have collected 175,327, 297,272, and 276,750 apps for 2019, 2018, and 2017 datasets, respectively.
We name 2019_data, 2018_data, and 2017_data the collection of malware and goodware samples that have been collected for 2019, 2018, and 2017 datasets, respectively.
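As an illustration, the selection criteria above can be applied to the AndroZoo index roughly as follows. This is a hedged sketch: the file name and the column names (sha256, dex_date, vt_detection) are assumptions based on the AndroZoo CSV index and should be checked against the current release.

```python
import pandas as pd

# Sketch: filter the AndroZoo index into per-year goodware/malware pools.
# Column names are assumptions and may differ in the current AndroZoo release.
az = pd.read_csv("latest.csv", usecols=["sha256", "dex_date", "vt_detection"])
az["dex_date"] = pd.to_datetime(az["dex_date"], errors="coerce")
year = az["dex_date"].dt.year

goodware_2019 = az[(year == 2019) & (az["vt_detection"] == 0)]
malware_2019 = az[(year == 2019) & (az["vt_detection"] >= 6)]

# Malware pools are capped at 15,892 samples per year for comparable settings.
malware_2019 = malware_2019.sample(n=min(15892, len(malware_2019)), random_state=0)
```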
In addition to the three recent datasets we collect (2019_data, 2018_data, and 2017_data), we also conduct our analysis on DREBIN’s original dataset to verify that our results are generic and stable across time. To this end, we have leveraged DREBIN’s malware dataset, which consists of 5,560 malware apps provided by the original authors upon request. The raw goodware apps are not provided directly by the authors; we have searched for them in AndroZoo using the list of APK SHA256 hashes provided by the authors. Consequently, we were able to collect 57,307 apps (i.e., only 46.42% of the total number of goodware samples used in DREBIN’s paper). To obtain a dataset similar to the one used in the original paper, we have complemented the goodware samples with 66,146 apps from the same period (i.e., August 2010 to October 2012) having \({vt\_detection = 0}\). This dataset is denoted as DREBIN_Like_data. We note that 81 of the 5,560 malware applications failed the feature extraction process. Table 1 summarises the number of applications in our collected datasets.
2.2 DREBIN Replication
To conduct our experiments, we replicated the DREBIN approach [11] and checked its performance on our datasets. We aim to assess DREBIN’s performance as described in the original publication so that we can compare our subsequent analyses against the results of this implementation.
We split each of our four datasets into subsets (80% for training and 20% for testing). We then fit Linear SVM classifiers on the training datasets and calculate the performance scores on the test subsets. We note that all our experiments are performed using the well-known ML framework Scikit-learn.
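The replication protocol can be summarised by the following sketch, which assumes that X is the sparse binary feature matrix and y the malware/goodware labels obtained as in Section 1.1; random placeholder data is used here only to keep the snippet self-contained.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder data standing in for the real DREBIN feature matrix and labels.
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=1000)

# 80/20 split, Linear SVM, and the three reported performance measures.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
clf = LinearSVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1-score: ", f1_score(y_test, y_pred))
```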
We report in Table 2 the number of features used to perform the classification for our four datasets, as well as three performance measures: recall, precision, and F1-score.
As we can see from the table, the three performance measures for the four datasets are very high, especially for 2019_data, which reports an F1-score of 0.979 on a test set of 58,529 Android apps. These scores confirm that DREBIN performs very well, not only with DREBIN_Like_data but also when evaluated with our own collected samples. DREBIN was proposed in 2014 with a dataset of malware and goodware samples from the period August 2010 to October 2012, and it still performs very well on our recently collected datasets.
Evaluating the performance of malware classifiers—like all security systems—is not a straightforward task [44]. Some experiments may provide brilliant results that nonetheless do not reflect how good the classifier is when deployed in realistic settings. The results presented in Table 2 show the performance of DREBIN when the training and test datasets belong to the same year. However, when DREBIN is evaluated with training apps that temporally precede the testing apps, TESSERACT [35] has reported that the performance of this classifier drops remarkably. This realistic setting for Android malware detectors exposed a limitation of DREBIN, raising questions about how it works and to what extent it captures the behaviour of Android apps. Inspired by the results of TESSERACT [35], we aim to perform an in-depth study on DREBIN to gain a better understanding of its learning.
2.3 Statistical Analysis of the Features
Based on the results presented in Table 2, we can see that the four classifiers use a large number of features to perform the classification. Taking the example of 2017_data, DREBIN uses 1,486,191 features, and hence for each sample it creates a 1,486,191-dimensional vector as part of its learning/prediction process. This huge number of features is explained by the fact that DREBIN extracts as many features as it finds in the apps, without any cleaning or any evaluation of the importance or relevance of these features. In practice, not every feature in a dataset carries information useful for discriminating samples; some features might be redundant or irrelevant, and they may add randomness to the results. Given that DREBIN trains an SVM classifier with more than 1 million features, it is unclear which features are responsible for how SVM classifies a given Android app, nor how many features (out of 1 million) are actually needed for it to make this decision.
Another issue that needs to be discussed relates to the quality of some features used by DREBIN. As presented earlier, DREBIN uses string features that belong to the App components and Network addresses categories. For instance, the names of Android components are extracted from the apps. Since these component names are defined and attributed by the developers of the app, using them as features increases the likelihood that DREBIN overfits the dataset and decreases its capability to generalise. Malware developers can easily change the names of the app components they use in their applications (e.g., with simple obfuscation techniques). Consequently, if DREBIN relies on these features to classify malware, attackers can easily bypass its detection. The same applies to the Network addresses category. Note that in their paper, DREBIN’s authors stated that collecting component names may help identify the well-known components of malware. Although this may be relevant within a specific time frame, we are concerned that changes in component names within variants will affect the classifier’s performance.
Analysing the features used by the classifiers in the previous section reveals that, in our four datasets, at least 88.9% of the features belong to the App components and Network addresses categories. These are string features that are chosen by the developer (e.g., a class name) and identify elements that do not necessarily hold any semantic meaning beyond the scope of the sample. We refer to them as “id-features”. Id-features are extracted from component names (Activity, Service, Content Provider, Broadcast Receiver) and Network addresses. Since these features are extracted from the apps of a specific dataset, they are ad hoc to that dataset. Indeed, an id-feature that is present in one dataset may not appear in another, especially as malware evolves. In addition, id-features can be changed without affecting the behaviour of the apps. Consequently, a DREBIN classifier that considers the id-features of a given dataset is likely to perform poorly outside this dataset.
We present in Table 3 the exact number of id-features in 2019_data, 2018_data, 2017_data, and DREBIN_Like_data.
3 Discriminatory Power of Drebin’s Features
As we have established in the previous section, DREBIN uses a huge number of features to perform its prediction task. Consequently, it is difficult to get insights into what DREBIN learns and on what basis it makes its predictions. Specifically, we cannot tell whether DREBIN needs all these features for detection, nor what its main decisive features are. For a given Android app, we are curious to know to what extent DREBIN has captured its malicious/benign behaviour using its set of features. This problem is the main motivation for this section, where we aim to reduce the number of features of our four classifiers while preserving comparable performance.
3.1 What Are the Key Features That Enable the Prediction?
In this section, we conduct several experiments to analyse what the DREBIN classifier learns. Our aim is to create variations of DREBIN classifiers that use a small subset of the initial features and still perform well on the dataset. Analysing these features will make it possible to understand what this classifier captures, as well as the key features that direct its predictions.
3.1.1 Method.
We have developed our own custom feature selection approach to identify the smallest set of features that still yields reasonably high classification scores. Our feature selection technique unfolds in the following four steps:
(1) We split each of our four datasets into subsets (80% for training and 20% for testing), and we train a DREBIN classifier using all the features (i.e., exactly what we have done in the previous section).
(2) With the feature ranking given by SVM, we train classifiers using the features that contribute the most to the prediction, starting with the most important feature and gradually adding the others. Each time we add a feature, we compute the F1-score. This procedure is repeated until a subset of features yields a reasonably high F1-score. In our experimental setup, we consider an F1-score \(\ge 0.8\) to be reasonably high.
(3) We repeat steps (1) and (2) \(K\) times, varying the training and testing sets each time. This step is necessary given that the feature set depends on the training set. At the end of this step, we have \(K\) subsets of features, with each subset offering reasonably high performance.
(4) The union of the \(K\) sets of top features is further reduced by iteratively discarding features that can be removed without any loss in performance. At the end of this step, we therefore have the minimal set of features that can achieve the same performance as the reference classifier trained with the whole union of the \(K\) sets.
We note that with our customised wrapper method, we do not aim to improve the prediction performance nor to optimise DREBIN’s approach. Our aim is to create a classifier that reports good performance while using a small number of features.
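The core of steps (1) and (2) can be sketched as follows. This is a simplified illustration of the wrapper, assuming X and y are the binary feature matrix and labels; the repetition over \(K\) splits and the final pruning pass of steps (3) and (4) are only outlined in comments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def top_feature_subset(X, y, target_f1=0.8, seed=0):
    """Rank features by the weights of a full SVM, then grow the subset
    until the F1-score is reasonably high (steps (1) and (2) above)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    full = LinearSVC().fit(X_tr, y_tr)
    ranking = np.argsort(-np.abs(full.coef_.ravel()))  # most important first

    selected = []
    for idx in ranking:
        selected.append(idx)
        clf = LinearSVC().fit(X_tr[:, selected], y_tr)
        if f1_score(y_te, clf.predict(X_te[:, selected])) >= target_f1:
            break
    return selected

# Steps (3) and (4): repeat over K random splits, take the union of the K
# subsets, then iteratively drop features whose removal does not hurt the
# F1-score (omitted here for brevity).
```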
3.1.2 Results.
We have applied the feature reduction method described above to the four datasets using \({K = 10}\). For each dataset, we have evaluated the performance of DREBIN using the reduced set of features, which we denote Red1. Our evaluation is conducted using 5-fold cross-validation. We present in Table 4 the number of features in Red1 for the four datasets, together with the prediction results: recall, precision, and F1-score.
As can be seen in Table 4, we were able to build DREBIN classifiers that, unlike DREBIN’s replication, use only a small number of features with the 2019_data, 2018_data, 2017_data, and DREBIN_Like_data datasets. These classifiers have been created in such a way that, even though they use a small number of features, they still perform well and report high scores.
For 2019_data, we have been able to use 19 features to obtain a good F1-score (0.908), whereas the DREBIN replication uses 1,230,854 features to achieve an F1-score of 0.979. For 2018_data, only 156 features, again a very small subset (0.01%) of the features used in the DREBIN replication, are needed to recover 88.22% of the F1-score. Similar results have been achieved with 2017_data, where 158 features (0.01% of the initial features) yield good scores (e.g., an F1-score of 0.846), recovering 91.16% of the initial F1-score. As for DREBIN_Like_data, we were able to recover 87.08% of the F1-score using 0.048% of the initial features.
We now assess to what extent the reduced sets of features are indeed relevant for the DREBIN classifier by investigating their information gain. The information gain of a feature represents the amount of information gained about the predicted class by observing this feature alone. It can thus be used to verify whether the reduced feature sets indeed capture an important amount of information.
For each dataset, we have calculated the information gain of all the features and ranked them by this value. If a feature has a good ranking, it captures more information than the other features; consequently, highly ranked features are important to the prediction. We present in Figure 1 the distribution of the rankings (based on information gain values) of the features included in the reduced feature sets, with their median values represented in blue. The distribution of rankings highlights that the features in the reduced sets are among those with the highest information gain.
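The ranking used for Figure 1 can be reproduced with a sketch along the following lines, where information gain is computed as the mutual information between each binary feature and the class label; X, y, and red1_indices (the column indices of the Red1 features) are assumed to come from the previous experiments.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def info_gain_ranks(X, y, red1_indices):
    """Return the information-gain rank (1 = highest) of each Red1 feature."""
    ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    order = np.argsort(-ig)                  # features sorted by info gain
    rank = np.empty(len(ig), dtype=int)
    rank[order] = np.arange(1, len(ig) + 1)  # rank of every feature
    return rank[red1_indices]
```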
3.1.3 Analysis.
As shown previously, we were able to collect sets of features that not only are the most relevant features for the DREBIN classifier but also yield good prediction performance when they are used “alone” to train the classifier. Examining these features can lead to further insights about DREBIN’s discriminative features.
Specifically, studying the distribution of the features in the dataset and their malicious/benign character is a potential lead to understand how the classifier works. In addition, analysing the features helps to validate DREBIN’s capabilities, especially to confirm whether or not it learns general characteristics of goodware and malware apps.
We list in the following the 19 features present in the reduced feature set Red1 of the 2019_data dataset:
(1) activitylist_com.silverbat.knightage.temmidlet
(2) activitylist_com.stub.stub08.appupdateactivity
(3) activitylist_com.twofloorhousedesign.lukoni.adsactivity
(4) activitylist_com.e4a.runtime.android.mainactivity
(5) restrictedapilist_android.telephony.telephonymanager.getline1number
(6) restrictedapilist_android.support.v4.view.accessibility.accessibilitynodeprovidercompat.performaction
(7) usedpermissionslist_android.permission.wake_lock
(8) suspiciousapilist_ljava/lang/runtime;->exec
(9) usedpermissionslist_android.permission.read_phone_state
(10) restrictedapilist_android.provider.settings$system.putstring
(11) usedpermissionslist_android.permission.get_accounts
(12) suspiciousapilist_landroid/telephony/telephonymanager.getsubscriberid
(13) restrictedapilist_android.media.mediaplayer.stop
(14) restrictedapilist_android.os.vibrator.cancel
(15) restrictedapilist_android.net.connectivitymanager.isactivenetworkmetered
(16) requestedpermissionlist_android.permission.read_phone_state
(17) restrictedapilist_android.accounts.accountmanager.getaccounts
(18) suspiciousapilist_landroid/view/keyevent.getdeviceid
(19) restrictedapilist_android.media.audiorecord.<init>
For 2019_data, we can see that the reduced feature set contains four id-features (from the App components category): features 1, 2, 3, and 4. The other features are not id-features, and they can give insights about some activities the app performs.
As for the reduced feature sets of 2018_data, 2017_data, and DREBIN_Like_data, they also contain id-features belonging to both the App components and Network addresses categories. Specifically, id-features represent 61%, 60%, and 77% of Red1 for 2018_data, 2017_data, and DREBIN_Like_data, respectively. We recall that Red1 is the selection of DREBIN’s most important features.
The good results of DREBIN may be explained by code reuse among malware, coupled with the significant presence of id-features, which excel at capturing elements shared across apps. Specifically, when the code of a malware app is reused, the attacker may not change the app component features (e.g., activity names), which results in DREBIN predicting an application as malware because it contains the same component name as the app from which the code was reused (but also because of other relevant features when they are present in the app).
To illustrate the impact of id-features on the prediction, we have examined a malware APK that contains one id-feature from Red1 (i.e., activitylist_com.silverbat.knightage.temmidlet). This application is predicted by DREBIN as malware. However, when we make a slight modification to the app by changing the name of this activity to another arbitrary name, the app escapes DREBIN’s detection. Note that changing the name of this id-feature does not impact the actual functioning of the app; it is a simple name chosen by the developer (or the attacker). Consequently, malware apps can easily evade detection by changing the names of the id-features that DREBIN considers relevant, which raises a concern about generalisability.
Our method allows us to identify the most important features for the dataset at hand. The number and composition of the features we have collected differ from one dataset to another. This makes the reduced feature sets dependent on the datasets, as we have noted with the four datasets (cf. Table 4).
3.2 Are All the Features Needed to Achieve Similar Results?
The performance achieved using our proposed reduced sets of features is inferior to that of the full DREBIN. We are curious to know how many features are effectively needed to achieve similar performance.
3.2.1 Method.
For this experiment, we select the features based on the information gain as follows:
(1) We split each of our four datasets into subsets (80% for training and 20% for testing), and we calculate the F1-score using all the features (the ones constructed based on the training set).
(2) We calculate the information gain of all the features and rank them in descending order.
(3) We train SVM classifiers with these features, starting with the feature that has the highest rank and gradually adding the others. Each time we add a feature, we calculate the F1-score. This procedure is repeated until the difference between this F1-score and the one obtained in step (1) is smaller than 0.01.
(4) The features from the previous step are further reduced by iteratively discarding features that can be removed without any loss in performance. At the end of this step, we have the minimal set of features that can achieve the same performance as our replicated DREBIN. The resulting features are denoted as Red2.
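A simplified sketch of steps (1) to (3) is given below; it assumes a fixed training/testing split (X_tr, y_tr, X_te, y_te) and omits the final pruning pass of step (4).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def info_gain_subset(X_tr, y_tr, X_te, y_te, tol=0.01):
    """Grow a feature subset, ranked by information gain, until its F1-score
    is within `tol` of the classifier trained with all the features."""
    full_f1 = f1_score(y_te, LinearSVC().fit(X_tr, y_tr).predict(X_te))
    ig = mutual_info_classif(X_tr, y_tr, discrete_features=True, random_state=0)
    ranking = np.argsort(-ig)                # highest information gain first

    selected = []
    for idx in ranking:
        selected.append(idx)
        clf = LinearSVC().fit(X_tr[:, selected], y_tr)
        if full_f1 - f1_score(y_te, clf.predict(X_te[:, selected])) < tol:
            break
    return selected  # candidate Red2, before the final pruning pass
```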
3.2.2 Results.
We apply the previous method to our four datasets and report the results in Table 5. We note that in this setup it is not possible to use cross-validation, since we aim to compare the features and scores of two dependent classifiers: one uses all the features (which are specific to and depend on the training data), and the other is trained with features that are an optimisation of the first classifier’s features. Specifically, our results in Table 5 are compared to the results in Table 2.
Overall, the results demonstrate that only a small set of features is needed to report scores as high as those achieved with all the features. We were able to reduce the initial feature set by at least 99.9% for 2019_data, 2017_data, and DREBIN_Like_data. As for 2018_data, we observe that the number of features in Red2 is higher, which can be explained by the diversity and composition of this dataset (i.e., it contains the highest number of goodware apps).
We remind the reader that our aim through this investigation is not to replace DREBIN’s features with those that we collect. Instead, we aim to identify DREBIN’s most relevant features—that is, those that are necessary to its predictions. Such features can be further analysed to get insights about DREBIN’s capabilities.
3.2.3 The Impact of DREBIN’s Features on the Learning Process.
To get insights into how the performance of DREBIN evolves with the number of features, we provide in Figure 2 plots that show the impact on the prediction of gradually adding features ranked by information gain. Each dataset is split into two subsets (80% for training and 20% for testing). We calculate the recall, precision, and F1-score, starting with the most important feature and gradually adding the other features based on their information gain rank. For each dataset, the first graph plots recall, precision, and F1-score as a function of the number of ranked features used in the prediction. The plots on the right zoom in on recall, precision, and F1-score for the first 4,000 features.
As we can see from the first plots, the prediction performance is generally stable after the first highly ranked features for the four datasets. The recall and the precision may increase at certain points (after adding more features), but they evolve in opposite directions. The F1-score, which captures both recall and precision, is almost stable as more features are added. The zoom plots confirm this observation: with the top 4,000 highly ranked features, the performance metrics are generally high and close to recall_all, precision_all, and f1_all, which we represent in the plots with straight lines. These metrics are calculated using all the features from the training datasets.
These results suggest that DREBIN is mainly driven by a small number of dominant features that have the highest information gain, while it is hardly affected by the huge number of features with lower information gain. We also observe that the detection performance does not monotonically increase with the addition of features. Indeed, for three of our datasets, there is actually a drop in performance after a certain number of features are included, and it can take up to 1 million additional features to come back to, and marginally improve on, the performance obtained with the Red2 features.
3.2.4 ONE-Feature DREBIN.
In the previous section, we have seen that many features have a minor impact on DREBIN’s prediction. Based on Figure 2, we also observe that a single feature (i.e., the feature with the highest information gain) enables DREBIN to yield a high detection performance for 2019_data and 2018_data. As for 2017_data and DREBIN_Like_data, the detection performance of DREBIN using this single feature is null. We present in Table 6 the prediction scores of DREBIN trained using only the feature with the highest information gain, evaluated with 5-fold cross-validation.
We observe that the overall performance of DREBIN using this ONE feature is surprisingly good for 2019_data and 2018_data. For both datasets, the ONE feature comes from DREBIN’s “Suspicious API Calls” category, which contains API calls that access sensitive data or resources. This feature is landroid/content/context.getsystemservice. After inspecting this feature’s presence in 2019_data apps, we found that it occurs in 17.3% of the malware and 97.7% of the benign apps. In 2018_data, this feature occurs in 38.6% of the malware and 98% of the benign apps. In these two datasets, the ONE feature is thus present in a large majority of the benign apps and is less frequent in malware apps (i.e., it is present in less than 50% of the malware).
Concerning 2017_data and DREBIN_Like_data, their ONE features are restrictedapilist_android.support.v4.view.accessibility.accessibilitynodeprovidercompat.performaction and requestedpermissionlist_android.permission.send_sms, respectively. The ONE feature of 2017_data is present in 8.8% of the malware and 69.6% of the goodware. The DREBIN_Like_data ONE feature exists in 53.4% of the malware and 4.3% of the benign apps. For these two datasets, the performance of DREBIN using the ONE feature is null.
We have also tested the ONE feature of 2019_data and 2018_data on both 2017_data and DREBIN_Like_data, but the detection performance is again null. The examination of this feature’s presence in 2017_data and DREBIN_Like_data reveals that it is present in 75.1% and 81.6% of their malware apps and in 96.9% and 72% of their benign apps, respectively. This observation suggests that when DREBIN is trained using a feature that is predominant in goodware apps (or malware apps), it tends to associate its presence with the benign class (or the malware class).
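The ONE-feature experiment reduces to the following sketch, assuming X is the sparse binary feature matrix, y the labels, and best_idx the column index of the feature with the highest information gain.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def one_feature_drebin(X, y, best_idx):
    """Train DREBIN on a single feature and report its class-wise presence."""
    X_one = X[:, [best_idx]]
    f1 = cross_val_score(LinearSVC(), X_one, y, cv=5, scoring="f1").mean()

    col = np.asarray(X_one.todense()).ravel()
    presence_in_malware = col[np.asarray(y) == 1].mean()
    presence_in_goodware = col[np.asarray(y) == 0].mean()
    return f1, presence_in_malware, presence_in_goodware
```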
4 Analysis of DREBIN Classifier Explanations
In this section, we aim to perform an in-depth analysis of DREBIN’s explanations to assess how well this approach explains its predictions (Section 4.2) and how consistent the explanations given by DREBIN are for malware samples of the same family (Section 4.3). Before diving into the analysis, we first present in Section 4.1 an overview of the malware families present in our datasets.
4.1 An Overview of Malware Families
In this section, we present and study the distribution of malware families in our four datasets. For each malware sample, we collected from VirusTotal the detection reports provided by the hosted antivirus engines. Then, we leveraged AVCLASS [40] to infer a unique malware family label for each sample. We present in Figure 3 the distribution of malware families in 2019_data, 2018_data, 2017_data, and DREBIN_Like_data as provided by AVCLASS.
We notice a dominant malware family in three datasets. The jiagu family represents 78.6%, 50.24%, and 28.29% of the malware in 2019_data, 2018_data, and 2017_data, respectively. The difference in the distribution of this malware family across the three datasets can explain the very good results we were able to achieve for 2019_data. Indeed, in this specific dataset, around 80% of the malware samples are from the same family; we postulate that by classifying these samples correctly, DREBIN can exhibit a very good recall. With this dataset, we were also able to retrieve a small number of features, either to get insights about DREBIN’s learning or to report scores similar to those obtained when all the features are used. The other datasets required more features to reach similar performance, potentially due to the diversity of apps in these datasets. For our next experiments, we limit ourselves to DREBIN_Like_data malware, since this dataset is the most diverse in terms of represented malware families.
4.2 How Informative Are the Explanations Given by DREBIN?
In this section, we aim to assess the quality of the explanations provided by the DREBIN classifier. Specifically, we examine whether the information contained in these explanations is complete and adequately explains the prediction. We use “positive features” to refer to the features that have a positive SVM weight. DREBIN suggests that the top k positive features can be used to explain the prediction of malware applications. We formulate the following hypothesis.
To verify our hypothesis, we examine two components:
(1) The contribution of the top positive features, represented by their SVM weights;
(2) The number of top positive features to use in the explanation.
4.2.1 The Contribution of DREBIN Top Positive Features.
One key contribution of DREBIN is its ability to explain the detection by outputting the features that have the highest SVM weights after applying its linear transformation.
An example of a DREBIN top positive feature for a malware app that belongs to the “droidkungfu” family in DREBIN_Like_data is (0.966, ’intentfilterlist_android.intent.action.sig_str’). In this example, 0.966 represents the SVM weight associated with the feature ’intentfilterlist_android.intent.action.sig_str’.
DREBIN provides the top k (with \(k=5\) in the experiments of the original paper) features that are relevant to the prediction, along with their weight values. However, this information may not be sufficient to infer to what extent each feature was actually important (in terms of its overall contribution to the detection, compared with all other features). Indeed, the contribution of the top five features may still be lower than the contribution of all other features combined. For instance, the value of 0.966 in this example does not reflect how powerful ’intentfilterlist_android.intent.action.sig_str’ is compared to the other positive features. The raw SVM weights reported by DREBIN therefore do not quantify the importance of each individual feature to the prediction.
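One way to put the raw weights into perspective, sketched below, is to compute each present feature’s contribution \(w_j x_j\) to the SVM decision and normalise it by the total positive contribution. Here, clf is assumed to be the fitted LinearSVC, x the sparse feature vector of the app, and feature_names the vocabulary of the vectoriser.

```python
import numpy as np

def explain(clf, x, feature_names, k=5):
    """Top-k positive features with their raw contribution and their share
    of the total positive contribution to the decision function."""
    contrib = x.toarray().ravel() * clf.coef_.ravel()   # w_j * x_j per feature
    positive = np.where(contrib > 0)[0]
    total_pos = contrib[positive].sum()
    top = positive[np.argsort(-contrib[positive])][:k]
    return [(feature_names[j], contrib[j], contrib[j] / total_pos) for j in top]
```

Under such a normalisation, a weight of 0.966 that accounts for, say, 10% of the total positive contribution would read very differently from one that accounts for 60%.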
4.2.2 The Number of DREBIN’s Top Positive Features.
Another aspect we aim to investigate about DREBIN’s explainability is how many top positive features are needed to explain the prediction. DREBIN’s publication suggests that Android apps can be explained using the top k features, but it does not specify how to choose this number. Nonetheless, its evaluation experiments have been conducted using \(k=5\) top positive features for each family.
To verify whether a specific number of top positive features can be fixed to explain the prediction of the apps, we examine the distribution of the positive features in the apps of the top 10 malware families of DREBIN_Like_data. Specifically, if the number of positive features varies remarkably between apps in general, and between samples of the same family in particular, it would be difficult to decide on a threshold for this number.
We provide in Figure 4 the number of positive features in the malware of the top 10 families of DREBIN_Like_data.
We notice that the number of positive features differs remarkably from one app to another, and it even differs between apps that belong to the same family. This number ranges from 4 to more than 100 features in the malware of the DREBIN_Like_data test set. If this number is fixed at 5, we are confident that the top five features adequately explain the apps containing a small number of positive features. However, the top five features might fail to capture the primary features that drive the prediction of apps that contain a large number of positive features, especially if the positive features have similar contributions. This problem is accentuated by the fact that the contribution of the features is not quantified. For instance, if an application contains many positive features, each with a small contribution, we cannot decide on a threshold for k to explain the prediction.
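The per-app counts plotted in Figure 4 can be obtained with a sketch of the following form, where clf is the fitted LinearSVC and X_family the (sparse) feature matrix of a family’s samples, densified here for simplicity.

```python
import numpy as np

def positive_feature_counts(clf, X_family):
    """Number of features present in each app that carry a positive SVM weight."""
    w = clf.coef_.ravel()
    contrib = np.asarray(X_family.todense()) * w   # w_j * x_ij per app/feature
    return (contrib > 0).sum(axis=1)               # one count per app
```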
4.3 How Consistent Are the Explanations Given to Samples of the Same Malware Family?
After showing that the number of positive features differs between samples of the same malware family, we seek to investigate the consistency of the top positive features within a family using DREBIN_Like_data. In our experiments, we use the top five positive features to explain the prediction of the apps, as suggested in DREBIN’s paper.
To verify our hypothesis, we visualise the distribution of the explanations in each of the top 10 families of DREBIN_Like_data. Figure 5 shows the distribution of these explanations. For each malware sample, we plot the top five positive features with different colours to distinguish their rank (in explaining the prediction of the app); the interpretation of the five colours is explained in the figure. Each row represents a positive feature that has been used to explain at least one app in the family, and each column represents a malware sample. Ideally, we should see five straight lines with the following ordering of colours from top to bottom: red, blue, green, pink, and brown. This ideal graph would reflect that the same five features are used to explain all malware apps that belong to the same family, and a line of a single colour would show that a feature has the same importance in the prediction of all the malware in the family. A visual reading of Figure 5 suggests that some families use the same features to explain most of their apps (e.g., the “droidkungfu” and “plankton” families). However, the overlapping colours in the sub-graphs show inconsistency of explanations.
We have also calculated the overlap between the top five positive features of the malware samples in the 10 families. We provide in Figure 6 the distribution of the results.
We notice that for some families (e.g., basebridge and batterydoctor), there is a significant overlap of top features across samples in the family: there is a consistency of explanations for the majority of samples in the family. In some other families (e.g., fakeinst and opfake), however, the top features are not consistent across family samples. These findings suggest that top features that are used for explanations are not systematically sufficient to characterise a malware family.
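The overlap reported in Figure 6 can be computed along these lines, where explanations is a list holding, for each sample of a family, the set of its top-five positive feature indices (e.g., obtained with the explain sketch above).

```python
import numpy as np
from itertools import combinations

def family_overlap(explanations, k=5):
    """Mean pairwise overlap (in [0, 1]) of top-k explanation sets in a family."""
    overlaps = [len(a & b) / k for a, b in combinations(explanations, 2)]
    return float(np.mean(overlaps)) if overlaps else 1.0
```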
We further report in Table 7 the total number of distinct top-five features used in the top 10 malware families of DREBIN_Like_data. For each family, we provide the number of samples (found in the test set) and the total number of distinct features used to explain its malware samples (the number of rows in the Figure 5 visualisations). We notice that the number of distinct top-five features used to explain a family’s samples clearly exceeds five. This number is generally higher when there are more samples in the malware family. The numbers in Table 7 suggest that DREBIN refers to a variety of features to “explain” samples within the same family.
5 Assessment of Drebin’s Learning Potential
In this section, we aim to evaluate DREBIN’s ability to capture the concept of malware families, both within a family (i.e., whether it groups together malware samples that belong to the same family) and across families (i.e., whether it separates malware that belong to different families).
5.1 Does DREBIN Capture the Concept of Malware Family (within the Family)?
To answer this question, we represent malware samples in DREBIN_Like_data by their embedded vectors, and we make the following hypothesis.
To verify our hypothesis, we rely on the Euclidean distance between feature vectors to examine whether DREBIN’s features group malware samples of the same family together in the n-dimensional space. Our aim is to compare the distributions of the distances and verify whether malware samples that belong to the same family have smaller distances (i.e., they are close to each other in the n-dimensional space) than samples from the 10 families combined. Specifically, for each family, we compare the Euclidean distances within the family and within the top 10 families of DREBIN_Like_data combined.
We plot in Figure 7 the distribution of the distances within the top 10 families combined, which we denote “all”, and the distribution of the distances within each of the 10 families of DREBIN_Like_data, which we denote “family”.
As we can see in Figure 7, the average Euclidean distances within a family are always smaller than the distances calculated using malware from the 10 families combined.
We have further compared the distribution of the distances within a family and within the top 10 families combined using the Mann-Whitney-Wilcoxon test [29, 46]. This test is used to verify whether two data distributions are identical.
For each of the top 10 families, we calculate the p-value of the Mann-Whitney-Wilcoxon test using the distances within the family and the distances within the 10 families combined. In the 10 Mann-Whitney-Wilcoxon tests, the p-value is smaller than \(9.42 \times 10^{-59}\). This result indicates that the null hypothesis is rejected and that the difference between the distributions is statistically significant. Therefore, we can conclude that the distributions of the distances within a family and within the 10 families combined are different.
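The distance comparison and the statistical test can be sketched as follows, assuming X_family and X_all are dense feature matrices of the family’s malware samples and of the top 10 families combined, respectively.

```python
from scipy.spatial.distance import pdist
from scipy.stats import mannwhitneyu

def compare_distances(X_family, X_all):
    """Pairwise Euclidean distances within a family vs. the 10 families
    combined, together with the Mann-Whitney-Wilcoxon p-value."""
    d_family = pdist(X_family, metric="euclidean")
    d_all = pdist(X_all, metric="euclidean")
    _, p_value = mannwhitneyu(d_family, d_all, alternative="two-sided")
    return d_family.mean(), d_all.mean(), p_value
```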
5.2 Does DREBIN Capture the Concept of Malware Family (across Families)?
After evaluating DREBIN’s ability to assemble malware samples within the same family, we seek to investigate if, in its feature space, DREBIN separates samples from different families.
To verify our hypothesis, we compute, for each of the top 10 malware families in DREBIN_Like_data, the Euclidean distances within the family (as in Section 5.1) and across families. We denote by (\(F_i\)–\(F_j\)) the distances across family \(i\) and family \(j\), for \(1 \le i \le 10\) and \(1 \le j \le 10\). As in the previous section, we also calculate the distances within all top 10 families combined, denoted “all”. We provide in Figure 8 the distribution of these distances for the top 10 families. We draw the median of the distances within the family as a straight blue line to facilitate the comparison. Overall, the average distances within a family are smaller than the distances across families.
For each of the top 10 families, we have also compared the distribution of the distances within the family and across families using the Mann-Whitney-Wilcoxon test, conducted as explained in Section 5.1. For all the families, the p-value is smaller than \(1.29 \times 10^{-10}\). The null hypothesis is thus rejected, which indicates that the difference between the distributions of the distances within families and across families is statistically significant.
6 Discussion
Today, in the literature, a malware detection approach is considered to “work” when its assessment exhibits high detection scores. However, these scores have been shown to be biased by datasets [35] or by the evaluation methodology [3, 4]. In reality, practitioners expect more in-depth analyses that provide confidence or guarantees that approaches work. A first step in this direction was implemented by DREBIN, which proposes explanations for its predictions, towards facilitating manual validation by analysts.
Given the importance of DREBIN as a milestone in the literature, we proposed to dive deep into its inner workings to investigate its features, its explanations, and so on. Unfortunately, in the absence of guidelines in the literature, our explorations were guided by our past experiences and intuitions. Overall, our study findings are somewhat inconclusive, which highlights that answering “why it works” is a fundamentally difficult and unresolved problem.
6.1 Answering “Why It Works”
A large body of the malware detection literature reports high performance scores using ML. Our in-depth analysis in this work revealed that it is hard to comprehensively characterise the reasons behind the good performance of a state-of-the-art classifier: (1) a substantial number of the features used appear to be redundant, and (2) the detection explanations are not always consistent for samples of the same family.
Our first attempt at understanding “why a classifier works” calls for a more focused research initiative around a framework that would provide tools and techniques to thoroughly assess malware detectors beyond detection scores. Such a framework would define the axes of analysis that each new malware detector needs to investigate beyond ML detection scores. It would suggest requirements and analyses that are crucial to gain insights into the inner workings of Android malware detection approaches, scope their properties, highlight their strengths, and uncover their limitations (or even pitfalls). The overall ambition is to ensure that researchers can associate their reported performance with specific key design choices in their contributions.
6.2 Implications of Our Findings
Our study revealed that many of the features that are sufficient to provide reasonable performance for the DREBIN classifier, and indeed many of the relevant features that contribute to its performance, are id-features. We postulate that this may pose a generalisation problem.
We noted that a large proportion of malware can be detected using a single feature for two of our datasets, which further highlights both (1) the fundamental importance of evaluation datasets (in particular, of the diversity in these datasets), and (2) the necessity to assess whether features capture actual indicators of maliciousness or spurious correlations instead.
The consistency, or lack thereof, of top features for predicting samples in a same family raises questions on the relevance of DREBIN explanations.
8 Conclusion
We have presented an exploratory analysis of the DREBIN malware detector with the aim of uncovering insights about “how/why it works”. Our study has extensively investigated the discriminatory power of the huge number of features used by DREBIN. We have identified DREBIN’s most important features and shown that its discriminatory features contain id-features. We have also proposed a set of experiments both to evaluate the consistency of the explanations provided by DREBIN for malware samples of the same family and to assess DREBIN’s ability to capture the concept of malware families.
Overall:
• DREBIN, given the stability of its performance across datasets, is a strong reference that researchers should consider when assessing the performance of malware detection approaches.
• Id-features are predominant within the feature set of DREBIN. Their importance in its decisions suggests that more research is needed on feature engineering that captures maliciousness in a more abstract (and hence generalisable) way. In the meantime, researchers should provide analyses of the presence of id-features and their impact on the reported performance. Further research should also investigate techniques for exploiting id-features while mitigating their generalisation issues.
• The shortcomings of our first-step analyses suggest that the community needs further research on the design of quality metrics for classifiers, beyond the classical quantitative metrics of precision and recall.
Our work thus calls for more research into techniques and tools for analysing the performance of malware detection approaches. Although it is important to report high performance scores for state-of-the-art classifiers, the literature should also provide principled, systematic assessments of the inner workings of these approaches.
Malware detection has become a critical concern that challenges managers, developers, and end users. Having tools that can demonstrate that a malware detector has sound inner workings is thus paramount. Consequently, developing black-box systems with 100% performance scores in the lab no longer inspires confidence. Instead, more attention should be given to understanding how these systems work and to what extent they are able to capture the malicious character of apps, so that they can be trusted and deployed in real-world settings. Our work is an initial step towards a research roadmap for the assessment of malware detection classifiers. We hope that this research direction will be embraced by the community.