Abstract
Machine learning (ML) models are often adopted in malware detection systems. To maintain detection performance in such ML-based systems, the models must be updated with new data to minimize the influence of data variation over time. After an update, the new model is commonly validated using detection accuracy as a metric. However, accuracy does not convey detailed information, such as changes in the features used for prediction. Such information is beneficial for avoiding unexpected updates, such as overfitting or noneffective updates. We therefore propose a method for understanding ML model updates in malware detection systems by using a feature attribution method called Shapley additive explanations (SHAP), which interprets the output of an ML model by assigning an importance value, called a SHAP value, to each feature. Our method identifies patterns of feature attribution changes that cause a change in the prediction. We first obtain, for each sample, the change in feature attributions between the models before and after the update. We then cluster these changes to obtain patterns that are common to multiple samples. We conduct experiments using an open dataset of Android malware and demonstrate that our method can identify the causes of performance changes, such as overfitting or noneffective updates.
References
Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction, selection and fusion for effective malware family classification. In: Proceedings of the 6th ACM Conference on Data and Application Security and Privacy, pp. 183–194 (2016)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Parzen, E., Tanabe, K., Kitagawa, G. (eds.) Selected Papers of Hirotugu Akaike, pp. 199–213. Springer, New York (1998). https://doi.org/10.1007/978-1-4612-1694-0_15
Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y.: Androzoo: collecting millions of android apps for the research community. In: Proceedings of the 13th IEEE/ACM Working Conference on Mining Software Repositories, pp. 468–471 (2016)
Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K.: Drebin: effective and explainable detection of android malware in your pocket. In: Proceedings of the 2014 Network and Distributed System Security Symposium (2014)
Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Single and multi-sequence deep learning models for short and medium term electric load forecasting. Energies 12(1), 149 (2019)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Canali, D., Cova, M., Vigna, G., Kruegel, C.: Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the 20th International Conference on World Wide Web, pp. 197–206 (2011)
Friedman, J.H., Meulman, J.J.: Multiple additive regression trees with application in epidemiology. Stat. Med. 22(9), 1365–1381 (2003)
Jaccard, P.: The distribution of the flora in the alpine zone. 1. New Phytologist 11(2), 37–50 (1912)
Jordaney, R., et al.: Transcend: detecting concept drift in malware classification models. In: Proceedings of the 26th USENIX Security Symposium, pp. 625–642 (2017)
Karlaš, B., et al.: Building continuous integration services for machine learning. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2407–2415 (2020)
Lundberg, S.M., et al.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)
Lundberg, S.M., Erion, G.G., Lee, S.I.: Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 (2018)
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)
Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1245–1254 (2009)
Miller, B., et al.: Reviewer integration and performance measurement for malware detection. In: Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 122–141 (2016)
Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L.: TESSERACT: eliminating experimental bias in malware classification across space and time. In: Proceedings of the 28th USENIX Security Symposium, pp. 729–746 (2019)
Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
Sood, G.: virustotal: R Client for the virustotal API (2017). R package version 0.2.1
Tsymbal, A.: The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, vol. 106, no. 2, p. 58 (2004)
A Detailed Experimental Setup
Dataset. Each training dataset contained approximately 3,800 benign samples and 420 malicious samples, and each test dataset contained approximately 5,000 benign samples and 550 malicious samples. The number of samples in each dataset is shown in Table 9.
Feature. To extract features in our experiments, we used Drebin [4], a lightweight method for detecting malicious APK files based on broad static analysis. Features are extracted from the manifest and the disassembled dex code of the APK file. From these, Drebin collects discriminative strings, such as permissions, API calls, and network addresses. Drebin extracts eight sets of strings: four from the manifest and four from the dex code.
1. Hardware components
2. Requested permissions
3. App components
4. Filtered intents
5. Restricted API calls
6. Used permissions
7. Suspicious API calls
8. Network addresses
The features are embedded into an \(N\)-dimensional vector space, where each element is either 0 or 1. Each element corresponds to a string, with 1 representing the presence of the string and 0 representing its absence. The extracted feature vector \(\mathbf{x}\) is denoted as \(\mathbf{x} = (x_1, x_2, \ldots, x_N)\), where \(x_i \in \{0, 1\}\).
The feature vector can be used as input for a machine-learning model.
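The embedding described above can be sketched in a few lines of Python. This is a minimal illustration, not the Drebin implementation: the example strings, the `prefix::value` naming, and the helper names `build_vocabulary` and `embed` are assumptions introduced here, and the vocabulary is assumed to be the sorted union of all strings observed across the apps.

```python
def build_vocabulary(apps):
    """Return the sorted list of all distinct strings; its length is N."""
    return sorted(set().union(*apps.values()))


def embed(app_strings, vocabulary):
    """Map one app's set of strings to a 0/1 vector over the vocabulary."""
    return [1 if s in app_strings else 0 for s in vocabulary]


# Hypothetical extracted strings for two apps (illustrative only,
# not taken from the dataset used in the experiments).
apps = {
    "app_a": {"permission::android.permission.SEND_SMS", "api_call::getDeviceId"},
    "app_b": {"permission::android.permission.INTERNET", "url::example.com"},
}

vocabulary = build_vocabulary(apps)
vectors = {name: embed(strings, vocabulary) for name, strings in apps.items()}

for name, vector in vectors.items():
    print(name, vector)
```

Each position in the resulting vectors corresponds to one vocabulary string, so two apps can be compared element by element. With a realistic vocabulary the vectors are mostly zeros, so a sparse representation would likely be preferable to plain lists.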
Classification Models. Our experiments use random forest [6], which is well known for its excellent classification performance and can be applied to many tasks, including malware detection. Random forest is an ensemble of decision trees. Each decision tree is built using a randomly sampled subset of data and features. By creating an ensemble of many decision trees, random forest achieves high classification performance even when the dimensionality of the feature vectors exceeds the number of samples. Furthermore, the SHAP package [12] associated with Ref. [13] provides a high-speed algorithm called TreeExplainer for tree ensemble methods, including random forests.
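For intuition about the kind of per-feature attribution that SHAP assigns, the following sketch computes Shapley values by brute force on a toy function. It is not the TreeExplainer algorithm and does not use the SHAP package: `toy_model`, the all-zero baseline, and the convention that an absent feature takes its baseline value are simplifying assumptions made for illustration, and the enumeration is exponential in the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` come from x; all others stay at the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical score over three binary features (not a trained model)."""
    return 0.1 + 0.5 * z[0] + 0.3 * z[1] * z[2]


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the output at `x` and the output at the baseline, which is the additivity property of Shapley values. The interaction term is split equally between the second and third features, while the first feature receives its own additive contribution.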
Hyperparameter Optimization. When training random forest models, we conduct a grid search for each model to determine the best combination of parameters among the following candidates:
1. Number of trees: 10, 100, 200, 300, 400.
2. Maximum depth of each tree: 10, 100, 300, 500.
3. Ratio of features used for each tree: 0.02, 0.05, 0.07, 0.1, 0.2.
4. Minimum number of samples required at a leaf node: 5, 7, 10, 20.
Each candidate combination is validated using five-fold cross-validation. Specifically, we calculate the average of the five AUC scores for each combination and select the combination with the highest average AUC score as the result of the grid search.
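The search procedure described above can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' code: the dictionary keys, the helper names, and the `fold_auc` callback (which is expected to train a model for one parameter combination and return one AUC score per fold) are hypothetical, and how each key maps onto a particular library's parameters is left open.

```python
from itertools import product
from statistics import mean

# Candidate values listed above; the key names are hypothetical.
grid = {
    "n_trees": [10, 100, 200, 300, 400],
    "max_depth": [10, 100, 300, 500],
    "feature_ratio": [0.02, 0.05, 0.07, 0.1, 0.2],
    "min_samples_leaf": [5, 7, 10, 20],
}


def enumerate_grid(grid):
    """Return every combination of candidate values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def select_best(candidates, fold_auc):
    """Pick the candidate with the highest mean AUC across folds.

    fold_auc(params) is a user-supplied callback (an assumption of this
    sketch) that returns a list with one AUC score per validation fold.
    """
    scored = [(mean(fold_auc(params)), params) for params in candidates]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_params, best_score


candidates = enumerate_grid(grid)
print(len(candidates))  # 5 * 4 * 5 * 4 = 400 combinations
```

With five folds, an exhaustive search over this grid trains 2,000 models per dataset, which is one reason to keep the candidate lists short.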
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Fan, Y., Shibahara, T., Ohsita, Y., Chiba, D., Akiyama, M., Murata, M. (2021). Understanding Update of Machine-Learning-Based Malware Detection by Clustering Changes in Feature Attributions. In: Nakanishi, T., Nojima, R. (eds) Advances in Information and Computer Security. IWSEC 2021. Lecture Notes in Computer Science, vol 12835. Springer, Cham. https://doi.org/10.1007/978-3-030-85987-9_6
DOI: https://doi.org/10.1007/978-3-030-85987-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85986-2
Online ISBN: 978-3-030-85987-9
eBook Packages: Computer Science, Computer Science (R0)