Understanding Update of Machine-Learning-Based Malware Detection by Clustering Changes in Feature Attributions

Fan, Yun; Shibahara, Toshiki; Ohsita, Yuichi; Chiba, Daiki; Akiyama, Mitsuaki; Murata, Masayuki

doi:10.1007/978-3-030-85987-9_6

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12835)

Included in the following conference series:

International Workshop on Security

Abstract

Machine learning (ML) models are often adopted in malware detection systems. To ensure the detection performance in such ML-based systems, updating ML models with new data is crucial for minimizing the influence of data variation over time. After an update, validating the new model is commonly done using the detection accuracy as a metric. However, the accuracy does not include detailed information, such as changes in the features used for prediction. Such information is beneficial for avoiding unexpected updates, such as overfitting or noneffective updates. We, therefore, propose a method for understanding ML model updates in malware detection systems by using a feature attribution method called Shapley additive explanations (SHAP), which interprets the output of an ML model by assigning an importance value called a SHAP value to each feature. In our method, we identify patterns of feature attribution changes that cause a change in the prediction. In this method, we first obtain the feature attributions for each sample, which change before and after the update. Then, we obtain the patterns of the changes in the feature attributions that are common for multiple samples by clustering the changes in the feature attributions. In this study, we conduct experiments using an open dataset of Android malware and demonstrate that our method can identify the causes of performance changes, such as overfitting or noneffective updates.
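
For reference, the additivity property of SHAP makes the link between attribution changes and prediction changes explicit. The notation below ($f_{\text{old}}$ and $f_{\text{new}}$ for the models before and after the update, $\phi_i$ for the SHAP value of feature $i$, $\phi_0$ for the base value, and $\Delta\phi_i$ for the change) is our own shorthand and is not taken from the paper:

$$\begin{aligned} f(\mathbf{x})&=\phi_0(f)+\sum_{i=1}^{N}\phi_i(f,\mathbf{x}), \\ \Delta\phi_i(\mathbf{x})&=\phi_i(f_{\text{new}},\mathbf{x})-\phi_i(f_{\text{old}},\mathbf{x}), \\ f_{\text{new}}(\mathbf{x})-f_{\text{old}}(\mathbf{x})&=\Delta\phi_0+\sum_{i=1}^{N}\Delta\phi_i(\mathbf{x}), \end{aligned}$$

where $\Delta\phi_0=\phi_0(f_{\text{new}})-\phi_0(f_{\text{old}})$. Under this reading, the per-sample vector of changes $\Delta\phi_i(\mathbf{x})$ is the kind of quantity that would be clustered to find patterns shared by multiple samples.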

References

1. Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction, selection and fusion for effective malware family classification. In: Proceedings of the 6th ACM Conference on Data and Application Security and Privacy, pp. 183–194 (2016)
2. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Parzen, E., Tanabe, K., Kitagawa, G. (eds.) Selected Papers of Hirotugu Akaike, pp. 199–213. Springer, New York (1998). https://doi.org/10.1007/978-1-4612-1694-0_15
3. Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y.: Androzoo: collecting millions of android apps for the research community. In: Proceedings of the 13th IEEE/ACM Working Conference on Mining Software Repositories, pp. 468–471 (2016)
4. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K.: Drebin: effective and explainable detection of android malware in your pocket. In: Proceedings of the 2014 Network and Distributed System Security Symposium (2014)
5. Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Single and multi-sequence deep learning models for short and medium term electric load forecasting. Energies 12(1), 149 (2019)
6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
7. Canali, D., Cova, M., Vigna, G., Kruegel, C.: Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the 20th International Conference on World Wide Web, pp. 197–206 (2011)
8. Friedman, J.H., Meulman, J.J.: Multiple additive regression trees with application in epidemiology. Stat. Med. 22(9), 1365–1381 (2003)
9. Jaccard, P.: The distribution of the flora in the alpine zone. 1. New Phytologist 11(2), 37–50 (1912)
10. Jordaney, R., et al.: Transcend: detecting concept drift in malware classification models. In: Proceedings of the 26th USENIX Security Symposium, pp. 625–642 (2017)
11. Karlaš, B., et al.: Building continuous integration services for machine learning. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2407–2415 (2020)
12. Lundberg, S.M., et al.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)
13. Lundberg, S.M., Erion, G.G., Lee, S.I.: Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 (2018)
14. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)
15. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1245–1254 (2009)
16. Miller, B., et al.: Reviewer integration and performance measurement for malware detection. In: Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 122–141 (2016)
17. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
18. Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L.: TESSERACT: eliminating experimental bias in malware classification across space and time. In: Proceedings of the 28th USENIX Security Symposium, pp. 729–746 (2019)
19. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
20. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
21. Sood, G.: virustotal: R Client for the virustotal API (2017). R package version 0.2.1
22. Tsymbal, A.: The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, vol. 106, no. 2, p. 58 (2004)

Author information

Authors and Affiliations

Osaka University, Osaka, Japan
Yun Fan, Yuichi Ohsita & Masayuki Murata
NTT, Tokyo, Japan
Toshiki Shibahara, Daiki Chiba & Mitsuaki Akiyama

Corresponding author

Correspondence to Yuichi Ohsita.

Editor information

Editors and Affiliations

Hiroshima University, Hiroshima, Japan
Toru Nakanishi
National Institute of Information and Communications Technology, Tokyo, Japan
Ryo Nojima

A Detailed Experimental Setup

Dataset. Each training dataset had a similar size of approximately 3,800 benign samples and 420 malicious samples, and each test dataset contained approximately 5,000 benign samples and 550 malicious samples. The number of samples is shown in Table 9.

Feature. To extract features in our experiments, we used Drebin [4], a lightweight method for detecting malicious APK files based on broad static analysis. Features are extracted from the manifest and the disassembled dex code of the APK file. From these, Drebin collects discriminative strings, such as permissions, API calls, and network addresses. Drebin extracts eight sets of strings: four from the manifest and four from the dex code.

1. Hardware components
2. Requested permissions
3. App components
4. Filtered intents
5. Restricted API calls
6. Used permissions
7. Suspicious API calls
8. Network addresses

The features are embedded into an N-dimensional vector space, where each element is either 0 or 1. Each element corresponds to a string, with 1 representing the presence of the string and 0 representing its absence. The extracted feature vector $\mathbf{x}$ is denoted as

$$\begin{aligned} \mathbf{x}=\left( \cdots ~0~1~\cdots ~0~1 ~\cdots \right) . \end{aligned}$$

The feature vector can be used as input for a machine-learning model.
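
The embedding described above can be illustrated with a minimal Python sketch. It is not Drebin's implementation: the helper names `build_vocabulary` and `embed` are hypothetical, and the feature strings are made-up placeholders rather than Drebin's exact string format.

```python
from typing import Iterable, List, Sequence


def build_vocabulary(apps: Iterable[Iterable[str]]) -> List[str]:
    """Collect every distinct string seen across the apps, in sorted order."""
    return sorted({s for app in apps for s in app})


def embed(app_strings: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return a 0/1 vector: element i is 1 if vocabulary[i] occurs in the app."""
    present = set(app_strings)
    return [1 if s in present else 0 for s in vocabulary]


# Hypothetical extracted strings for two apps (illustrative values only).
apps = [
    {"permission::SEND_SMS", "api_call::getDeviceId"},
    {"permission::INTERNET", "permission::SEND_SMS"},
]

# N is the vocabulary size; each app becomes an N-dimensional binary vector.
vocabulary = build_vocabulary(apps)
vectors = [embed(app, vocabulary) for app in apps]
```

Here the vocabulary is built from the apps at hand, so N grows with the number of distinct strings observed; this is one reason the dimensionality can exceed the number of samples.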

Classification Models. Our experiments use random forest [6], which is well known for its excellent classification performance and can be applied to many tasks, including malware detection. Random forest is an ensemble of decision trees. Each decision tree is built using a randomly sampled subset of data and features. By creating an ensemble of many decision trees, random forest achieves high classification performance even when the dimensionality of the feature vectors exceeds the number of samples in the dataset. Furthermore, the SHAP package [12] associated with Ref. [13] provides a high-speed algorithm called TreeExplainer for tree ensemble methods, including random forests.

Table 9. Number of samples in each dataset

Hyperparameter Optimization. When training random forest models, we conduct a grid search for each model to determine the best combination of parameters among the following candidates:

1. Number of trees: 10, 100, 200, 300, 400.
2. Maximum depth of each tree: 10, 100, 300, 500.
3. Ratio of features used for each tree: 0.02, 0.05, 0.07, 0.1, 0.2.
4. Minimum number of samples required at a leaf node: 5, 7, 10, 20.

Each candidate combination is validated using five-fold cross-validation. Specifically, we calculate the average of the five AUC scores for each combination and select the combination with the highest average AUC score as the result of the grid search.
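
The selection procedure above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the dictionary keys are our own labels for the four parameters, and `toy_fold_aucs` is a hypothetical stand-in that returns five made-up scores instead of training and evaluating random forest models on five folds.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]

# Candidate values listed in the text; the key names are our own labels.
CANDIDATES: Dict[str, Sequence[float]] = {
    "n_trees": [10, 100, 200, 300, 400],
    "max_depth": [10, 100, 300, 500],
    "feature_ratio": [0.02, 0.05, 0.07, 0.1, 0.2],
    "min_samples_leaf": [5, 7, 10, 20],
}


def all_combinations(candidates: Dict[str, Sequence[float]]) -> List[Params]:
    """Enumerate every combination of candidate values."""
    keys = list(candidates)
    return [dict(zip(keys, values))
            for values in product(*(candidates[k] for k in keys))]


def grid_search(
    candidates: Dict[str, Sequence[float]],
    fold_aucs: Callable[[Params], Sequence[float]],
) -> Tuple[Params, float]:
    """Return the combination with the highest mean AUC over the folds."""
    best_params: Params = {}
    best_score = float("-inf")
    for params in all_combinations(candidates):
        score = mean(fold_aucs(params))  # average of the five fold scores
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


def toy_fold_aucs(params: Params) -> List[float]:
    """Hypothetical placeholder, NOT a real model: returns five fake AUCs."""
    base = (0.9 - abs(params["n_trees"] - 200) / 1000
            - abs(params["feature_ratio"] - 0.05))
    return [base - 0.01, base, base + 0.01, base, base]


best, best_auc = grid_search(CANDIDATES, toy_fold_aucs)
```

In a real run, `toy_fold_aucs` would be replaced by a function that trains a model on four folds and computes the AUC on the held-out fold, five times per combination. The grid contains 5 × 4 × 5 × 4 = 400 combinations, so this amounts to 2,000 training runs per model.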

Cite this paper

Fan, Y., Shibahara, T., Ohsita, Y., Chiba, D., Akiyama, M., Murata, M. (2021). Understanding Update of Machine-Learning-Based Malware Detection by Clustering Changes in Feature Attributions. In: Nakanishi, T., Nojima, R. (eds) Advances in Information and Computer Security. IWSEC 2021. Lecture Notes in Computer Science, vol 12835. Springer, Cham. https://doi.org/10.1007/978-3-030-85987-9_6

DOI: https://doi.org/10.1007/978-3-030-85987-9_6
Published: 27 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85986-2
Online ISBN: 978-3-030-85987-9
eBook Packages: Computer Science, Computer Science (R0)
