Understanding Update of Machine-Learning-Based Malware Detection by Clustering Changes in Feature Attributions

Advances in Information and Computer Security (IWSEC 2021)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12835)

Abstract

Machine learning (ML) models are often adopted in malware detection systems. To maintain detection performance in such ML-based systems, updating ML models with new data is crucial for minimizing the influence of data variation over time. After an update, the new model is commonly validated using detection accuracy as a metric. However, accuracy does not convey detailed information, such as changes in the features used for prediction. Such information is beneficial for avoiding unexpected updates, such as overfitting or noneffective updates. We therefore propose a method for understanding ML model updates in malware detection systems by using a feature attribution method called Shapley additive explanations (SHAP), which interprets the output of an ML model by assigning an importance value, called a SHAP value, to each feature. Our method identifies patterns of feature attribution changes that cause a change in the prediction. We first obtain, for each sample, the change in its feature attributions between the models before and after the update. We then cluster these changes to obtain patterns that are common to multiple samples. We conduct experiments using an open dataset of Android malware and demonstrate that our method can identify the causes of performance changes, such as overfitting or noneffective updates.

References

  1. Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction, selection and fusion for effective malware family classification. In: Proceedings of the 6th ACM Conference on Data and Application Security and Privacy, pp. 183–194 (2016)

  2. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Parzen, E., Tanabe, K., Kitagawa, G. (eds.) Selected Papers of Hirotugu Akaike, pp. 199–213. Springer, New York (1998). https://doi.org/10.1007/978-1-4612-1694-0_15

  3. Allix, K., Bissyandé, T.F., Klein, J., Le Traon, Y.: Androzoo: collecting millions of android apps for the research community. In: Proceedings of the 13th IEEE/ACM Working Conference on Mining Software Repositories, pp. 468–471 (2016)

  4. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K.: Drebin: effective and explainable detection of android malware in your pocket. In: Proceedings of the 2014 Network and Distributed System Security Symposium (2014)

  5. Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Single and multi-sequence deep learning models for short and medium term electric load forecasting. Energies 12(1), 149 (2019)

  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  7. Canali, D., Cova, M., Vigna, G., Kruegel, C.: Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the 20th International Conference on World Wide Web, pp. 197–206 (2011)

  8. Friedman, J.H., Meulman, J.J.: Multiple additive regression trees with application in epidemiology. Stat. Med. 22(9), 1365–1381 (2003)

  9. Jaccard, P.: The distribution of the flora in the alpine zone. 1. New Phytologist 11(2), 37–50 (1912)

  10. Jordaney, R., et al.: Transcend: detecting concept drift in malware classification models. In: Proceedings of the 26th USENIX Security Symposium, pp. 625–642 (2017)

  11. Karlaš, B., et al.: Building continuous integration services for machine learning. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2407–2415 (2020)

  12. Lundberg, S.M., et al.: From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2(1), 56–67 (2020)

  13. Lundberg, S.M., Erion, G.G., Lee, S.I.: Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 (2018)

  14. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)

  15. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1245–1254 (2009)

  16. Miller, B., et al.: Reviewer integration and performance measurement for malware detection. In: Proceedings of the 13th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 122–141 (2016)

  17. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  18. Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L.: TESSERACT: eliminating experimental bias in malware classification across space and time. In: Proceedings of the 28th USENIX Security Symposium, pp. 729–746 (2019)

  19. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)

  20. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  21. Sood, G.: virustotal: R Client for the virustotal API (2017). R package version 0.2.1

  22. Tsymbal, A.: The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, vol. 106, no. 2, p. 58 (2004)

Author information

Corresponding author

Correspondence to Yuichi Ohsita.

A Detailed Experimental Setup

Dataset. Each training dataset had a similar size of approximately 3,800 benign samples and 420 malicious samples, and each test dataset contained approximately 5,000 benign samples and 550 malicious samples. The number of samples is shown in Table 9.

Feature. To extract features in our experiments, we used Drebin [4], a lightweight method for detecting malicious APK files based on broad static analysis. Features are extracted from the manifest and the disassembled dex code of the APK file. From these, Drebin collects discriminative strings, such as permissions, API calls, and network addresses. Drebin extracts eight sets of strings: four from the manifest and four from the dex code.

  1. Hardware components
  2. Requested permissions
  3. App components
  4. Filtered intents
  5. Restricted API calls
  6. Used permissions
  7. Suspicious API calls
  8. Network addresses

The features are embedded into an N-dimensional vector space, where each element is either 0 or 1. Each element corresponds to a string, with 1 representing the presence of the string and 0 representing its absence. The extracted feature vector \(\mathbf{x}\) is denoted as

$$\begin{aligned} \mathbf{x}=\left( \cdots ~0~1~\cdots ~0~1 ~\cdots \right) . \end{aligned}$$

The feature vector can be used as input for a machine-learning model.
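
As a minimal sketch of this embedding, the following Python snippet builds a vocabulary of strings and maps each app to a 0/1 vector. The feature strings, the `build_vocabulary` and `embed` helper names, and the sorted index order are illustrative assumptions and are not taken from the paper or from the Drebin implementation.

```python
from typing import Dict, Iterable, List, Set


def build_vocabulary(apps: Iterable[Set[str]]) -> Dict[str, int]:
    """Assign one vector index to every distinct string seen across the apps."""
    all_strings = sorted(set().union(*apps))
    return {s: i for i, s in enumerate(all_strings)}


def embed(app_strings: Set[str], vocab: Dict[str, int]) -> List[int]:
    """Return an N-dimensional 0/1 vector: 1 if the string is present, else 0."""
    x = [0] * len(vocab)
    for s in app_strings:
        idx = vocab.get(s)
        if idx is not None:  # strings outside the vocabulary are ignored
            x[idx] = 1
    return x


# Hypothetical feature strings for two apps (invented for illustration).
apps = [
    {"permission::SEND_SMS", "api_call::getDeviceId"},
    {"permission::INTERNET", "url::example.com"},
]
vocab = build_vocabulary(apps)
vectors = [embed(a, vocab) for a in apps]
```

Here N equals the number of distinct strings in the vocabulary, so the vectors are typically very sparse when many apps are embedded together.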

Classification Models. Our experiments use random forest [6], which is well known for its excellent classification performance and can be applied to many tasks, including malware detection. Random forest is an ensemble of decision trees. Each decision tree is built using a randomly sampled subset of data and features. By creating an ensemble of many decision trees, random forest achieves high classification performance even when the dimensions of feature vectors exceed the dataset size. Furthermore, the SHAP package [12] associated with Ref. [13] provides a high-speed algorithm called TreeExplainer for tree ensemble methods, including random forests.
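
The random sampling of data and features described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `draw_tree_view` is a hypothetical helper, the feature subset is drawn once per tree as in the text, and library implementations commonly re-draw the feature subset at each split instead.

```python
import random


def draw_tree_view(n_samples, n_features, feature_ratio, rng):
    """Pick the sample rows and feature columns that one tree is trained on."""
    # Bootstrap: draw rows with replacement, as many as the dataset has.
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Random subspace: keep a fixed fraction of the feature columns.
    k = max(1, round(feature_ratio * n_features))
    cols = sorted(rng.sample(range(n_features), k))
    return rows, cols


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Three trees over a toy dataset with 20 samples and 100 features.
views = [
    draw_tree_view(n_samples=20, n_features=100, feature_ratio=0.05, rng=rng)
    for _ in range(3)
]
```

Because every tree sees only a small fraction of the columns, each individual tree stays manageable even when the feature dimension is larger than the number of samples.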

Table 9. Number of samples in each dataset

Hyperparameter Optimization. When training random forest models, we conduct a grid search for each model to determine the best combination of parameters among the following candidates:

  1. Number of trees: 10, 100, 200, 300, 400.
  2. Maximum depth of each tree: 10, 100, 300, 500.
  3. Ratio of features used for each tree: 0.02, 0.05, 0.07, 0.1, 0.2.
  4. Minimum number of samples required at a leaf node: 5, 7, 10, 20.

Each candidate combination is validated using five-fold cross-validation. Specifically, we calculate the average of the five AUC scores for each combination and select the combination with the highest average AUC score as the result of the grid search.
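
The search procedure can be sketched in plain Python as below. The candidate values come from the list above; everything else is an assumption made for illustration: the parameter names, the unstratified interleaved fold split, and `toy_score`, which is a placeholder standing in for training a random forest on the training fold and computing its AUC on the validation fold.

```python
from itertools import product
from statistics import mean

# Candidate values as listed in the text (parameter names are illustrative).
GRID = {
    "n_trees": [10, 100, 200, 300, 400],
    "max_depth": [10, 100, 300, 500],
    "feature_ratio": [0.02, 0.05, 0.07, 0.1, 0.2],
    "min_samples_leaf": [5, 7, 10, 20],
}


def k_fold_indices(n_samples, k=5):
    """Yield k (train, validation) index lists covering range(n_samples)."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val


def grid_search(score_fold, n_samples, grid=GRID, k=5):
    """Return the combination with the highest AUC averaged over k folds."""
    names = list(grid)
    best_params, best_auc = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        avg_auc = mean(
            score_fold(params, train, val)
            for train, val in k_fold_indices(n_samples, k)
        )
        if avg_auc > best_auc:  # ties keep the earlier combination
            best_params, best_auc = params, avg_auc
    return best_params, best_auc


def toy_score(params, train_idx, val_idx):
    """Placeholder AUC; a real scorer would fit and evaluate a model here."""
    return (1.0
            - abs(params["n_trees"] - 200) / 1000
            - abs(params["feature_ratio"] - 0.07))


best_params, best_auc = grid_search(toy_score, n_samples=50)
```

The grid contains 5 × 4 × 5 × 4 = 400 combinations, so a full search with five folds trains 2,000 models per grid search. With the toy scorer, the search settles on 200 trees and a feature ratio of 0.07, which only reflects how the placeholder was defined.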

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Fan, Y., Shibahara, T., Ohsita, Y., Chiba, D., Akiyama, M., Murata, M. (2021). Understanding Update of Machine-Learning-Based Malware Detection by Clustering Changes in Feature Attributions. In: Nakanishi, T., Nojima, R. (eds) Advances in Information and Computer Security. IWSEC 2021. Lecture Notes in Computer Science, vol 12835. Springer, Cham. https://doi.org/10.1007/978-3-030-85987-9_6

  • DOI: https://doi.org/10.1007/978-3-030-85987-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85986-2

  • Online ISBN: 978-3-030-85987-9

  • eBook Packages: Computer Science, Computer Science (R0)
