Abstract
In malware detection, dynamic analysis extracts the runtime behavior of malware samples in a controlled environment, while static analysis extracts features using reverse engineering tools. The former faces the challenges of anti-virtualization and evasive behavior of malware samples, whereas the latter faces the challenge of code obfuscation. To tackle these drawbacks, prior works proposed to develop detection models that aggregate dynamic and static features, thus leveraging the advantages of both approaches. However, simply concatenating dynamic and static features raises an issue of imbalanced contribution to the performance of malware detection models due to the heterogeneous dimensions of the feature vectors. Moreover, dynamic analysis is a time-consuming task that requires a secure environment, leading to detection delays and high costs for maintaining the analysis infrastructure. In this paper, we first introduce a method of constructing aggregated features by concatenating latent features learned through deep learning, with equally contributing dimensions. We then develop a knowledge distillation technique to transfer knowledge learned from the aggregated features by a teacher model to a student model trained only on static features, and use the trained student model to detect new malware samples. We carry out extensive experiments with a dataset of \(86\,709\) samples including both benign and malware samples. The experimental results show that the teacher model trained on aggregated features constructed by our method outperforms the state-of-the-art models with an improvement of up to \(2.38\%\) in detection accuracy. The distilled student model not only achieves high performance (\(97.81\%\) in terms of accuracy), close to that of the teacher model, but also significantly reduces the detection time (from \(70\,046.6\) ms to \(194.9\) ms) by not requiring dynamic analysis.
Notes
- 1.
Dynamic/static features are features extracted with the dynamic/static analysis method.
- 2.
In [26], API-ARG was reported under a different name, Method2.
- 3.
The hash values of the samples will be provided on request for experiment reproducibility.
- 4.
After experimenting with several optimizers, e.g., Adam, AdamW, RMSprop, SGD, and SGD with momentum, we observed that SGD with momentum performs best.
- 5.
There is no variance in XGBoost's results because the model contains no stochastic components; it therefore yields the same result regardless of how many times it is run.
- 6.
Analysis delays are calculated as the average time of analyzing the samples in the test set (i.e., \(17\,342\) samples) using dynamic or static analysis.
References
Abhishek, S., Zheng, B.: Hot knives through butter: evading file-based sandboxes. Technical report, FireEye (2013)
Anderson, H.S., Roth, P.: EMBER: an open dataset for training static PE malware machine learning models. arXiv preprint arXiv:1804.04637 (2018)
Bucila, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of ACM SIGKDD, pp. 535–541. KDD 2006, ACM, New York, NY, USA (2006)
XGBoost contributors: Extreme gradient boosting open-source software library. https://xgboost.readthedocs.io/en/latest/parameter.html. Accessed 12 Mar 2022
Damodaran, A., Di Troia, F., Visaggio, C.A., Austin, T.H., Stamp, M.: A comparison of static, dynamic, and hybrid analysis for malware detection. J. Comput. Virol. Hack. Techn. 13(1), 1–12 (2017)
NSA Research Directorate: Ghidra: a software reverse engineering (SRE) suite of tools in support of the cybersecurity mission. https://ghidra-sre.org/. Accessed 12 June 2022
Fan, Y., Ye, Y., Chen, L.: Malicious sequential pattern mining for automatic malware detection. Expert Syst. Appl. 52, 16–25 (2016)
horsicq: Detect It Easy (DIE): a program for determining types of files. https://github.com/horsicq/Detect-It-Easy#. Accessed 13 June 2022
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)
Han, W., Xue, J., Wang, Y., Huang, L., Kong, Z., Mao, L.: MalDAE: detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Comput. Secur. 83, 208–233 (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hex-Rays: IDA Pro: a binary code analysis tool - a powerful disassembler and a versatile debugger. https://hex-rays.com/IDA-pro/. Accessed 12 June 2022
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Kadiyala, S.P., Kartheek, A., Truong-Huu, T.: Program Behavior Analysis and Clustering using Performance Counters. In: Proceedings of 2020 DYnamic and Novel Advances in Machine Learning and Intelligent Cyber Security (DYNAMICS) Workshop. Virtual Event, December 2020
Kang, B., Yerima, S.Y., Mclaughlin, K., Sezer, S.: N-opcode analysis for android malware classification and categorization. In: 2016 International Conference On Cyber Security And Protection Of Digital Services (Cyber Security) (2016)
Kim, T., Oh, J., Kim, N., Cho, S., Yun, S.: Comparing kullback-leibler divergence and mean squared error loss in knowledge distillation. CoRR abs/2105.08919 (2021)
Kundu, P.P., Anatharaman, L., Truong-Huu, T.: An empirical evaluation of automated machine learning techniques for malware detection. In: Proceedings of the 2021 ACM Workshop on Security and Privacy Analytics, pp. 75–81. Virtual Event, USA (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. arXiv preprint arXiv:2201.03545 (2022)
Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: ACSAC 2007, pp. 421–430 (2007)
Ndibanje, B., Kim, K.H., Kang, Y.J., Kim, H.H., Kim, T.Y., Lee, H.J.: Cross-method-based analysis and classification of malicious behavior by API calls extraction. Appl. Sci. 9(2), 239 (2019)
Or-Meir, O., Nissim, N., Elovici, Y., Rokach, L.: Dynamic malware analysis in the modern era-a state of the art survey. ACM Comput. Surv. 52(5) (2019)
Oracle: Oracle virtualbox. https://www.virtualbox.org/. Accessed 12 June 2022
Oramas, S., Nieto, O., Sordo, M., Serra, X.: A deep multimodal approach for cold-start music recommendation. In: Proceedings of 2nd Workshop on Deep Learning for Recommender Systems, pp. 32–37. Como, Italy, August 2017
Pytorch: Resnet implementation in pytorch. https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py. Accessed 12 June 2022
Quarkslab: Lief: Library to instrument executable formats. https://lief-project.github.io/. Accessed 18 Feb 2022
Rabadi, D., Teo, S.G.: Advanced windows methods on malware detection and classification. In: Annual Computer Security Applications Conference, pp. 54–68. ACSAC 2020, Association for Computing Machinery, New York, NY, USA (2020)
Rhode, M., Burnap, P., Jones, K.: Early-stage malware prediction using recurrent neural networks. Comput. Secur. 77, 578–594 (2018)
Cuckoo Sandbox: Distributed Cuckoo. https://cuckoo.readthedocs.io/en/latest/usage/dist/. Accessed 12 June 2022
Santos, I., Brezo, F., Ugarte-Pedrero, X., Bringas, P.G.: Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf. Sci. 231, 64–82 (2013)
Scikit-Learn: Hashing vectorizer function. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html. Accessed 18 Feb 2022
Tan, W.L., Truong-Huu, T.: Enhancing robustness of malware detection using synthetically-adversarial samples. In: GLOBECOM 2020–2020 IEEE Global Communications Conference, Taipei, Taiwan, December 2020
URSoftware: W32Dasm: a disassembler tool made to translate machine language back into assembly language. https://www.softpedia.com/get/Programming/Debuggers-Decompilers-Dissasemblers/WDASM.shtml. Accessed 12 June 2022
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)
Zhang, J., et al.: Scarecrow: deactivating evasive malware via its own evasive logic. In: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 76–87 (2020)
Zhang, J., Qin, Z., Yin, H., Ou, L., Zhang, K.: A feature-hybrid malware variants detection using CNN based opcode embedding and BPNN based API embedding. Comput. Secur. 84, 376–392 (2019)
Zhi, Y., Xi, N., Liu, Y., Hui, H.: A lightweight android malware detection framework based on knowledge distillation. In: Yang, M., Chen, C., Liu, Y. (eds.) NSS 2021. LNCS, vol. 13041, pp. 116–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92708-0_7
Appendices
A Ablation Studies
A.1 Transfer Learning for API-ARG Features
Even though transferring knowledge to a student model trained with dynamic features is not a focus of our work, for completeness we evaluate in this section the performance of distilled student models trained with dynamic API-ARG features. Since the training time for a student model trained with API-ARG features is much longer than for models trained with EMBER or OPCODE features, we only evaluate one set of the recommended hyper-parameters from [13]: \(\alpha =0.1, \tau =5\) for KD-KL, and \(\alpha =0.1\) for KD-MSE. As shown in Table 7, the distilled student models for API-ARG features perform better than the student-alone model. The performance trends also follow our conjecture that knowledge distillation helps transfer knowledge to a distilled student model whose dynamic API-ARG feature vector contributes positively to the teacher model (i.e., Agg2-Lat). We also observe that the KD-KL student model obtains better performance than the KD-MSE student model, unlike the case where the student model is trained with EMBER features. This justifies our exploration of both loss functions during transfer learning, even though the authors of [16] stated that the KL loss function is more suitable for noisy labels, which does not apply here since the labels of the malware samples are stable and provided by the security vendor.
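The two distillation objectives can be sketched as follows. This is a minimal, framework-free illustration under our assumptions: the function names are ours, the defaults mirror the hyper-parameters above (\(\alpha =0.1, \tau =5\)), and the paper's actual implementation details may differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Hard-label cross entropy: negative log-probability of the true class.
    return -math.log(probs[label])

def kd_kl_loss(z_student, z_teacher, label, alpha=0.1, tau=5.0):
    """KD-KL: hard cross entropy plus tau^2-scaled KL divergence between
    the teacher's and student's temperature-softened distributions."""
    hard = cross_entropy(softmax(z_student), label)
    p_t = softmax([z / tau for z in z_teacher])
    p_s = softmax([z / tau for z in z_student])
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * hard + (1.0 - alpha) * tau * tau * kl

def kd_mse_loss(z_student, z_teacher, label, alpha=0.1):
    """KD-MSE: hard cross entropy plus mean squared error between the
    raw teacher and student logits."""
    hard = cross_entropy(softmax(z_student), label)
    mse = sum((zs - zt) ** 2 for zs, zt in zip(z_student, z_teacher)) / len(z_student)
    return alpha * hard + (1.0 - alpha) * mse
```

When the student reproduces the teacher's logits exactly, both soft terms vanish and each loss reduces to \(\alpha\) times the hard cross entropy.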
A.2 Experiments with Different Neural Network Architectures
Besides ResNet1D, used for all the experiments presented above, we adopt recent advanced ResNet-style neural network architectures from computer vision, such as ResNeXt [33], inverted ResNeXt [33], and the recent ConvNeXt [18]. We adjust our ResNet1D basic block and implement its variants accordingly, namely the ResNeXt1D, Inverted ResNeXt1D, and ConvNeXt1D basic blocks, presented in Fig. 8. We reuse the high-level neural network architectures defined for ResNet1D in Table 9, Table 10, and Table 11 for each individual feature vector (i.e., EMBER, OPCODE, and API-ARG).
The average performance over 5 runs of ResNet1D (with kernel sizes \(K=3\) and \(K=1\)) and of the other variants, for the student-alone models trained with EMBER and OPCODE features and the teacher model trained with the Agg2-Lat feature vector, is presented in Table 8. We observe that ResNet1D with kernel size \(K=3\) obtains the highest accuracy with a small model size for all models trained with the EMBER, OPCODE, and Agg2-Lat feature vectors. This explains why we chose ResNet1D with kernel size \(K=3\) as our basic block for all the experiments presented above.
B Implementation of Neural Network Architecture for Individual Feature Vectors
The detailed implementations of the neural network architectures for the EMBER, OPCODE, and API-ARG feature vectors are shown in Table 9, Table 10, and Table 11, respectively.
C Neural Network Architecture of Aggregated Original Features
In Table 12 and Table 13, we present the architectures of the 1D-CNN models for aggregated feature vectors from 2 original feature vectors (Agg2-Org—EMBER + API-ARG) and 3 original feature vectors (Agg3-Org—EMBER + OPCODE + API-ARG), which are developed and evaluated in our work.
D Speeding up Dynamic Analysis with Distributed Cuckoo Infrastructure
As discussed earlier, producing dynamic analysis reports for malware samples is time-consuming. The analysis time depends on a user-defined parameter specifying the maximum time a sample is analyzed in the Cuckoo sandbox, with a default of two minutes. Based on our daily experiments, using a single Cuckoo sandbox we were able to analyze 8000 samples per week. We could use multiple sandboxes and manually submit malware samples to speed up the analysis. However, this is quite laborious, as the virtual machines hosting the Cuckoo sandboxes frequently crash, requiring close monitoring to fix occurring issues.
In this work, we developed a parallel dynamic analysis infrastructure using the preliminary version of distributed Cuckoo [28]. We enriched the infrastructure with additional automation for fault tolerance, which includes:
-
A re-submission mechanism that resubmits samples that have not been successfully analyzed. The number of resubmissions is a user-predefined parameter (e.g., three times).
-
A monitoring mechanism that checks whether virtual machines are in normal working condition or have crashed. We implement an active monitoring technique that periodically sends monitoring requests to Oracle VirtualBox [22] to check virtual machine status (Cuckoo uses Oracle VirtualBox to host virtual machines and sandboxes).
-
A virtual machine (VM) instantiation mechanism that instantiates new VMs to replace crashed ones. This process is invoked automatically when the monitoring system detects a crashed VM, allowing us to maximize the utilization of computing resources (i.e., the available VMs hosted on physical servers).
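One pass of the three fault-tolerance mechanisms above can be sketched as follows. This is a simplified illustration with hypothetical callback names (`is_running`, `restart`, `submit`); the real infrastructure queries Oracle VirtualBox for VM status and submits samples through the distributed Cuckoo API.

```python
def monitor_and_heal(vms, is_running, restart, pending, submit, max_resubmit=3):
    """One monitoring pass over the analysis infrastructure.

    vms: names of the VMs hosting Cuckoo sandboxes.
    is_running: probe that reports whether a VM is in normal working condition.
    restart: instantiates a fresh VM to replace a crashed one.
    pending: list of (sample, attempts) pairs not yet successfully analyzed.
    submit: resubmits a sample for dynamic analysis.
    max_resubmit: user-predefined retry budget (three in our setup).
    """
    # Active monitoring: replace every crashed VM with a new instance.
    for vm in vms:
        if not is_running(vm):
            restart(vm)

    # Re-submission: retry failed samples until the budget is exhausted.
    still_pending = []
    for sample, attempts in pending:
        if attempts >= max_resubmit:
            continue  # give up on this sample
        submit(sample)
        still_pending.append((sample, attempts + 1))
    return still_pending
```

In practice this pass runs periodically, so a crashed VM is only out of rotation until the next monitoring interval.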
With the developed infrastructure and the available resources in our lab, we could run up to 12 Cuckoo sandboxes at the same time. The implemented fault-tolerance mechanisms also relieve us from the burden of manual monitoring and submission. With more than \(86\,000\) samples used in our experiments (discussed further in Sect. 6), we could complete the analysis in just 10 days.
E Discussion on Model Updating
As new malware samples appear and evolve daily, model updating is needed to keep the model up to date and thus able to handle such new samples. However, we believe this warrants separate work developing new methods for model updating, such as online learning and handling data drift. A naive solution is to retrain the teacher model and re-transfer its knowledge to the student model. The old and new student models are then tested on a new test set, and the new student is deployed only if it outperforms the old one by a certain threshold. We leave this for future work.
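The naive update policy above amounts to a simple deployment gate. The sketch below illustrates it under our assumptions; the function names and the threshold value are ours, not part of the paper.

```python
def accuracy(model, samples):
    # Fraction of (feature, label) pairs the model classifies correctly.
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)

def select_student(old_model, new_model, test_set, threshold=0.005):
    """Deploy the retrained student only if it beats the currently
    deployed one on a fresh test set by at least `threshold` accuracy."""
    old_acc = accuracy(old_model, test_set)
    new_acc = accuracy(new_model, test_set)
    return new_model if new_acc - old_acc >= threshold else old_model
```

The threshold guards against redeploying a model whose apparent gain is within evaluation noise.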
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Ngo, M.V., Truong-Huu, T., Rabadi, D., Loo, J.Y., Teo, S.G. (2023). Fast and Efficient Malware Detection with Joint Static and Dynamic Features Through Transfer Learning. In: Tibouchi, M., Wang, X. (eds) Applied Cryptography and Network Security. ACNS 2023. Lecture Notes in Computer Science, vol 13905. Springer, Cham. https://doi.org/10.1007/978-3-031-33488-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33487-0
Online ISBN: 978-3-031-33488-7