Abstract
In malware detection, dynamic analysis extracts the runtime behavior of malware samples in a controlled environment, while static analysis extracts features using reverse engineering tools. The former faces the challenges of anti-virtualization and evasive behavior of malware samples, whereas the latter faces the challenge of code obfuscation. To tackle these drawbacks, prior works proposed to develop detection models that aggregate dynamic and static features, thus leveraging the advantages of both approaches. However, simply concatenating dynamic and static features raises an issue of imbalanced contribution to the performance of malware detection models due to the heterogeneous dimensions of the feature vectors. Moreover, dynamic analysis is a time-consuming task that requires a secure environment, leading to detection delays and high costs for maintaining the analysis infrastructure. In this paper, we first introduce a method of constructing aggregated features by concatenating latent features learned through deep learning, with equally contributing dimensions. We then develop a knowledge distillation technique to transfer knowledge learned from the aggregated features by a teacher model to a student model trained only on static features, and use the trained student model to detect new malware samples. We carry out extensive experiments with a dataset of \(86\,709\) samples including both benign and malware samples. The experimental results show that the teacher model trained on aggregated features constructed by our method outperforms the state-of-the-art models with an improvement of up to \(2.38\%\) in detection accuracy. The distilled student model not only achieves high performance (\(97.81\%\) in terms of accuracy), close to that of the teacher model, but also significantly reduces the detection time (from \(70\,046.6\) ms to \(194.9\) ms) by not requiring dynamic analysis.
Notes
- 1.
Dynamic/static features are features extracted with the dynamic/static analysis method.
- 2.
In [26], API-ARG was reported under a different name, Method2.
- 3.
The hash values of the samples will be provided on request for experiment reproducibility.
- 4.
After experimenting with several optimizers, e.g., Adam, AdamW, RMSprop, SGD, and SGD with momentum, we observed that SGD with momentum performs best.
- 5.
There is no variance in XGBoost's results because the model contains no stochastic components; it therefore yields the same result regardless of how many times it is run.
- 6.
Analysis delays are calculated as the average time of analyzing the samples in the test set (i.e., \(17\,342\) samples) using dynamic or static analysis.
References
Abhishek, S., Zheng, B.: Hot knives through butter: evading file-based sandboxes. Technical report, FireEye (2013)
Anderson, H.S., Roth, P.: EMBER: an open dataset for training static PE malware machine learning models. arXiv preprint arXiv:1804.04637 (2018)
Bucila, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of ACM SIGKDD, pp. 535–541. KDD 2006, ACM, New York, NY, USA (2006)
XGBoost contributors: Extreme gradient boosting open-source software library. https://xgboost.readthedocs.io/en/latest/parameter.html. Accessed 12 Mar 2022
Damodaran, A., Di Troia, F., Visaggio, C.A., Austin, T.H., Stamp, M.: A comparison of static, dynamic, and hybrid analysis for malware detection. J. Comput. Virol. Hack. Techn. 13(1), 1–12 (2017)
NSA Research Directorate: Ghidra: a software reverse engineering (SRE) suite of tools in support of the cybersecurity mission. https://ghidra-sre.org/. Accessed 12 June 2022
Fan, Y., Ye, Y., Chen, L.: Malicious sequential pattern mining for automatic malware detection. Expert Syst. Appl. 52, 16–25 (2016)
horsicq: Detect It Easy (DIE): a program for determining types of files. https://github.com/horsicq/Detect-It-Easy#. Accessed 13 June 2022
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)
Han, W., Xue, J., Wang, Y., Huang, L., Kong, Z., Mao, L.: MalDAE: detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Comput. Secur. 83, 208–233 (2019)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hex-Rays: IDA Pro: a binary code analysis tool - a powerful disassembler and a versatile debugger. https://hex-rays.com/IDA-pro/. Accessed 12 June 2022
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Kadiyala, S.P., Kartheek, A., Truong-Huu, T.: Program Behavior Analysis and Clustering using Performance Counters. In: Proceedings of 2020 DYnamic and Novel Advances in Machine Learning and Intelligent Cyber Security (DYNAMICS) Workshop. Virtual Event, December 2020
Kang, B., Yerima, S.Y., Mclaughlin, K., Sezer, S.: N-opcode analysis for android malware classification and categorization. In: 2016 International Conference On Cyber Security And Protection Of Digital Services (Cyber Security) (2016)
Kim, T., Oh, J., Kim, N., Cho, S., Yun, S.: Comparing kullback-leibler divergence and mean squared error loss in knowledge distillation. CoRR abs/2105.08919 (2021)
Kundu, P.P., Anatharaman, L., Truong-Huu, T.: An empirical evaluation of automated machine learning techniques for malware detection. In: Proceedings of the 2021 ACM Workshop on Security and Privacy Analytics, pp. 75–81. Virtual Event, USA (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. arXiv preprint arXiv:2201.03545 (2022)
Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: ACSAC 2007, pp. 421–430 (2007)
Ndibanje, B., Kim, K.H., Kang, Y.J., Kim, H.H., Kim, T.Y., Lee, H.J.: Cross-method-based analysis and classification of malicious behavior by API calls extraction. Appl. Sci. 9(2), 239 (2019)
Or-Meir, O., Nissim, N., Elovici, Y., Rokach, L.: Dynamic malware analysis in the modern era-a state of the art survey. ACM Comput. Surv. 52(5) (2019)
Oracle: Oracle virtualbox. https://www.virtualbox.org/. Accessed 12 June 2022
Oramas, S., Nieto, O., Sordo, M., Serra, X.: A deep multimodal approach for cold-start music recommendation. In: Proceedings of 2nd Workshop on Deep Learning for Recommender Systems, pp. 32–37. Como, Italy, August 2017
Pytorch: Resnet implementation in pytorch. https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py. Accessed 12 June 2022
Quarkslab: Lief: Library to instrument executable formats. https://lief-project.github.io/. Accessed 18 Feb 2022
Rabadi, D., Teo, S.G.: Advanced windows methods on malware detection and classification. In: Annual Computer Security Applications Conference, pp. 54–68. ACSAC 2020, Association for Computing Machinery, New York, NY, USA (2020)
Rhode, M., Burnap, P., Jones, K.: Early-stage malware prediction using recurrent neural networks. Comput. Secur. 77, 578–594 (2018)
Cuckoo Sandbox: Distributed Cuckoo. https://cuckoo.readthedocs.io/en/latest/usage/dist/. Accessed 12 June 2022
Santos, I., Brezo, F., Ugarte-Pedrero, X., Bringas, P.G.: Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf. Sci. 231, 64–82 (2013)
Scikit-Learn: Hashing vectorizer function. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html. Accessed 18 Feb 2022
Tan, W.L., Truong-Huu, T.: Enhancing robustness of malware detection using synthetically-adversarial samples. In: GLOBECOM 2020–2020 IEEE Global Communications Conference, Taipei, Taiwan, December 2020
URSoftware: W32Dasm: a disassembler tool made to translate machine language back into assembly language. https://www.softpedia.com/get/Programming/Debuggers-Decompilers-Dissasemblers/WDASM.shtml. Accessed 12 June 2022
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)
Zhang, J., et al.: Scarecrow: deactivating evasive malware via its own evasive logic. In: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 76–87 (2020)
Zhang, J., Qin, Z., Yin, H., Ou, L., Zhang, K.: A feature-hybrid malware variants detection using CNN based opcode embedding and BPNN based API embedding. Comput. Secur. 84, 376–392 (2019)
Zhi, Y., Xi, N., Liu, Y., Hui, H.: A lightweight android malware detection framework based on knowledge distillation. In: Yang, M., Chen, C., Liu, Y. (eds.) NSS 2021. LNCS, vol. 13041, pp. 116–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92708-0_7
Appendices
A Ablation Studies
A.1 Transfer Learning for API-ARG Features
Even though transferring knowledge to a student model trained with dynamic features is not a focus of our work, for completeness we evaluate in this section the performance of distilled student models trained with dynamic API-ARG features. Since the training time for a student model trained with API-ARG features is much longer than for models trained with EMBER or OPCODE features, we only evaluate one set of the recommended hyper-parameters from [13]: \(\alpha =0.1, \tau =5\) for KD-KL, and \(\alpha =0.1\) for KD-MSE. As shown in Table 7, the distilled student models for API-ARG features perform better than the student-alone model. The performance trends also follow our conjecture that knowledge distillation helps transfer knowledge to a distilled student model whose dynamic API-ARG feature vector contributes positively to the teacher model (i.e., Agg2-Lat). We also observe that the KD-KL student model obtains better performance than the KD-MSE student model, unlike the case where the student model is trained with EMBER features. This justifies our exploration of both loss functions during transfer learning, even though the authors of [16] stated that the KL loss function is more suitable for noisy labels, which does not apply here since the labels of the malware samples are stable and provided by the security vendor.
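The two distillation objectives can be sketched as follows. This is a minimal, framework-free illustration under our assumptions: the function names are ours, the defaults mirror the hyper-parameters above (\(\alpha =0.1, \tau =5\)), and the paper's actual implementation details may differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Hard-label cross entropy: negative log-probability of the true class.
    return -math.log(probs[label])

def kd_kl_loss(z_student, z_teacher, label, alpha=0.1, tau=5.0):
    """KD-KL: hard cross entropy plus tau^2-scaled KL divergence between
    the teacher's and student's temperature-softened distributions."""
    hard = cross_entropy(softmax(z_student), label)
    p_t = softmax([z / tau for z in z_teacher])
    p_s = softmax([z / tau for z in z_student])
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * hard + (1.0 - alpha) * tau * tau * kl

def kd_mse_loss(z_student, z_teacher, label, alpha=0.1):
    """KD-MSE: hard cross entropy plus mean squared error between the
    raw teacher and student logits."""
    hard = cross_entropy(softmax(z_student), label)
    mse = sum((zs - zt) ** 2 for zs, zt in zip(z_student, z_teacher)) / len(z_student)
    return alpha * hard + (1.0 - alpha) * mse
```

When the student reproduces the teacher's logits exactly, both soft terms vanish and each loss reduces to \(\alpha\) times the hard cross entropy.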
A.2 Experiments with Different Neural Network Architectures
Besides ResNet1D, used for all the experiments presented above, we adopt recent advanced ResNet-style neural network architectures from computer vision, such as ResNeXt [33], inverted ResNeXt [33], and the recent ConvNeXt [18]. We adjust our ResNet1D basic block and implement its variants accordingly, namely the ResNeXt1D, Inverted ResNeXt1D, and ConvNeXt1D basic blocks, presented in Fig. 8. We reuse the high-level neural network architectures defined for ResNet1D in Table 9, Table 10, and Table 11 for each individual feature vector (i.e., EMBER, OPCODE, and API-ARG).
The average performance over 5 runs of ResNet1D (with kernel sizes \(K=3\) and \(K=1\)) and of the other variants, for the student-alone models trained with EMBER and OPCODE features and the teacher model trained with the Agg2-Lat feature vector, is presented in Table 8. We observe that ResNet1D with kernel size \(K=3\) obtains the highest accuracy with a small model size for all models trained with the EMBER, OPCODE, and Agg2-Lat feature vectors. This explains why we chose ResNet1D with kernel size \(K=3\) as our basic block for all the experiments presented above.
B Implementation of Neural Network Architecture for Individual Feature Vectors
The detailed implementations of the neural network architectures for the EMBER, OPCODE, and API-ARG feature vectors are shown in Table 9, Table 10, and Table 11, respectively.
C Neural Network Architecture of Aggregated Original Features
In Table 12 and Table 13, we present the architectures of the 1D-CNN models for aggregated feature vectors from 2 original feature vectors (Agg2-Org—EMBER + API-ARG) and 3 original feature vectors (Agg3-Org—EMBER + OPCODE + API-ARG), which are developed and evaluated in our work.
D Speeding up Dynamic Analysis with Distributed Cuckoo Infrastructure
As discussed earlier, producing dynamic analysis reports for malware samples is time-consuming. The analysis time depends on a user-defined parameter specifying the maximum time a sample is analyzed in the Cuckoo sandbox, with a default of two minutes. Based on our daily experiments, using a single Cuckoo sandbox we were able to analyze 8000 samples per week. We could use multiple sandboxes and manually submit malware samples to speed up the analysis. However, this is quite laborious, as the virtual machines hosting the Cuckoo sandboxes frequently crash, requiring close monitoring to fix occurring issues.
In this work, we developed a parallel dynamic analysis infrastructure using the preliminary version of distributed Cuckoo [28]. We enriched the infrastructure with additional automation for fault tolerance, which includes:
-
A re-submission mechanism that resubmits samples that have not been successfully analyzed. The number of resubmissions is a user-predefined parameter (e.g., three times).
-
A monitoring mechanism that checks whether virtual machines are in normal working condition or have crashed. We implement an active monitoring technique that periodically sends monitoring requests to Oracle VirtualBox [22] to check virtual machine status (Cuckoo uses Oracle VirtualBox to host virtual machines and sandboxes).
-
A virtual machine (VM) instantiation mechanism that instantiates new VMs to replace crashed ones. This process is invoked automatically when the monitoring system detects a crashed VM, allowing us to maximize the utilization of computing resources (i.e., the available VMs hosted on physical servers).
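One pass of the three fault-tolerance mechanisms above can be sketched as follows. This is a simplified illustration with hypothetical callback names (`is_running`, `restart`, `submit`); the real infrastructure queries Oracle VirtualBox for VM status and submits samples through the distributed Cuckoo API.

```python
def monitor_and_heal(vms, is_running, restart, pending, submit, max_resubmit=3):
    """One monitoring pass over the analysis infrastructure.

    vms: names of the VMs hosting Cuckoo sandboxes.
    is_running: probe that reports whether a VM is in normal working condition.
    restart: instantiates a fresh VM to replace a crashed one.
    pending: list of (sample, attempts) pairs not yet successfully analyzed.
    submit: resubmits a sample for dynamic analysis.
    max_resubmit: user-predefined retry budget (three in our setup).
    """
    # Active monitoring: replace every crashed VM with a new instance.
    for vm in vms:
        if not is_running(vm):
            restart(vm)

    # Re-submission: retry failed samples until the budget is exhausted.
    still_pending = []
    for sample, attempts in pending:
        if attempts >= max_resubmit:
            continue  # give up on this sample
        submit(sample)
        still_pending.append((sample, attempts + 1))
    return still_pending
```

In practice this pass runs periodically, so a crashed VM is only out of rotation until the next monitoring interval.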
With the developed infrastructure and the available resources in our lab, we could run up to 12 Cuckoo sandboxes at the same time. The implemented fault-tolerance mechanisms also relieve us from the burden of manual monitoring and submission. With more than \(86\,000\) samples used in our experiments (discussed further in Sect. 6), we could complete the analysis in just 10 days.
E Discussion on Model Updating
As new malware samples appear and evolve daily, model updating is needed to keep the model up to date and thus able to handle such new samples. However, we believe this warrants separate work developing new methods for model updating, such as online learning and handling data drift. A naive solution is to retrain the teacher model and re-transfer its knowledge to the student model. The old and new student models are then tested on a new test set, and the new student is deployed only if it outperforms the old one by a certain threshold. We leave this for future work.
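The naive update policy above amounts to a simple deployment gate. The sketch below illustrates it under our assumptions; the function names and the threshold value are ours, not part of the paper.

```python
def accuracy(model, samples):
    # Fraction of (feature, label) pairs the model classifies correctly.
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)

def select_student(old_model, new_model, test_set, threshold=0.005):
    """Deploy the retrained student only if it beats the currently
    deployed one on a fresh test set by at least `threshold` accuracy."""
    old_acc = accuracy(old_model, test_set)
    new_acc = accuracy(new_model, test_set)
    return new_model if new_acc - old_acc >= threshold else old_model
```

The threshold guards against redeploying a model whose apparent gain is within evaluation noise.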
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Ngo, M.V., Truong-Huu, T., Rabadi, D., Loo, J.Y., Teo, S.G. (2023). Fast and Efficient Malware Detection with Joint Static and Dynamic Features Through Transfer Learning. In: Tibouchi, M., Wang, X. (eds) Applied Cryptography and Network Security. ACNS 2023. Lecture Notes in Computer Science, vol 13905. Springer, Cham. https://doi.org/10.1007/978-3-031-33488-7_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33487-0
Online ISBN: 978-3-031-33488-7