Abstract
Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte n-grams, among many others. In this research, we consider opcode features and apply word embedding techniques (specifically Word2Vec, HMM2Vec, BERT, and ELMo) as a feature engineering step. The resulting embedding vectors are then used as features for classification algorithms: support vector machines (SVM), k-nearest neighbors (kNN), random forests (RF), and convolutional neural networks (CNN). We conduct substantial experiments involving seven malware families, extending beyond previous related work in this field. We show that we can obtain slightly better performance than comparable previous work, with significantly faster model training times.
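The two-stage pipeline described above (embed opcodes, then classify on the embedding vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: simple co-occurrence counts stand in for a learned embedding such as Word2Vec, a 1-nearest-neighbor vote stands in for the kNN classifier, and the opcode sequences, family labels, and function names are all hypothetical toy values chosen for illustration.

```python
from collections import Counter
import math


def cooccurrence_embeddings(sequences, window=2):
    """Embed each opcode as its co-occurrence counts within a window.

    Assumption: this is a crude stand-in for a trained embedding model.
    """
    vocab = sorted({op for seq in sequences for op in seq})
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0.0] * len(vocab) for op in vocab}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[op][index[seq[j]]] += 1.0
    return vectors


def sample_vector(seq, vectors):
    """Represent one sample as the mean of its known opcode vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    known = [op for op in seq if op in vectors]
    for op in known:
        for k, value in enumerate(vectors[op]):
            total[k] += value
    return [t / len(known) for t in total] if known else total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vecs, train_labels, k=1):
    """Majority vote among the k most cosine-similar training samples."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two made-up "families" with disjoint opcode usage.
train = [
    (["mov", "push", "call", "mov", "push", "call"], "family_a"),
    (["push", "call", "mov", "push", "call", "mov"], "family_a"),
    (["xor", "add", "jmp", "xor", "add", "jmp"], "family_b"),
    (["add", "jmp", "xor", "add", "jmp", "xor"], "family_b"),
]
sequences = [seq for seq, _ in train]
train_labels = [label for _, label in train]

vectors = cooccurrence_embeddings(sequences)
train_vecs = [sample_vector(seq, vectors) for seq in sequences]

query = ["mov", "push", "call", "push"]
predicted = knn_predict(sample_vector(query, vectors), train_vecs, train_labels)
print(predicted)  # family_a
```

In practice, the embedding step would be replaced by a trained model and the classifier by a tuned SVM, kNN, RF, or CNN; the sketch only shows how per-opcode vectors become per-sample features.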
Cite this article
Kale, A.S., Pandya, V., Di Troia, F. et al. Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo. J Comput Virol Hack Tech 19, 1–16 (2023). https://doi.org/10.1007/s11416-022-00424-3
DOI: https://doi.org/10.1007/s11416-022-00424-3