Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo

Kale, Aparna Sunil; Pandya, Vinay; Di Troia, Fabio; Stamp, Mark

doi:10.1007/s11416-022-00424-3

Invited Paper
Published: 22 April 2022

Volume 19, pages 1–16, (2023)
Journal of Computer Virology and Hacking Techniques

Aparna Sunil Kale¹,
Vinay Pandya¹,
Fabio Di Troia¹ &
Mark Stamp ORCID: orcid.org/0000-0002-3803-8368¹

Abstract

Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models trained on features such as opcode sequences, API calls, and byte n-grams, among many others. In this research, we consider opcode features and apply word embedding techniques (specifically Word2Vec, HMM2Vec, BERT, and ELMo) as a feature engineering step. The resulting embedding vectors are then used as features for classification algorithms: support vector machines (SVM), k-nearest neighbor (kNN), random forests (RF), and convolutional neural networks (CNN). We conduct substantial experiments involving seven malware families, extending beyond previous related work in this field. We show that we can obtain slightly better performance than in comparable previous work, with significantly faster model training times.
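
The pipeline outlined in the abstract (opcode sequences, then embedding vectors, then a classifier) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it swaps in a simple co-occurrence-count embedding as a stand-in for Word2Vec, uses a plain kNN vote as the classifier, and relies only on the Python standard library. The opcode sequences, the labels `family_a` and `family_b`, and all function names are hypothetical toy values chosen for illustration.

```python
import math
from collections import Counter


def cooccurrence_embeddings(sequences, window=2):
    """Embed each opcode as its L2-normalized co-occurrence counts.

    Assumption: a crude stand-in for Word2Vec, not the paper's method.
    """
    vocab = sorted({op for seq in sequences for op in seq})
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0.0] * len(vocab) for op in vocab}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[op][index[seq[j]]] += 1.0
    for op, vec in vectors.items():
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors[op] = [x / norm for x in vec]
    return vectors


def sample_features(seq, vectors):
    """Average the embedding vectors of the known opcodes in one sample."""
    dim = len(next(iter(vectors.values())))
    total, count = [0.0] * dim, 0
    for op in seq:
        if op in vectors:
            count += 1
            for k, x in enumerate(vectors[op]):
                total[k] += x
    return [x / count for x in total] if count else total


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean)."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy opcode sequences for two made-up families.
train_seqs = [
    ["mov", "push", "call", "pop", "mov", "push", "call", "ret"],
    ["push", "mov", "call", "mov", "pop", "ret"],
    ["mov", "mov", "push", "call", "pop", "ret"],
    ["xor", "add", "jmp", "xor", "sub", "jmp"],
    ["add", "xor", "sub", "jmp", "xor", "add"],
    ["xor", "sub", "add", "jmp", "sub", "xor"],
]
train_y = ["family_a"] * 3 + ["family_b"] * 3

vectors = cooccurrence_embeddings(train_seqs)
train_X = [sample_features(seq, vectors) for seq in train_seqs]

unseen = ["push", "call", "mov", "pop", "ret"]
print(knn_predict(train_X, train_y, sample_features(unseen, vectors)))
```

Averaging per-opcode vectors into one fixed-length feature vector per sample is one simple way to feed embeddings to a conventional classifier; the source does not state how the paper itself builds its feature vectors, so that step is an assumption of this sketch.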

Author information

Authors and Affiliations

Department of Computer Science, San Jose State University, San Jose, CA, USA
Aparna Sunil Kale, Vinay Pandya, Fabio Di Troia & Mark Stamp

Corresponding author

Correspondence to Mark Stamp.

About this article

Cite this article

Kale, A.S., Pandya, V., Di Troia, F. et al. Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo. J Comput Virol Hack Tech 19, 1–16 (2023). https://doi.org/10.1007/s11416-022-00424-3

Received: 15 October 2021
Accepted: 18 March 2022
Published: 22 April 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s11416-022-00424-3
