Malware Classification Based on Semi-Supervised Learning

Ding, Yu; Zhang, XiaoYu; Li, BinBin; Xing, Jian; Qiang, Qian; Qi, ZiSen; Guo, MengHan; Jia, SiYu; Wang, HaiPing

doi:10.1007/978-3-031-17551-0_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13580)

Included in the following conference series:

International Conference on Science of Cyber Security

Abstract

The rapid evolution of malware over the past few years has caused serious threats and damage to network security. In response, researchers have proposed effective classification approaches for various malware variants. However, the widely used deep learning methods are fully supervised, which leads to two unavoidable problems: 1) time consumption: manually labeling data before training fully supervised models requires huge manual effort; 2) resource redundancy: a large amount of unlabeled data is not fully used, which wastes resources. To solve these problems, we propose a Malware Classification Method based on Semi-Supervised Learning, named MCM-SSL, which divides model training into a pre-training stage using unlabeled data and a fine-tuning stage using labeled data. The proposed method makes effective use of a large amount of unlabeled data and needs only a small amount of labeled data to achieve excellent performance. Our method achieves an accuracy of 90.51% on the open-source Virus-MNIST dataset, which is superior to recent state-of-the-art methods. We also verify the generality and robustness of our method using a variety of common neural network algorithms. For the same algorithm, the accuracy of the pre-trained model is on average 2.4% higher than that of the model without pre-training.
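
The two-stage data flow described in the abstract (an unsupervised stage on unlabeled data, then a supervised stage on a small labeled set) can be illustrated with a toy sketch. The abstract does not state the network architecture or the pre-training objective, so this example is not MCM-SSL: it swaps in plain k-means for the pre-training stage and a majority-vote labeling of clusters for the fine-tuning stage. The synthetic 2-D data, the function names, and the labels `family_a` and `family_b` are all assumptions made for illustration.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def pretrain(unlabeled, k=2, iters=10):
    """Stage 1 (unlabeled data only): learn k centroids with plain k-means."""
    # Deterministic farthest-point initialisation.
    centroids = [unlabeled[0]]
    while len(centroids) < k:
        centroids.append(
            max(unlabeled, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in unlabeled:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def finetune(centroids, labeled):
    """Stage 2 (few labels): name each centroid by majority vote."""
    votes = [Counter() for _ in centroids]
    for p, y in labeled:
        votes[nearest(p, centroids)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def predict(p, centroids, names):
    """Classify p with the label attached to its nearest centroid."""
    return names[nearest(p, centroids)]


def make_blob(rng, center, n):
    """Hypothetical stand-in for feature vectors of one malware family."""
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)) for _ in range(n)]


rng = random.Random(0)
unlabeled = make_blob(rng, (0, 0), 100) + make_blob(rng, (10, 10), 100)
labeled = [(p, "family_a") for p in make_blob(rng, (0, 0), 2)]
labeled += [(p, "family_b") for p in make_blob(rng, (10, 10), 2)]
test = [(p, "family_a") for p in make_blob(rng, (0, 0), 50)]
test += [(p, "family_b") for p in make_blob(rng, (10, 10), 50)]

centroids = pretrain(unlabeled)        # uses 200 unlabeled samples
names = finetune(centroids, labeled)   # uses only 4 labeled samples
accuracy = sum(predict(p, centroids, names) == y for p, y in test) / len(test)
print(f"accuracy with {len(labeled)} labels: {accuracy:.2f}")
```

The sketch only shows why unlabeled data can reduce the labeling burden: the structure is learned without labels, and the few labels are spent on naming that structure. A deep learning pipeline would replace stage 1 with a learned feature extractor and stage 2 with gradient-based training on the labeled set.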

Acknowledgment

We would like to thank Jie Yuan from Iowa State University for the valuable discussions and insightful comments. This work was supported by the National Natural Science Foundation of China (Grants 61871378 and U2003111) and the Defense Industrial Technology Development Program (Grant JCKY2021906A001).

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Yu Ding, XiaoYu Zhang, BinBin Li, Jian Xing, Qian Qiang, ZiSen Qi, MengHan Guo, SiYu Jia & HaiPing Wang
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Yu Ding, Jian Xing, Qian Qiang, ZiSen Qi, SiYu Jia & HaiPing Wang
National Computer Network Emergency Response Technical Team/Coordination Center of China Xinjiang Branch, Urumqi, China
Jian Xing
National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China
Qian Qiang

Corresponding author

Correspondence to XiaoYu Zhang.

Editor information

Editors and Affiliations

University of Aizu, Aizuwakamatsu, Fukushima, Japan
Chunhua Su
Kyushu University, Fukuoka, Japan
Kouichi Sakurai
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Feng Liu

About this paper

Cite this paper

Ding, Y. et al. (2022). Malware Classification Based on Semi-Supervised Learning. In: Su, C., Sakurai, K., Liu, F. (eds) Science of Cyber Security. SciSec 2022. Lecture Notes in Computer Science, vol 13580. Springer, Cham. https://doi.org/10.1007/978-3-031-17551-0_19

DOI: https://doi.org/10.1007/978-3-031-17551-0_19
Published: 30 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17550-3
Online ISBN: 978-3-031-17551-0
eBook Packages: Computer Science, Computer Science (R0)
