Malware Classification Based on Semi-Supervised Learning

  • Conference paper
Science of Cyber Security (SciSec 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13580))

Abstract

The rapid evolution of malware over the past few years has caused serious threats and damage to network security. In response, researchers have proposed effective classification approaches for various malware variants. However, the widely used deep-learning methods are fully supervised and therefore suffer from two inevitable problems: 1) time consumption: manually labeling data before training fully supervised models requires huge manual effort; 2) resource redundancy: a large amount of unlabeled data is not fully used, which wastes resources. To solve these problems, we propose a Malware Classification Method based on Semi-Supervised Learning, named MCM-SSL, which divides model training into a pre-training stage using unlabeled data and a fine-tuning stage using labeled data. The proposed method makes effective use of a large amount of unlabeled data and needs only a small amount of labeled data to achieve excellent performance. Our method achieves an accuracy of 90.51% on the open-source Virus-MNIST dataset, which is superior to recent state-of-the-art methods. We also verify the generality and robustness of our method using a variety of common neural network algorithms. For the same algorithm, the accuracy of the pre-trained model is on average 2.4% higher than that of the model without pre-training.
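
The two-stage idea in the abstract (pre-train on unlabeled data, then fine-tune on a small labeled set) can be illustrated with a toy sketch. This is not the MCM-SSL architecture or its pre-training objective, neither of which the abstract specifies. As stated assumptions: the data are synthetic 2-D Gaussian blobs standing in for malware features, the pre-training step uses Oja's rule (a classical unsupervised update that converges toward the first principal direction) for a one-unit linear encoder, and the fine-tuning step trains a logistic head jointly with that encoder. All function names (`make_data`, `pretrain`, `finetune`, `accuracy`, `run_demo`) are hypothetical.

```python
import math
import random

def make_data(n, rng):
    """Toy stand-in for malware features: two Gaussian blobs in 2-D (assumption)."""
    data = []
    for i in range(n):
        y = i % 2                          # alternate classes so both appear
        c = 2.0 if y == 1 else -2.0
        x = [c + rng.gauss(0, 0.5), c + rng.gauss(0, 0.5)]
        data.append((x, y))
    return data

def pretrain(unlabeled, epochs=20, lr=0.01):
    """Stage 1: fit a one-unit linear encoder with Oja's rule; no labels are used."""
    w = [1.0, 0.0]
    for _ in range(epochs):
        for x in unlabeled:
            z = w[0] * x[0] + w[1] * x[1]
            # Oja's rule: move w toward the direction of largest variance.
            w = [w[i] + lr * z * (x[i] - z * w[i]) for i in range(2)]
    return w

def finetune(w, labeled, epochs=50, lr=0.05):
    """Stage 2: train a logistic head and the encoder jointly on labeled data."""
    w = list(w)
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in labeled:
            z = w[0] * x[0] + w[1] * x[1]
            p = 1.0 / (1.0 + math.exp(-(a * z + b)))
            g = p - y                      # cross-entropy gradient w.r.t. the logit
            grad_w = [g * a * xi for xi in x]
            a -= lr * g * z
            b -= lr * g
            w = [w[i] - lr * grad_w[i] for i in range(2)]
    return w, a, b

def accuracy(w, a, b, data):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for x, y in data:
        z = w[0] * x[0] + w[1] * x[1]
        correct += int((a * z + b > 0) == (y == 1))
    return correct / len(data)

def run_demo(seed=0):
    rng = random.Random(seed)
    unlabeled = [x for x, _ in make_data(200, rng)]   # labels discarded
    labeled = make_data(10, rng)                      # small labeled set
    test = make_data(100, rng)
    w0 = pretrain(unlabeled)
    w, a, b = finetune(w0, labeled)
    return w0, accuracy(w, a, b, test)
```

Calling `run_demo()` returns the pre-trained encoder weights and the test accuracy of the fine-tuned model on the synthetic data. The sketch only shows how the labeled set can stay small when the encoder has already been shaped by unlabeled data; it says nothing about the accuracy figures reported for Virus-MNIST.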

Notes

  1. https://www.kaggle.com/datamunge/virusmnist

  2. https://github.com/reveondivad/virus-mnist

Acknowledgment

We would like to thank Jie Yuan from Iowa State University for the valuable discussions and insightful comments. This work was supported by the National Natural Science Foundation of China (Grants 61871378 and U2003111) and the Defense Industrial Technology Development Program (Grant JCKY2021906A001).

Author information

Corresponding author

Correspondence to XiaoYu Zhang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ding, Y. et al. (2022). Malware Classification Based on Semi-Supervised Learning. In: Su, C., Sakurai, K., Liu, F. (eds) Science of Cyber Security. SciSec 2022. Lecture Notes in Computer Science, vol 13580. Springer, Cham. https://doi.org/10.1007/978-3-031-17551-0_19

  • DOI: https://doi.org/10.1007/978-3-031-17551-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17550-3

  • Online ISBN: 978-3-031-17551-0

  • eBook Packages: Computer Science, Computer Science (R0)
