Clustering for malware classification

  • Original Paper
Journal of Computer Virology and Hacking Techniques

Abstract

In this research, we apply clustering techniques to the malware classification problem. We compute clusters using the well-known K-means and Expectation Maximization (EM) algorithms, with the underlying scores based on Hidden Markov Models (HMMs). We compare the results obtained from these two clustering approaches, and we carefully consider how the dimension (i.e., the number of models used for clustering) and the number of clusters interact in determining the accuracy of the clustering.
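To make the setup in the abstract concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes each sample is represented by a vector of scores, one per HMM, so the vector length is the "dimension". All score values, family labels, and function names below are hypothetical. Only K-means is sketched (EM is omitted), centroids are seeded with a deterministic farthest-first rule chosen here for reproducibility, and cluster purity is used as one possible stand-in for clustering accuracy.

```python
from collections import Counter

# Hypothetical per-sample scores: one column per HMM (made-up values).
SCORES = [
    (-2.0, -8.1, -7.9), (-2.2, -7.8, -8.2), (-1.9, -8.0, -8.0),  # family A
    (-8.0, -2.1, -7.9), (-7.9, -1.9, -8.1), (-8.2, -2.0, -8.0),  # family B
    (-8.1, -8.0, -2.0), (-7.8, -8.2, -2.2), (-8.0, -7.9, -1.9),  # family C
]
FAMILIES = ["A"] * 3 + ["B"] * 3 + ["C"] * 3


def sq_dist(p, q):
    """Squared Euclidean distance between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all seeds."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iters=100):
    """Lloyd-style K-means; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def purity(labels, families):
    """Fraction of samples that belong to the majority family of their cluster."""
    counts = {}
    for lab, fam in zip(labels, families):
        counts.setdefault(lab, Counter())[fam] += 1
    return sum(c.most_common(1)[0][1] for c in counts.values()) / len(labels)


# Vary the dimension (number of HMM score columns used) and the number of clusters.
for dim in (1, 2, 3):
    points = [s[:dim] for s in SCORES]
    for k in (2, 3, 4):
        _, labels = kmeans(points, k)
        print(f"dimension={dim} clusters={k} purity={purity(labels, FAMILIES):.2f}")
```

On this toy data, the loop simply prints one purity value per (dimension, number of clusters) pair. The numbers say nothing about the paper's results; they only illustrate the kind of grid the abstract describes.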

Author information

Corresponding author

Correspondence to Mark Stamp.

About this article

Cite this article

Pai, S., Troia, F.D., Visaggio, C.A. et al. Clustering for malware classification. J Comput Virol Hack Tech 13, 95–107 (2017). https://doi.org/10.1007/s11416-016-0265-3

  • DOI: https://doi.org/10.1007/s11416-016-0265-3
