Abstract
In this research, we apply clustering techniques to the malware classification problem. We compute clusters using the well-known K-means and Expectation Maximization algorithms, with the underlying scores based on Hidden Markov Models. We compare the results obtained from these two clustering approaches, and we carefully consider how the dimension (i.e., the number of models used for clustering) and the number of clusters interact to affect the accuracy of the clustering.
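To make the setup concrete, the following is a minimal sketch of K-means applied to score vectors, not the authors' implementation. It assumes each sample is represented by a vector with one score per Hidden Markov Model, so the vector length is the "dimension" mentioned above. The score values, the family labels, the helper names (`farthest_first`, `kmeans`, `purity`), the deterministic farthest-first initialization, and the use of purity as an accuracy measure are all illustrative assumptions. The Expectation Maximization variant is not shown.

```python
import math


def farthest_first(points, k):
    """Deterministic initialization (an assumption, chosen for reproducibility):
    start from the first point, then repeatedly add the point farthest
    from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iterations=100):
    """Lloyd's K-means on a list of equal-length score vectors."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def purity(labels, families):
    """Fraction of samples that belong to the majority family of their cluster."""
    total = 0
    for c in set(labels):
        members = [f for f, lab in zip(families, labels) if lab == c]
        total += max(members.count(f) for f in set(members))
    return total / len(labels)


# Hypothetical data: each row holds the scores of one sample under two HMMs
# (dimension 2). These numbers are made up for illustration only.
scores = [
    (-2.1, -9.8), (-2.4, -10.1), (-1.9, -9.5),   # samples from family "A"
    (-9.7, -2.2), (-10.2, -2.0), (-9.9, -2.5),   # samples from family "B"
]
families = ["A", "A", "A", "B", "B", "B"]

labels, centroids = kmeans(scores, k=2)
print(labels, purity(labels, families))
```

Varying the length of the score vectors (the number of models) and the value of `k` in a sketch like this is one way to explore the interplay between dimension and number of clusters that the abstract describes.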
Cite this article
Pai, S., Troia, F.D., Visaggio, C.A. et al. Clustering for malware classification. J Comput Virol Hack Tech 13, 95–107 (2017). https://doi.org/10.1007/s11416-016-0265-3
DOI: https://doi.org/10.1007/s11416-016-0265-3