Abstract
VILO is a lazy learner system designed for malware classification and triage. It implements a nearest neighbor (NN) algorithm with similarities computed over Term Frequency \(\times\) Inverse Document Frequency (TFIDF) weighted opcode mnemonic permutation features (N-perms). Being an NN classifier, VILO makes minimal structural assumptions about class boundaries and is thus well suited to the constantly changing malware population. This paper presents an extensive study of the application of VILO in malware analysis. Our experiments demonstrate that (a) VILO is a rapid learner of malware families, i.e., VILO’s learning curve stabilizes at high accuracies quickly (training on fewer than 20 variants per family is sufficient); (b) similarity scores derived from TFIDF weighted features should primarily be treated as ordinal measurements; and (c) VILO with N-perm feature vectors outperforms traditional N-gram feature vectors when used to classify real-world malware into their respective families.
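The pipeline summarized above (N-perm features, TFIDF weighting, nearest-neighbor lookup) can be illustrated with a minimal Python sketch. This is not VILO's implementation. It assumes that an N-perm is an N-gram of opcode mnemonics with the internal order ignored, that similarity is the cosine between TFIDF vectors, and that IDF takes the form log(N/df) + 1; the abstract specifies none of these details. All function names and the toy opcode sequences are hypothetical.

```python
import math
from collections import Counter


def n_perms(opcodes, n=2):
    """Count N-perms: sliding windows of n opcodes with order ignored (assumed definition)."""
    return Counter(
        tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1)
    )


def fit_idf(train_docs, n=2):
    """Compute IDF weights over the training set; log(N/df) + 1 is an assumed variant."""
    counts = [n_perms(doc, n) for doc in train_docs]
    df = Counter(feature for c in counts for feature in c)
    total = len(train_docs)
    return {feature: math.log(total / d) + 1.0 for feature, d in df.items()}


def vectorize(opcodes, idf, n=2):
    """TFIDF vector as a sparse dict; features unseen in training are dropped."""
    return {f: tf * idf[f] for f, tf in n_perms(opcodes, n).items() if f in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(query, train_docs, labels, n=2):
    """Return the label and similarity score of the single nearest training sample."""
    idf = fit_idf(train_docs, n)
    train_vecs = [vectorize(doc, idf, n) for doc in train_docs]
    q = vectorize(query, idf, n)
    scores = [cosine(q, t) for t in train_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


if __name__ == "__main__":
    # Toy opcode sequences, invented purely for illustration.
    train = [
        ["push", "mov", "call", "pop", "ret"],
        ["xor", "add", "xor", "sub", "jmp"],
    ]
    families = ["famA", "famB"]
    # The query reorders adjacent instructions of the first sample.
    print(classify(["mov", "push", "call", "ret", "pop"], train, families))
```

Because the classifier only picks the highest-scoring neighbor, the sketch relies on the ranking of the scores rather than their magnitudes, which fits the abstract's advice to treat the similarity scores as ordinal measurements.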
Notes
NGVCK (Next Generation Virus Creation Kit) is a metamorphic virus generator that outputs syntactically different, semantically equivalent x86 ASM source code for viruses.
Acknowledgments
The authors are grateful for Prof. Mihai Giurcanu’s help in identifying proper statistical evaluation methods. Furthermore, we wish to thank Suresh Golconda, Chris Parich, Michael Venable, Matthew Hayes, and Christopher Thompson for their past work, without which this paper would not have been possible.
Additional information
This research was sponsored in part by funds from the Air Force Research Lab and DARPA (FA8750-10-C-0171) and from the Air Force Office of Scientific Research (FA9550-09-1-0715).
Cite this article
Lakhotia, A., Walenstein, A., Miles, C. et al. VILO: a rapid learning nearest-neighbor classifier for malware triage. J Comput Virol Hack Tech 9, 109–123 (2013). https://doi.org/10.1007/s11416-013-0178-3
DOI: https://doi.org/10.1007/s11416-013-0178-3