
VILO: a rapid learning nearest-neighbor classifier for malware triage

Journal of Computer Virology and Hacking Techniques

Abstract

VILO is a lazy learner system designed for malware classification and triage. It implements a nearest neighbor (NN) algorithm with similarities computed over Term Frequency \(\times\) Inverse Document Frequency (TFIDF) weighted opcode mnemonic permutation features (N-perms). Being an NN-classifier, VILO makes minimal structural assumptions about class boundaries, and thus is well suited for the constantly changing malware population. This paper presents an extensive study of the application of VILO in malware analysis. Our experiments demonstrate that (a) VILO is a rapid learner of malware families, i.e., VILO’s learning curve stabilizes at high accuracies quickly (training on fewer than 20 variants per family is sufficient); (b) similarity scores derived from TFIDF weighted features should primarily be treated as ordinal measurements; and (c) VILO with N-perm feature vectors outperforms traditional N-gram feature vectors when used to classify real-world malware into their respective families.
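To make the pipeline in the abstract concrete, the following is a minimal sketch of a nearest-neighbor classifier over TFIDF-weighted N-perm features. It is not VILO's implementation. It assumes that an N-perm is an N-gram of opcode mnemonics with the ordering inside the window ignored, that the inverse document frequency is log(N / df), and that similarity is the cosine of the weighted vectors; the abstract specifies none of these details. All function names, family labels, and opcode sequences are hypothetical.

```python
import math
from collections import Counter


def n_perms(opcodes, n=3):
    """Count order-insensitive windows of n consecutive opcode mnemonics.

    Assumption: an N-perm is an N-gram whose internal order is ignored,
    modeled here by sorting each window before counting it.
    """
    return Counter(
        tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1)
    )


def idf_table(train_counts):
    """Map each feature to log(N / df) over the training samples (assumed form)."""
    n_docs = len(train_counts)
    # Iterating a Counter yields each distinct feature once per sample,
    # so df holds document frequencies rather than raw term counts.
    df = Counter(feature for counts in train_counts for feature in counts)
    return {feature: math.log(n_docs / df[feature]) for feature in df}


def weight(counts, idf):
    """Scale term frequencies by IDF; features unseen in training are dropped."""
    return {f: tf * idf[f] for f, tf in counts.items() if f in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(query_opcodes, training, n=3):
    """Return (score, label) of the most similar training sample.

    training is a list of (family_label, opcode_list) pairs. Being a lazy
    learner, the sketch does all of its work at query time.
    """
    counts = [n_perms(ops, n) for _, ops in training]
    idf = idf_table(counts)
    vectors = [weight(c, idf) for c in counts]
    query = weight(n_perms(query_opcodes, n), idf)
    return max(
        (cosine(query, vec), label)
        for vec, (label, _) in zip(vectors, training)
    )


# Hypothetical toy data: two variants of one family and one unrelated sample.
training = [
    ("FamilyA", ["push", "mov", "call", "pop", "ret"]),
    ("FamilyA", ["push", "mov", "call", "add", "ret"]),
    ("FamilyB", ["xor", "add", "jmp", "xor", "add", "jmp"]),
]
# A variant of the first sample with its first two instructions swapped.
query = ["mov", "push", "call", "pop", "ret"]

score, label = classify(query, training, n=2)
print(label, round(score, 3))
```

In this toy run the query is assigned to FamilyA even though two of its instructions are reordered, because sorted windows absorb local reordering. In line with finding (b), the returned score is best used to rank candidate neighbors rather than read as an absolute measure of similarity.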

Notes

  1. http://www.gnu.org/software/binutils/.

  2. http://www.hex-rays.com/products/ida/index.shtml.

  3. http://www.ollydbg.de/.

  4. NGVCK (Next Generation Virus Creation Kit) is a metamorphic virus generator that outputs syntactically different, semantically equivalent x86 ASM source code for viruses.

Acknowledgments

The authors are grateful for Prof. Mihai Giurcanu’s help in identifying proper statistical evaluation methods. Furthermore, we wish to thank Suresh Golconda, Chris Parich, Michael Venable, Matthew Hayes, and Christopher Thompson for their past work, without which this paper would not have been possible.

Author information

Corresponding author

Correspondence to Arun Lakhotia.

Additional information

This research was sponsored in part by funds from the Air Force Research Lab and DARPA (FA8750-10-C-0171) and from the Air Force Office of Scientific Research (FA9550-09-1-0715).

Appendix

Learning curves derived from usage of both N-perm and N-gram VILO feature vectors for Backdoor.Win32.Hupigon, Backdoor.Win32.PcClient, Rootkit.Win32.Agent, and Virus.Win32.Parite are shown herein (Figs. 6, 7, 8, 9).

Fig. 6 Backdoor.Win32.Hupigon Learning Curves

Fig. 7 Backdoor.Win32.PcClient Learning Curves

Fig. 8 Rootkit.Win32.Agent Learning Curves

Fig. 9 Virus.Win32.Parite Learning Curves

About this article

Cite this article

Lakhotia, A., Walenstein, A., Miles, C. et al. VILO: a rapid learning nearest-neighbor classifier for malware triage. J Comput Virol Hack Tech 9, 109–123 (2013). https://doi.org/10.1007/s11416-013-0178-3
