VILO: a rapid learning nearest-neighbor classifier for malware triage

Lakhotia, Arun; Walenstein, Andrew; Miles, Craig; Singh, Anshuman

doi:10.1007/s11416-013-0178-3

Original Paper
Published: 05 March 2013

Volume 9, pages 109–123, (2013)

Journal of Computer Virology and Hacking Techniques

Arun Lakhotia¹,
Andrew Walenstein²,
Craig Miles¹ &
Anshuman Singh¹

Abstract

VILO is a lazy learner system designed for malware classification and triage. It implements a nearest neighbor (NN) algorithm with similarities computed over Term Frequency × Inverse Document Frequency (TFIDF) weighted opcode mnemonic permutation features (N-perms). Being an NN-classifier, VILO makes minimal structural assumptions about class boundaries, and thus is well suited for the constantly changing malware population. This paper presents an extensive study of the application of VILO in malware analysis. Our experiments demonstrate that (a) VILO is a rapid learner of malware families, i.e., VILO’s learning curve stabilizes at high accuracies quickly (training on fewer than 20 variants per family is sufficient); (b) similarity scores derived from TFIDF weighted features should primarily be treated as ordinal measurements; and (c) VILO with N-perm feature vectors outperforms traditional N-gram feature vectors when used to classify real-world malware into their respective families.
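The pipeline described in the abstract can be illustrated with a minimal sketch. It is not VILO's implementation: the function names are hypothetical, an N-perm is assumed to be a window of N consecutive opcode mnemonics with ordering ignored, and both the smoothed IDF formula and the use of cosine similarity are assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter


def n_perms(opcodes, n=2):
    """Windows of n consecutive opcodes with order ignored (assumed N-perm definition)."""
    return [tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1)]


def tfidf_vectors(docs):
    """Build sparse TFIDF vectors from lists of features; returns (vectors, idf)."""
    n_docs = len(docs)
    # Document frequency: in how many samples each feature occurs.
    df = Counter(f for doc in docs for f in set(doc))
    # Smoothed IDF (an assumption; the paper may use a different variant).
    idf = {f: math.log((1 + n_docs) / (1 + c)) + 1.0 for f, c in df.items()}
    vecs = [{f: tf * idf[f] for f, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(query_opcodes, training, n=2):
    """Return (family, score) of the nearest training sample.

    training is a list of (family_label, opcode_list) pairs. Being a lazy
    learner, the sketch does no model fitting beyond indexing the samples.
    """
    docs = [n_perms(ops, n) for _, ops in training]
    vecs, idf = tfidf_vectors(docs)
    # Features never seen in training get weight 0 in the query vector.
    query = {f: tf * idf.get(f, 0.0)
             for f, tf in Counter(n_perms(query_opcodes, n)).items()}
    scores = [(cosine(query, v), family) for (family, _), v in zip(training, vecs)]
    best_score, best_family = max(scores)
    return best_family, best_score


if __name__ == "__main__":
    training = [
        ("famA", ["push", "mov", "call", "pop", "ret"]),
        ("famB", ["xor", "add", "jmp", "xor", "add", "jmp"]),
    ]
    # A variant of famA with adjacent instructions reordered.
    print(classify(["mov", "push", "call", "ret", "pop"], training))
```

Because the windows are order-insensitive, swapping adjacent instructions leaves some features unchanged, which is the intuition behind preferring N-perms to N-grams. In line with finding (b), the returned score is best read as a ranking signal rather than a calibrated probability.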

Notes

http://www.gnu.org/software/binutils/.
http://www.hex-rays.com/products/ida/index.shtml.
http://www.ollydbg.de/.
NGVCK (Next Generation Virus Creation Kit) is a metamorphic virus generator that outputs syntactically different, semantically equivalent x86 ASM source code for viruses.

Acknowledgments

The authors are grateful for Prof. Mihai Giurcanu’s help in identifying proper statistical evaluation methods. Furthermore, we wish to thank Suresh Golconda, Chris Parich, Michael Venable, Matthew Hayes, and Christopher Thompson for their past work, without which this paper would not have been possible.

Author information

Authors and Affiliations

Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, LA, USA
Arun Lakhotia, Craig Miles & Anshuman Singh
School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA, USA
Andrew Walenstein

Corresponding author

Correspondence to Arun Lakhotia.

Additional information

This research work was sponsored in part by funds from Air Force Research Lab and DARPA (FA8750-10-C-0171) and from Air Force Office of Scientific Research (FA9550-09-1-0715).

Appendix

This appendix shows the learning curves obtained with both N-perm and N-gram VILO feature vectors for Backdoor.Win32.Hupigon, Backdoor.Win32.PcClient, Rootkit.Win32.Agent, and Virus.Win32.Parite (Figs. 6, 7, 8, 9).

About this article

Cite this article

Lakhotia, A., Walenstein, A., Miles, C. et al. VILO: a rapid learning nearest-neighbor classifier for malware triage. J Comput Virol Hack Tech 9, 109–123 (2013). https://doi.org/10.1007/s11416-013-0178-3

Received: 12 August 2012
Accepted: 28 January 2013
Published: 05 March 2013
Issue Date: August 2013
DOI: https://doi.org/10.1007/s11416-013-0178-3
