Using String Information for Malware Family Identification

Shrestha, Prasha; Maharjan, Suraj; de la Rosa, Gabriela Ramírez; Sprague, Alan; Solorio, Thamar; Warner, Gary

doi:10.1007/978-3-319-12027-0_55

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8864)

Included in the following conference series:

Ibero-American Conference on Artificial Intelligence

Abstract

Classifying malware into the correct families is an important task for anti-virus vendors. Currently, only some vendors recognize any particular malware sample. Even when they do, they either assign it to different families or use a generic family name that conveys little information. Our method for malware family identification is based on the observation that closely related malware samples share many of the same strings. We first created two kinds of prototypes from the printable strings in the malware: one using term frequency–inverse document frequency (tf-idf) and the other using prominent strings extracted from the vocabulary. We then used these prototypes for classification. We achieved an accuracy of 91.02% using the entire vocabulary and 80.52% using 20 prominent strings per malware family. This accuracy is high enough for our system to classify even malware that confuses anti-virus vendors.
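
The abstract gives only the outline of the tf-idf prototype approach, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes printable strings are runs of at least four printable ASCII bytes (as in the Unix strings utility), that each family prototype is the normalized centroid of its samples' tf-idf vectors, and that a new sample is assigned to the family whose prototype has the highest cosine similarity. The function names, the idf smoothing, and the family labels family_a and family_b are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: a "printable string" is a run of 4+ printable ASCII bytes.
PRINTABLE = re.compile(rb"[\x20-\x7e]{4,}")


def printable_strings(blob: bytes) -> list[str]:
    """Extract printable ASCII strings from raw bytes."""
    return [m.decode("ascii") for m in PRINTABLE.findall(blob)]


def normalize(vec: dict[str, float]) -> dict[str, float]:
    """Scale a sparse vector to unit length (empty if it has no weight)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_prototypes(samples: list[tuple[str, bytes]]):
    """Return ({family: prototype vector}, idf) from (family, bytes) pairs."""
    docs = [printable_strings(blob) for _, blob in samples]
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each string once per sample
    # Assumption: idf = ln(N / df) + 1, one common smoothed variant.
    idf = {t: math.log(len(docs) / n) + 1.0 for t, n in df.items()}

    sums: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for (family, _), doc in zip(samples, docs):
        vec = normalize({t: c * idf[t] for t, c in Counter(doc).items()})
        for t, w in vec.items():
            sums[family][t] += w
    return {fam: normalize(dict(v)) for fam, v in sums.items()}, idf


def classify(blob: bytes, prototypes, idf) -> str:
    """Assign a sample to the family with the most similar prototype."""
    counts = Counter(printable_strings(blob))
    # Strings never seen in training have no idf and are ignored.
    vec = normalize({t: c * idf[t] for t, c in counts.items() if t in idf})

    def cosine(proto: dict[str, float]) -> float:
        return sum(w * proto.get(t, 0.0) for t, w in vec.items())

    return max(prototypes, key=lambda fam: cosine(prototypes[fam]))


# Toy training data; the byte strings are invented for illustration only.
training = [
    ("family_a", b"\x00\x01http://evil.example/gate.php\x00cfg.bin\x00user32.dll\x00"),
    ("family_a", b"\xffhttp://evil.example/gate.php\x02\x03webinject\x00user32.dll\x00"),
    ("family_b", b"\x00loader.conf\x00\x01payload_x.dll\x00user32.dll\x00"),
    ("family_b", b"\x05payload_x.dll\x00\x00queue.dat\x00user32.dll\x00"),
]
prototypes, idf = build_prototypes(training)
unknown = b"\x00\x00http://evil.example/gate.php\x00newstuff\x00user32.dll"
print(classify(unknown, prototypes, idf))
```

In this toy run the unknown sample shares a family-specific string with family_a, while the string common to every sample (user32.dll) receives the lowest idf weight and contributes little to the decision.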

Author information

Authors and Affiliations

University of Alabama at Birmingham, Birmingham, AL, USA
Prasha Shrestha, Suraj Maharjan, Alan Sprague, Thamar Solorio & Gary Warner
Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Mexico, Mexico
Gabriela Ramírez de la Rosa

Corresponding author

Correspondence to Prasha Shrestha.

Editor information

Editors and Affiliations

Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Ana L.C. Bazzan
Pontificia Universidad Católica (PUC), Santiago de Chile, Chile
Karim Pichara

About this paper

Cite this paper

Shrestha, P., Maharjan, S., de la Rosa, G.R., Sprague, A., Solorio, T., Warner, G. (2014). Using String Information for Malware Family Identification. In: Bazzan, A., Pichara, K. (eds) Advances in Artificial Intelligence -- IBERAMIA 2014. IBERAMIA 2014. Lecture Notes in Computer Science, vol 8864. Springer, Cham. https://doi.org/10.1007/978-3-319-12027-0_55

DOI: https://doi.org/10.1007/978-3-319-12027-0_55
Published: 12 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12026-3
Online ISBN: 978-3-319-12027-0
eBook Packages: Computer Science, Computer Science (R0)
