Malware clustering using suffix trees

Oprişa, Ciprian; Cabău, George; Sebestyen Pal, Gheorghe

doi:10.1007/s11416-014-0227-6

Invited Paper
Published: 24 October 2014

Volume 12, pages 1–10 (2016)

Journal of Computer Virology and Hacking Techniques

Ciprian Oprişa¹,
George Cabău¹ &
Gheorghe Sebestyen Pal²

Abstract

Clustering is an important problem in malware research, as the number of malicious samples that appear every day makes manual analysis impractical. Although these samples belong to a limited number of malware families, it is difficult to categorize them automatically as obfuscation is involved. By extracting relevant features we can apply clustering algorithms, then only analyze a couple of representatives from each cluster. However, classic clustering algorithms that compute the similarity between each pair of samples are slow when a large collection is involved. In this paper, the features will be strings of operation codes extracted from the binary code of each sample. With a modified suffix tree data structure we can find long enough substrings that correspond to portions of a program’s code. These substrings must be filtered against a database of known substrings so that common library code will be ignored. The items that have common substrings above a certain threshold will be grouped into the same cluster. Our algorithm was tested with data extracted from real-world malware and constructed quality clusters.
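The pipeline described in the abstract (extract opcode strings, find shared substrings, discard known library code, group samples above a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the modified suffix tree is replaced here by a plain inverted index over fixed-length opcode k-grams, which is simpler but less general, and the grouping step uses a union-find structure as an assumed way to merge samples. All names (`kgrams`, `DisjointSet`, `cluster_samples`) and the parameters `k` and `min_shared` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kgrams(opcodes, k):
    """Return the set of all length-k windows of an opcode sequence."""
    return {tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)}


class DisjointSet:
    """Union-find with path halving, used to merge samples into clusters."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def cluster_samples(samples, known_library, k=4, min_shared=2):
    """Group samples that share at least `min_shared` non-library k-grams.

    samples: dict mapping a sample name to its list of opcode mnemonics.
    known_library: set of k-gram tuples treated as common library code.
    """
    # Inverted index: k-gram -> names of samples that contain it.
    # Library k-grams are filtered out before indexing.
    owners = defaultdict(set)
    for name, opcodes in samples.items():
        for gram in kgrams(opcodes, k) - known_library:
            owners[gram].add(name)

    # Count shared k-grams only for pairs that co-occur under some k-gram,
    # so pairs with nothing in common are never examined.
    shared = defaultdict(int)
    for names in owners.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    # Merge every pair whose shared count reaches the threshold.
    dsu = DisjointSet(samples)
    for (a, b), count in shared.items():
        if count >= min_shared:
            dsu.union(a, b)

    clusters = defaultdict(set)
    for name in samples:
        clusters[dsu.find(name)].add(name)
    return sorted(sorted(c) for c in clusters.values())


# Toy opcode sequences, invented for illustration only.
samples = {
    "a": ["push", "mov", "xor", "call", "add", "ret"],
    "b": ["nop", "push", "mov", "xor", "call", "add", "jmp"],
    "c": ["push", "mov", "sub", "pop", "ret", "ret"],
}
print(cluster_samples(samples, known_library=set()))
```

With the toy data, samples `a` and `b` share two 4-grams and land in one cluster, while `c` stays alone. Adding one of those shared 4-grams to `known_library` drops the pair below the threshold, which mirrors the role of the library-code filter. A suffix tree would instead find common substrings of any length without fixing `k` in advance.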

Author information

Authors and Affiliations

Bitdefender, 1, Cuza Vodă Street, City Business Center, 400107, Cluj-Napoca, Romania
Ciprian Oprişa & George Cabău
Technical University of Cluj-Napoca, 28, Gh. Bariţiu Street, Room M01A, 400027, Cluj-Napoca, Romania
Gheorghe Sebestyen Pal

Corresponding author

Correspondence to Ciprian Oprişa.

About this article

Cite this article

Oprişa, C., Cabău, G. & Sebestyen Pal, G. Malware clustering using suffix trees. J Comput Virol Hack Tech 12, 1–10 (2016). https://doi.org/10.1007/s11416-014-0227-6

Received: 11 September 2014
Accepted: 07 October 2014
Published: 24 October 2014
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11416-014-0227-6
