Malware clustering using suffix trees

  • Invited Paper
Journal of Computer Virology and Hacking Techniques

Abstract

Clustering is an important problem in malware research, as the number of malicious samples that appear every day makes manual analysis impractical. Although these samples belong to a limited number of malware families, obfuscation makes it difficult to categorize them automatically. By extracting relevant features, we can apply clustering algorithms and then analyze only a few representatives from each cluster. However, classic clustering algorithms that compute the similarity between each pair of samples are slow on large collections. In this paper, the features are strings of operation codes (opcodes) extracted from the binary code of each sample. Using a modified suffix tree data structure, we find sufficiently long substrings that correspond to portions of a program’s code. These substrings are filtered against a database of known substrings so that common library code is ignored. Samples that share common substrings above a certain threshold are grouped into the same cluster. Our algorithm was tested on data extracted from real-world malware and produced good-quality clusters.
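
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: in place of the modified suffix tree it uses a plain dictionary of fixed-length opcode windows, which is simpler but uses more memory. Two samples share a window of length `k` exactly when they have a common substring of length at least `k`. The function names, the parameters `k`, `min_shared` and `known`, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def windows(opcodes, k):
    """Return the set of all length-k substrings (as tuples) of an opcode list."""
    return {tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)}


def cluster(samples, k=4, min_shared=1, known=frozenset()):
    """Group samples that share at least `min_shared` length-k opcode windows.

    samples: dict mapping a sample name to its list of opcode mnemonics
    known:   length-k windows to ignore (stand-in for the known-substring database)
    """
    # Inverted index: window -> names of the samples that contain it.
    index = defaultdict(set)
    for name, opcodes in samples.items():
        for w in windows(opcodes, k) - known:
            index[w].add(name)

    # Count shared windows only for pairs that co-occur in the index,
    # instead of comparing every pair of samples directly.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    # Disjoint-set (union-find) structure with path halving.
    parent = {name: name for name in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in samples:
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical opcode sequences, for illustration only.
samples = {
    "a": ["push", "mov", "call", "xor", "ret"],
    "b": ["nop", "push", "mov", "call", "xor", "jmp"],
    "c": ["add", "sub", "mul", "div", "inc"],
    "d": ["lea", "push", "mov", "call", "test"],
}
print(cluster(samples, k=4))
```

Here samples `a` and `b` share the window `push mov call xor` and end up in one cluster, while `c` and `d` remain alone. Passing that window in `known` mimics the filtering of common library code and separates `a` from `b` again.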

Author information

Corresponding author

Correspondence to Ciprian Oprişa.

About this article

Cite this article

Oprişa, C., Cabău, G. & Sebestyen Pal, G. Malware clustering using suffix trees. J Comput Virol Hack Tech 12, 1–10 (2016). https://doi.org/10.1007/s11416-014-0227-6

  • DOI: https://doi.org/10.1007/s11416-014-0227-6
