Abstract
Clustering is an important problem in malware research, because the number of malicious samples that appear every day makes manual analysis impractical. Although these samples belong to a limited number of malware families, obfuscation makes it difficult to categorize them automatically. By extracting relevant features, we can apply clustering algorithms and then analyze only a few representatives from each cluster. However, classic clustering algorithms that compute the similarity between every pair of samples are slow on large collections. In this paper, the features are strings of operation codes (opcodes) extracted from the binary code of each sample. Using a modified suffix tree data structure, we find sufficiently long substrings that correspond to portions of a program’s code. These substrings are filtered against a database of known substrings so that common library code is ignored. Samples that have common substrings above a certain threshold are grouped into the same cluster. Our algorithm was tested on data extracted from real-world malware and produced good-quality clusters.
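
The pipeline described above (opcode strings, shared substrings, filtering of known library code, threshold-based grouping) can be illustrated with a minimal sketch. It is not the paper's implementation: it swaps the modified suffix tree for a plain inverted index over fixed-length opcode k-grams, and it assumes the threshold counts the number of shared k-grams. The names `kgrams`, `cluster_samples`, `k`, `threshold`, and `known` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kgrams(opcodes, k):
    """Return the set of all length-k opcode substrings (as tuples)."""
    return {tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)}


def cluster_samples(samples, k=4, threshold=2, known=frozenset()):
    """Group samples sharing at least `threshold` non-library k-grams.

    samples: dict mapping a sample name to its list of opcodes.
    known:   set of k-gram tuples treated as common library code
             (a stand-in for the database of known substrings).
    """
    # Inverted index: k-gram -> names of the samples containing it.
    # This replaces the suffix tree used in the paper (assumption).
    index = defaultdict(set)
    for name, opcodes in samples.items():
        for gram in kgrams(opcodes, k) - known:
            index[gram].add(name)

    # Count shared k-grams for each pair of samples that co-occur.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    # Union-find merges every pair that meets the threshold.
    parent = {name: name for name in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for name in samples:
        clusters[find(name)].add(name)
    return sorted(sorted(c) for c in clusters.values())
```

Only samples that share at least one k-gram are ever compared, which avoids scoring every pair in the collection. A popular k-gram still produces many pairs in this sketch, which is one reason a dedicated structure such as a suffix tree is attractive at scale.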






Cite this article
Oprişa, C., Cabău, G. & Sebestyen Pal, G. Malware clustering using suffix trees. J Comput Virol Hack Tech 12, 1–10 (2016). https://doi.org/10.1007/s11416-014-0227-6
DOI: https://doi.org/10.1007/s11416-014-0227-6