Classy: fast clustering streams of call-graphs

Kostakis, Orestis

doi:10.1007/s10618-014-0367-9

Published: 10 July 2014

Data Mining and Knowledge Discovery, volume 28, pages 1554–1585 (2014)

Orestis Kostakis^1,2

Abstract

The call-graph is an abstraction that is resilient to common malware obfuscation techniques. A call-graph represents an executable file as a directed graph with labeled vertices, where the vertices correspond to functions and the edges to function calls. Unfortunately, most of the interesting graph comparison problems, including full-graph comparison and computing the largest common subgraph, are \(NP\)-hard. This makes graphs difficult to study and use in large-scale systems. Existing work has focused only on offline clustering and has not addressed the clustering of streams of graphs. In this paper we present Classy, a scalable distributed system that clusters streams of large call-graphs, for purposes that include automated malware classification and assisting malware analysts. Since algorithms designed for clustering sets are not suitable for clustering streams of objects, we propose a clustering algorithm that relies on the notion of candidate clusters and reference samples within them. We demonstrate via thorough experimentation that this approach yields results very close to the offline optimal. Graph similarity is determined by computing a graph edit distance (GED) between pairs of graphs using an adapted version of simulated annealing. Furthermore, we present a novel lower bound for the GED. We also study the problem of approximating statistics of clusters of graphs when the distances of only a fraction of all possible pairs have been computed. Finally, we present results and statistics from a real production system that has clustered, and now contains, more than 0.8 million graphs.
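To make the idea of clustering a stream through reference samples more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm from the paper: the abstract does not give the assignment rule, so the threshold test, the cap on reference samples per cluster, and the names `StreamClusterer`, `label_distance`, `threshold` and `max_refs` are all hypothetical. The distance used here is a simple stand-in on sets of function labels, not the simulated-annealing GED.

```python
from typing import Callable, FrozenSet, List

# Assumption: a call-graph is reduced to the set of its function labels.
# A real system would keep the full directed graph.
Graph = FrozenSet[str]


def label_distance(a: Graph, b: Graph) -> float:
    """Stand-in distance: normalized symmetric difference of label sets.

    This is NOT graph edit distance; it only keeps the sketch runnable.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


class StreamClusterer:
    """Assigns each incoming graph by comparing it to a few reference samples."""

    def __init__(self, distance: Callable[[Graph, Graph], float],
                 threshold: float = 0.3, max_refs: int = 3) -> None:
        self.distance = distance
        self.threshold = threshold      # hypothetical join threshold
        self.max_refs = max_refs        # hypothetical cap on references per cluster
        self.references: List[List[Graph]] = []
        self.sizes: List[int] = []

    def add(self, graph: Graph) -> int:
        """Return the cluster id chosen for `graph`, opening a new cluster if needed."""
        for cid, refs in enumerate(self.references):
            # Only the reference samples are compared, never the whole cluster.
            if any(self.distance(graph, r) <= self.threshold for r in refs):
                self.sizes[cid] += 1
                if len(refs) < self.max_refs:
                    refs.append(graph)
                return cid
        # No candidate cluster is close enough: the graph seeds a new cluster.
        self.references.append([graph])
        self.sizes.append(1)
        return len(self.references) - 1


stream = [
    frozenset({"main", "open", "read", "close"}),
    frozenset({"main", "open", "read", "close", "log"}),
    frozenset({"start", "connect", "send"}),
    frozenset({"start", "connect", "send", "recv"}),
]
clusterer = StreamClusterer(label_distance)
assignments = [clusterer.add(g) for g in stream]
print(assignments)
```

The point of the design is that each arriving graph costs at most `max_refs` distance computations per cluster, which matters when a single distance evaluation is as expensive as an approximate GED.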

Acknowledgments

This work was supported by TEKES as part of the Future Internet Programme of TIVIT (Finnish Strategic Centre for Science, Technology and Innovation in the field of ICT). Special thanks to Paolo Palumbo for providing the file filtering rules, Gergely Erdélyi for his support on IDA Python and the call-graph unpacking code, and Stefan Lundström for the early integration of the system with the backend APIs.

Author information

Authors and Affiliations

Labs, F-Secure, Tammasaarenkatu 7, 00180, Helsinki, Finland
Orestis Kostakis
Aalto University, 02150, Espoo, Finland
Orestis Kostakis

Corresponding author

Correspondence to Orestis Kostakis.

Additional information

Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

The work was done while the author was with F-Secure.

About this article

Cite this article

Kostakis, O. Classy: fast clustering streams of call-graphs. Data Min Knowl Disc 28, 1554–1585 (2014). https://doi.org/10.1007/s10618-014-0367-9

Received: 28 February 2014
Accepted: 18 June 2014
Published: 10 July 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10618-014-0367-9
