Elsevier

Computers & Security

Volume 39, Part B, November 2013, Pages 419-430
Deriving common malware behavior through graph clustering

https://doi.org/10.1016/j.cose.2013.09.006

Abstract

Detection of malicious software (malware) continues to be a problem as hackers devise new ways to evade available methods. The proliferation of malware and malware variants requires new advanced methods to detect them. This paper proposes a method to construct a common behavioral graph representing the execution behavior of a family of malware instances. The method generates one common behavioral graph by clustering a set of individual behavioral graphs, which represent kernel objects and their attributes based on system call traces. The resulting common behavioral graph has a common path, called HotPath, which is observed in all the malware instances in the same family. The proposed method shows high detection rates and false positive rates close to 0%. The derived common behavioral graph is highly scalable regardless of new instances added. It is also robust against system call attacks.

Introduction

Malware (malicious software) continues to be widespread despite the common use of anti-virus software, and the diversity of malware is increasing. Symantec detected over 5.5 billion malware attacks in 2011, an 81% increase over 2010 (Symantec Corporation, 2012). A recent malware indexing technique was proposed to find the most similar malware in a large malware database, for detection of both known and unknown malicious programs (Hu and Chiueh, 2009). To reduce the resource and management costs for such large databases, deriving a representative behavior for each distinct malware family is an attractive, but challenging, possibility.

Due to the disadvantages of signature-based malware detection, recent researchers have turned to analysis of malware behavior. Previous work for behavior-based malware detection is based on static analysis (i.e. disassembly), or on dynamic analysis at runtime. Static analysis relies on disassembly to identify malicious behavior, and makes use of control flow analysis and data flow analysis, such as in SMIT (Hu and Chiueh, 2009), MetaAware (Zhang and Reeves, 2007), Krugel et al. (2005), etc. However, these methods need full disassembly for analysis, and may have problems with code polymorphism (e.g. encryption, packing of binaries) and metamorphism (e.g. code obfuscation) (Christodorescu and Jha, 2004, Moser et al., 2007a, Christodorescu and Jha, 2003a).

To overcome the drawbacks of static analysis, dynamic behavior-based analysis has gained much interest (Bayer et al., 2009, Bailey et al., 2007, Lee and Mody, 2006, Yin et al., 2007, Kolbitsch et al., 2010, Willems et al., 2007). However, existing dynamic approaches can require intensive memory analysis to investigate unique properties (i.e. the usage of memory areas of interest) of the malware. This may be achieved by advanced techniques like symbolic execution and taint analysis, which cause major performance overhead, as discussed in Yin et al. (2007), Kirda et al. (2006), and Kolbitsch et al. (2009). Kolbitsch et al. (2009) proposed a malware detection method that extracts a system call behavioral graph for each instance. The graph includes dependencies between system calls and their arguments, specifically the usage of buffers and strings. To find the dependencies, they also used dynamic program slicing and symbolic execution, which require considerable execution and memory overhead, as discussed in Xin and Zhang (2009).

In this paper, we propose a new malware detection method. This method makes use of one representative common behavioral graph for all malware instances in a family, instead of one behavioral graph per instance. The proposed approach is valid and effective since most new malware samples are variants of known malware families (Vlachos et al., 2011, Gordon, 1997, Park et al., 2010). The common behavioral graph is created from individual behavioral graphs through graph clustering. Each behavioral graph represents the usage of kernel objects, the object attributes, and the dependencies between these kernel objects. These properties are obtained by monitoring system calls and their arguments at runtime. The resulting common behavioral graph represents common behavior in malware instances of a family. The common graph has a unique subgraph, called the HotPath, which is shared by all the malware instances in the family. By using the common behavioral graph and the HotPath, the proposed method detects new variants of the family with high accuracy.
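As a rough illustration of how a per-instance behavioral graph might be built from a system call trace, the following Python sketch creates one node per kernel object and one edge per dependency. The trace record format (`SyscallRecord`), the function name `build_kobg`, and the dependency rule (object B depends on object A when the call that creates B receives A's handle) are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SyscallRecord:
    """One intercepted system call in a hypothetical, simplified trace format."""
    name: str                          # system call name, e.g. "NtCreateFile"
    in_handles: Tuple[int, ...]        # handles passed in as arguments
    out_handle: Optional[int] = None   # handle created or opened by the call
    obj_type: str = ""                 # kernel object type behind out_handle
    obj_name: str = ""                 # object attribute, e.g. a path or key name


def build_kobg(trace):
    """Return (nodes, edges) of a kernel object graph.

    Assumption: object B depends on object A when a call that creates B
    receives A's handle as an argument.
    """
    by_handle = {}             # handle -> node label (type, name)
    nodes, edges = set(), set()
    for rec in trace:
        if rec.out_handle is None:
            continue           # call did not create or open an object
        label = (rec.obj_type, rec.obj_name)
        nodes.add(label)
        for h in rec.in_handles:
            if h in by_handle:
                edges.add((by_handle[h], label))
        by_handle[rec.out_handle] = label
    return nodes, edges


# Illustrative trace: a file is opened, a section is created from it,
# and an unrelated registry key is opened.
trace = [
    SyscallRecord("NtCreateFile", (), 0x10, "File", r"C:\sample.dll"),
    SyscallRecord("NtCreateSection", (0x10,), 0x14, "Section", ""),
    SyscallRecord("NtOpenKey", (), 0x18, "Key", r"HKLM\Software\Run"),
]
nodes, edges = build_kobg(trace)
```

Nodes are labeled by object type and name rather than by handle value, since handle values differ between runs and would make graphs from different instances incomparable. Handle reuse after a close is ignored in this sketch.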

The proposed method is different from others in the following ways. Instead of finding unique properties for each malware sample, the proposed method constructs a common behavioral graph for an entire family. It uses a behavioral graph of kernel objects and their attributes and dependencies. It does not require memory tracing, tainting, or program slicing to work. It derives the HotPath, which is execution behavior shared by all malware instances in the family. The proposed scheme aims to achieve the following features, as discussed further in Section 4.

  • Scalability: Each family having similar behaviors produces one common behavioral graph, a representative behavior for the malware family. The resulting common behavioral graph is not changed significantly by the inclusion of many different variants in the family.

  • Efficiency: The method uses system call traces and their arguments by only reading the interesting memory blocks (corresponding to kernel objects). Analysis and graph generation require very little execution time.

  • Effectiveness: The resulting common behavioral graph can detect unknown malware because it is accurate enough to identify variations of common malware behaviors.

  • Robustness: The resulting common behavioral graph is not vulnerable to system call injection attacks, since it makes use of kernel objects and their dependencies. Furthermore, this method uses a similarity measurement to resist attempts at obfuscation, rather than using exact graph matching.
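The difference between exact matching and similarity-based matching can be illustrated with a small Python sketch. Jaccard similarity over edge sets is used here only as a stand-in, since the paper's actual similarity measurement is not described in this excerpt. The edge labels and the `THRESHOLD` value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two edge sets (a stand-in similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Edges of a family's common graph, labeled by kernel object type.
family_edges = {
    ("File", "Section"),
    ("Section", "Process"),
    ("Process", "Thread"),
    ("Key", "File"),
}

# A variant that exhibits the same behavior plus one injected, unrelated edge.
variant_edges = family_edges | {("Event", "Mutant")}

exact_match = variant_edges == family_edges      # fails because of the extra edge
score = jaccard(family_edges, variant_edges)     # 4 shared edges out of 5 in total

THRESHOLD = 0.7                                  # hypothetical decision threshold
is_match = score >= THRESHOLD
```

In this toy case the injected edge defeats exact comparison, while the similarity score drops only slightly and the variant is still matched.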

A prototype of the proposed method has been implemented. The prototype uses the Ether (Dinaburg et al., 2008) platform to intercept system calls and their arguments on Windows. When evaluated on well-known data sets, the method achieved higher detection rates than those reported in previous research papers, with false positive rates at or close to 0%. Overhead for graph construction and matching is only on the order of tens of milliseconds.

Our contributions are as follows. First, we propose a method to extract a kernel object behavioral graph (KOBG) for a malware instance from system call traces collected during execution. Second, from a set of KOBGs, the method derives a common behavioral graph that captures the representative execution behavior of a malware family by using graph clustering. We also found that, without manual inspection, the common graph has a unique subgraph (HotPath) shared by all of the binaries in the family. Third, the generated common behavioral graph is scalable, since it does not change significantly when new malware variants are added to the family. Finally, matching against the common behavioral graph is robust against some system call attacks, which have been problematic for previous methods that use system calls. In addition, the proposed method achieved better results on real malware instances than other techniques.
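The idea of combining per-instance graphs into one weighted common graph, and of reading off the part shared by every instance, can be sketched as follows. This is a deliberate simplification: a union of edge sets stands in for the paper's graph clustering, and the weighting rule (fraction of instances that contain an edge) is an assumption for illustration. The function names and edge labels are hypothetical.

```python
from collections import Counter


def weighted_common_graph(instance_graphs):
    """Map each edge to the fraction of instance graphs that contain it."""
    counts = Counter(edge for g in instance_graphs for edge in set(g))
    n = len(instance_graphs)
    return {edge: c / n for edge, c in counts.items()}


def hot_path(weighted):
    """Edges observed in every instance (weight exactly 1.0)."""
    return {edge for edge, w in weighted.items() if w == 1.0}


# Three instances of one hypothetical family, as edge sets over object types.
g1 = {("File", "Section"), ("Section", "Process"), ("Key", "File")}
g2 = {("File", "Section"), ("Section", "Process"), ("Process", "Thread")}
g3 = {("File", "Section"), ("Section", "Process")}

common = weighted_common_graph([g1, g2, g3])
shared = hot_path(common)
```

Here `shared` contains only the two edges present in all three instances, while the edges seen in a single instance remain in `common` with a lower weight. Adding a further variant that repeats known edges changes the weights but adds no new edges to the graph.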

The remainder of this paper is organized as follows. Sections 2 and 3 present an overview and the details of the proposed method, respectively. Section 4 presents the results of evaluating the method with actual malware instances. Section 6 describes previous work and compares it with the proposed method. Discussion and conclusions are presented in Sections 5 and 7.

System overview

This section provides an overview of the proposed method. Some instances of malware are assumed to have been previously collected and classified into malware families; one of several known techniques (Bayer et al., 2009, Bailey et al., 2007, Lee and Mody, 2006) and anti-virus tools can be used for this purpose. These constitute a training set.

Each binary in this set is executed in a restricted (“sandbox”) environment, and its execution is monitored. Information is captured and recorded about

The proposed method

The new method has two steps for malware detection. In the first step, a set of malware instances (executables) is processed to derive kernel object behavioral graphs (one per instance) that represent their behavior. In the second step, the graphs for malware instances in the same family are clustered into a single graph (a Weighted Common Behavioral Graph) that represents the behavior of all members of that family. The single graph is reliably generated by a few malware samples

Evaluation

This section presents the results of evaluating the proposed method. The method was implemented and used to detect a number of malware samples taken from “the wild”, and also used by other researchers. The detection and false positive rates, scalability, and robustness of the proposed method are examined in detail.
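For reference, the two headline metrics follow the standard definitions: the detection rate is the fraction of malware samples that are flagged, and the false positive rate is the fraction of benign samples that are flagged. The Python sketch below uses arbitrary illustrative counts, which are not results from the paper.

```python
def rates(tp, fn, fp, tn):
    """Return (detection_rate, false_positive_rate) from confusion counts."""
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate


# Illustrative counts only: 10 malware samples (9 detected) and
# 20 benign programs (none flagged).
dr, fpr = rates(tp=9, fn=1, fp=0, tn=20)
```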

Discussions

This section discusses the limitations of and possible evasion techniques against the proposed method. We suggest possible solutions to address these limitations.

Related work

The graph clustering problem is very well known, and many applications have been proposed in the literature (Duda et al., 2000, Serratosa and Alquézar, 2002, Günter and Bunke, 2002, Cook and Holder, 2006). In addition, graph-based methods have been proposed for binary and behavior analysis (Ning and Xu, 2003, Wang et al., 2009, Christodorescu et al., 2007, Yin et al., 2007).

Behavior-based malware detection can be achieved by static analysis or dynamic analysis; both techniques are briefly reviewed

Conclusion

This paper proposed a new method for detecting malicious software (malware) instances. The method collects system call traces during execution of programs in a virtualized environment. From each trace, a kernel object behavioral graph (KOBG) is produced, representing the kernel objects that are accessed, and their dependencies. The KOBGs of a group of malware instances in the same family are combined into a Weighted Common Behavioral Graph (WCBG). This includes a special subgraph, the HotPath,

References (46)

  • F.N. Abu-Khzam et al.

    The maximum common subgraph problem: faster solutions via vertex cover, computer systems and applications

  • M. Bailey et al.

    Automated classification and analysis of internet malware

  • U. Bayer et al.

    Scalable, behavior-based malware clustering

  • H. Bunke et al.

    Graph clustering using the weighted minimum common supergraph

  • C. Cadar et al.

    Exe: automatically generating inputs of death

  • M. Christodorescu et al.

    Static analysis of executables to detect malicious patterns

  • M. Christodorescu et al.

    Static analysis of executables to detect malicious patterns

  • M. Christodorescu et al.

    Testing malware detectors

  • M. Christodorescu et al.

    Semantics-aware malware detection

  • M. Christodorescu et al.

    Mining specifications of malicious behavior

  • D. Conte et al.

    Challenging complexity of maximum common subgraph detection algorithms: a performance analysis of three algorithms on a wide database of graphs

    J Gr Algorithms Applications

    (2007)
  • D.J. Cook et al.

    Mining graph data

    (2006)
  • Symantec Corporation

    Symantec global internet security threat report

    (April 2012)
  • D. Dagon et al.

    A taxonomy of botnet structures

  • A. Dinaburg et al.

    Ether: malware analysis via hardware virtualization extensions

  • R.O. Duda et al.

    Pattern classification

    (2000)
  • S. Forrest et al.

    A sense of self for unix processes

  • M. Fredrikson et al.

    Dynamic behavior matching: a complexity analysis and new approximation algorithms

  • S. Günter et al.

    Self-organizing map for clustering in the graph domain

    Pattern Recognit Letters

    (2002)
  • S. Gordon

    What is wild?

  • X. Hu et al.

    Large-scale malware indexing using function-call graphs

  • J. Kinder et al.

    Detecting malicious code by model checking

  • E. Kirda et al.

    Spyware detection

Younghee Park is an assistant professor in Computer Engineering at San Jose State University. She received her Ph.D. in Computer Science from North Carolina State University in 2010. After receiving her Ph.D., she worked at top-tier research universities, Columbia University in New York and the University of Illinois at Urbana-Champaign. After finishing her master's degree at KAIST (Korea Advanced Institute of Science and Technology), she worked at the National Security Research Institute, a part of ETRI in Korea, in 2003. Her research interests cover many aspects of cyber security, with an emphasis on malware detection, insider attacks, botnets, and traceback to detect attacks. Currently, she is focusing on security issues in smart grids and mobile phones.

Douglas S. Reeves is a professor in Computer Science at North Carolina State University. He is also the Director of Graduate Programs at NCSU. He received his Ph.D. in Computer Science from the Pennsylvania State University in 1987. In the same year, he joined the faculty of N.C. State University. He is interested in computer systems, security, and peer-to-peer computing.

Mark Stamp has been involved with information security for the past 18 years. Following academic work in cryptography, he spent seven years as a cryptanalyst with the National Security Agency, followed by two years developing a digital rights management product for a Silicon Valley startup company. For the past seven years, Dr. Stamp has been with the Department of Computer Science at San Jose State University, where he teaches courses in information security. He has published numerous articles and is the author of two textbooks, Information Security: Principles and Practice, 2nd edition (Wiley 2011) and Applied Cryptanalysis: Breaking Ciphers in the Real World (Wiley 2007).