Elsevier

Computers & Security

Volume 28, Issue 8, November 2009, Pages 827-842
Using a bioinformatics approach to generate accurate exploit-based signatures for polymorphic worms

https://doi.org/10.1016/j.cose.2009.06.003

Abstract

In this paper, we propose the Simplified Regular Expression (SRE) signature, which uses multiple sequence alignment techniques, drawn from bioinformatics, in a novel approach to generating more accurate exploit-based signatures. We also provide formal definitions of what is “a more specific” and what is “the most specific” signature for a polymorphic worm and show that generating the most specific exploit-based signature is NP-hard. The approach involves three steps: multiple sequence alignment to reward consecutive substring extractions, noise elimination to remove noise effects, and signature transformation to make the SRE signature compatible with current IDSs. Experiments on a range of polymorphic worms and real-world polymorphic shellcodes show that our bioinformatics approach is noise-tolerant and that, because it extracts more polymorphic worm characteristics, such as one-byte invariants and distance restrictions between invariant bytes, the signatures it generates are more accurate and precise than those generated by some other exploit-based signature generation schemes.

Introduction

Internet worms propagate over the Internet through infections in which a worm sends a payload (such as a shellcode or a copy of itself) to the target host by exploiting operating system or network service vulnerabilities (Tang et al., 2009). A polymorphic worm is a worm that changes its appearance at each infection, making detection and prevention much harder. One of the most popular and effective ways to detect worms is signature-based detection (also called content-based filtering). The generated signature can be either exploit-based or vulnerability-based: an exploit-based signature describes the characteristics of one or more exploits; a vulnerability-based signature describes properties of one vulnerability and can detect all possible exploits utilizing this vulnerability.

Although vulnerability-based signatures could be more effective, so far exploit-based and vulnerability-based signatures are equally important for worm detection in current IDSs (Vulnerability, 2005), for the following reasons. First, a real vulnerability-based signature can be generated only if a vulnerability is disclosed (Brumley et al., 2006), whereas an exploit-based signature can be generated quickly to detect zero-day exploits of an undisclosed vulnerability. Second, most IDS/anti-virus vendors generate both types of signatures. Through them, we learn not only when we were attacked but also how we were attacked, by investigating whether an exploit is new or a variant of a well-known exploit. Third, some vulnerability-based signatures cannot be adopted in current IDSs since they require that IDSs be able to perform deep protocol analysis (Crandall et al., 2005), which overestimates the capability of current IDSs. Exploit-based signature generation has no such requirement, and hence exploit-based signature generation systems are more compatible with current IDSs. Compared with manual signature generation, the recently studied automatic signature generation approaches can generate more accurate signatures for worms much faster, especially for polymorphic worms. Thus, this paper focuses only on automatic exploit-based signature generation for polymorphic worms.

We find that currently available exploit-based signature generation approaches may fail to create accurate signatures from collections of exploit samples of polymorphic worms. To understand why, consider a sample of a polymorphic worm (an infection flow) as a string sequence consisting of invariant and wildcard bytes. Invariant bytes have fixed values and are present in every worm sample. In contrast, wildcard bytes change their values in each sample. Fig. 1 shows an example of the polymorphic Code Red II worm, which contains a sequence of seven invariant parts: “GET”, “.ida?”, “XX”, “%u”, “%u780”, “=”, and “HTTP/1.0\r\n”. Typically, we would try to extract the invariant parts of polymorphic worms as their signatures, since invariant bytes in a worm flow are composed of a number of invariant parts which are crucial to the exploitation of a vulnerable server (Crandall et al., 2005, Newsome et al., 2005). But this raises two difficulties. First, some invariant parts in polymorphic worms cannot be extracted. Earlier approaches (Kreibich and Crowcroft, 2003, Kim and Karp, 2004, Singh et al., 2004) were able to generate only a single invariant part. More recent approaches (Polygraph (Newsome et al., 2005) and Hamsa (Li et al., 2006)) can extract most invariant parts except for one-byte invariant parts (such as “=” in the Code Red II worm). Second, no approach takes into account all distance restrictions between invariant parts (for example, “%u780” is 4 bytes after “%u” in the Code Red II worm). One might ask whether one-byte invariant (signature) parts and distance restrictions are valuable in worm detection. Consider Table 1, which summarizes some of the rules (a rule represents a signature) of two well-known IDSs, Snort and Bro. It can be seen that 40.6% of rules in Snort's exploit.rules file consist of multiple invariant parts, 38.7% contain distance restrictions, and 22% contain one-byte signature parts. These figures suggest that distance restrictions could play a role in worm infections and that signatures generated by previous NSG systems may not be accurate enough to identify worms, resulting in a high rate of false positives.
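To make the role of one-byte invariants and distance restrictions concrete, the following minimal sketch expresses the seven invariant parts listed above as an ordinary Python regular expression. It is an illustration only, not the SRE syntax defined later in the paper: the pattern name `CODE_RED_II_SIG`, the helper `matches_signature`, the sample flows, and the reading of “4 bytes after” as exactly four wildcard bytes between “%u” and “%u780” are all assumptions made here.

```python
import re

# Hypothetical regex rendering of the seven invariant parts, in order.
# ".*" stands for any run of wildcard bytes; ".{4}" encodes the assumed
# distance restriction: exactly four bytes between "%u" and "%u780".
CODE_RED_II_SIG = re.compile(
    rb"GET.*\.ida\?.*XX.*%u.{4}%u780.*=.*HTTP/1\.0\r\n",
    re.DOTALL,  # let "." match any byte, including newlines
)


def matches_signature(flow: bytes) -> bool:
    """Return True if the flow contains all invariant parts in order."""
    return CODE_RED_II_SIG.search(flow) is not None


# Made-up flows for illustration; these are not real worm captures.
worm_like = b"GET /default.ida?XXXXXXXX%u9090%u7801%u9090=a HTTP/1.0\r\n"
benign = b"GET /index.html HTTP/1.0\r\n"
wrong_distance = b"GET /a.ida?XX%u12%u780=x HTTP/1.0\r\n"

print(matches_signature(worm_like))       # all parts present, distance holds
print(matches_signature(benign))          # most invariant parts missing
print(matches_signature(wrong_distance))  # parts present, distance violated
```

Dropping the `.{4}` constraint or the one-byte “=” part makes the pattern strictly looser, which is the sense in which a signature lacking them can be less specific.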

In this paper, we model the problem of generating an accurate exploit-based signature given some zero-day exploits of a new polymorphic worm. We propose a signature type – the Simplified Regular Expression (SRE) signature. An SRE signature can be easily transformed into rules for current IDSs to accurately capture polymorphic worms. Based on SRE, we provide formal definitions of what is “a more specific” and what is “the most specific” signature of a polymorphic worm so that we can compare the accuracy of two signatures. We show that generating the most specific SRE signature of a polymorphic worm is an NP-hard problem. The approach proposed in this paper is a network-based and exploit-based scheme for generating an SRE signature for a single polymorphic worm. It is a bioinformatics approach, inspired by the multiple sequence alignment techniques used in bioinformatics to identify motifs and domains preserved by evolution. The approach consists of three steps: multiple sequence alignment, noise elimination and signature transformation. The multiple sequence alignment step is based on the T-coffee algorithm (Notredame et al., 2000) and our newly proposed CSR (Consecutive Substrings Rewarded) algorithm. CSR is a pairwise alignment algorithm that captures contiguous invariant parts in worm samples. It is an extension of the Needleman–Wunsch algorithm and is based on rewarding consecutive matches. We minimize the impact of noise samples by using a noise elimination algorithm that depends on a noise tolerance rate θ. A signature transformation step is used to derive accurate SRE signatures that can be conveniently used in current IDSs. We analyze the complexity of the approach and the impact of its noise tolerance rate θ.
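The idea of extending Needleman–Wunsch by rewarding consecutive matches can be sketched as follows. This is not the CSR algorithm itself, whose scoring function is not given in this excerpt; it is a minimal global-alignment scorer under assumed parameters (`match`, `mismatch`, `gap`, `bonus`) in which a match that directly follows another match earns an extra `bonus`. The function name `consecutive_reward_score` is hypothetical.

```python
NEG_INF = float("-inf")


def consecutive_reward_score(a, b, match=1, mismatch=-1, gap=-1, bonus=1):
    """Global alignment score where a match directly after a match earns `bonus`."""
    n, m = len(a), len(b)
    # M[i][j]: best score with a[i-1] and b[j-1] aligned as a match.
    # X[i][j]: best score when the last column is a mismatch or a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = i * gap
    for j in range(1, m + 1):
        X[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # Extending a run of matches earns the bonus; starting one does not.
                M[i][j] = max(M[i - 1][j - 1] + match + bonus,
                              X[i - 1][j - 1] + match)
                sub = NEG_INF
            else:
                sub = max(M[i - 1][j - 1], X[i - 1][j - 1]) + mismatch
            up = max(M[i - 1][j], X[i - 1][j]) + gap
            left = max(M[i][j - 1], X[i][j - 1]) + gap
            X[i][j] = max(sub, up, left)
    return max(M[n][m], X[n][m])


print(consecutive_reward_score("abc", "abc"))           # one contiguous run
print(consecutive_reward_score("abc", "axc"))           # run broken by a mismatch
print(consecutive_reward_score("abc", "abc", bonus=0))  # plain Needleman-Wunsch score
```

With `bonus=0` the two tables collapse into the standard Needleman–Wunsch recurrence, so the bonus is the only thing that biases the alignment toward contiguous invariant substrings rather than scattered single-byte matches.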

Experiments on a range of polymorphic worms and real-world polymorphic shellcodes show that the SRE signatures generated by our approach are more accurate than those of previous methods: they capture more polymorphic worm characteristics, including one-byte invariants and distance restrictions, and they distinguish between benign traffic and worm traffic. Our approach is resilient to a small portion of noise samples. Our experiments also show that our approach (combining the CSR and T-coffee algorithms) is more accurate than other sequence alignment algorithms.

The rest of the paper is organized as follows. We first review the related work in Section 2. In Section 3, we propose the SRE signature and formulate the accurate exploit-based signature generation problem. In Section 4, we present a bioinformatics approach that generates accurate SRE signatures given some exploit samples from a polymorphic worm. Section 5 evaluates our approach with and without noise sample interference. We discuss the complexity, scope, and limitations of our approach in Section 6. Finally, Section 7 offers our conclusion.

Section snippets

Polymorphism technique

Polymorphism techniques have been exploited to create worm flows, and worm writers have started using polymorphic engines in recent years. Published polymorphic shellcode generators include ADMmutate, PHATBOT, Jempiscodes, PHolyP, Clet, TAPiON, and Metasploit. Common techniques used to write polymorphic shellcodes include garbage and NOP insertion, register shuffling, equivalent code substitution, and encryption/decryption. Although it has been shown that it may be possible to create polymorphic

Accurate exploit-based signature generation problem

In this section, we first propose a new signature – Simplified Regular Expression (SRE) signature. Based on this signature, we formally define what is “a more specific” signature and what is “the most specific” signature to compare SRE signatures. Then we present the accurate exploit-based signature generation problem and we show that the most specific signature generation for polymorphic worms is NP-hard.

A bioinformatics approach to SRE signature generation

In this section, we propose a bioinformatics approach to accurate SRE signature generation for a single polymorphic worm, inspired by some related algorithms from bioinformatics. The kernel of this approach is multiple sequence alignment, which has been broadly studied in bioinformatics where it has been used to find all preserved or common motifs and domains from a set of DNA/RNA sequences (Notredame, 2002). This application is similar to our accurate signature generation problem in which we

Experiments

In this section, we evaluate our bioinformatics approach under different scenarios. After introducing metrics used to measure signature quality, we first evaluate our approach using synthetic polymorphic worms and compare it with previous approaches, like Polygraph (Newsome et al., 2005) and Hamsa (Li et al., 2006). Then, we show the signature generation quality for polymorphic shellcodes, which were constructed by some real-world shellcode engines. We also carried out experiments to test the

Complexity analysis

Suppose that there are N sequences of length L as the input to our signature generation approach. The time complexity of the CSR algorithm is O(L^2) and the time complexity of multiple sequence alignment is O(N^2 L^2) + O(N^3 L) + O(N^3) + O(N L^2), where O(N^2 L^2) represents the computation involved in the pairwise alignment to build the primary library, O(N^3 L) the library extension, O(N^3) the guide tree construction and O(N L^2) the computation of the progressive alignment. Since the time complexity of noise
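As a rough illustration of how the four terms of the multiple sequence alignment cost compare, the sketch below evaluates each term for given N and L. Constant factors are ignored, so the numbers indicate orders of growth only; the function name `msa_cost_terms` and the sample values of N and L are assumptions made for this example, not figures from the paper.

```python
def msa_cost_terms(n: int, length: int) -> dict:
    """Evaluate the four asymptotic terms for n sequences of the given length."""
    return {
        "pairwise_library": n**2 * length**2,      # O(N^2 L^2)
        "library_extension": n**3 * length,        # O(N^3 L)
        "guide_tree": n**3,                        # O(N^3)
        "progressive_alignment": n * length**2,    # O(N L^2)
    }


# Hypothetical workload: 100 samples of 1000 bytes each.
terms = msa_cost_terms(100, 1000)
dominant = max(terms, key=terms.get)
for name, value in terms.items():
    print(f"{name:>22}: {value:.1e}")
print("dominant term:", dominant)
```

When L exceeds N, as in this hypothetical workload, the pairwise library term N^2 L^2 is the largest; when N exceeds L, the library extension term N^3 L overtakes it.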

Conclusion

In this paper, we addressed the problem of generating accurate exploit-based signatures for a single polymorphic worm and proposed a novel signature generation method based on multiple sequence alignment – a bioinformatics approach. This approach provides a more powerful method to accurately analyze the intrinsic similarities of worm samples. An IDS can employ such an approach to locally generate accurate exploit-based signatures for polymorphic worms and such signatures can be distributed to other

Yong Tang received the B.Sc, M.Sc and Ph.D. degrees in computer science from College of Computer, National University of Defense Technology, China, in 1998, 2002 and 2008 respectively. Now he is a lecturer in the College of Computer, National University of Defense Technology, China. His primary research field is network security, especially on Internet worm attacks, intrusion detection and Honeypot.

References (34)

  • Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady; 1966.
  • Li Z, Sanghi M, Chen Y, Kao MY, Chavez B. Hamsa: fast signature generation for zero-day polymorphic worms with provable...
  • Li Z, Wang L, Chen Y, Fu Z. Network-based and attack-resilient length signature generation for zero-day polymorphic...
  • Liang Z, Sekar R. Fast and automated generation of attack signatures: a basis for building self-protecting servers. In:...
  • Liang Z, Sekar R. Automatic generation of buffer overflow attack signatures: an approach based on program behavior...
  • Newsome J, Song D. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on...
  • Newsome J, Karp B, Song D. Polygraph: automatically generating signatures for polymorphic worms. In: the 2005 IEEE...

    Bin Xiao received the B.Sc and M.Sc degrees in Electronics Engineering from Fudan University, China, in 1997 and 2000 respectively, and the Ph.D. degree in Computer Science from University of Texas at Dallas, USA, in 2003. Now he is an Assistant Professor in the Department of Computing of Hong Kong Polytechnic University, Hong Kong. His research interests include communication and security in computer networks, peer-to-peer networks, wireless mobile ad hoc and sensor networks.

    Xicheng Lu received the B.Sc. degree in computer science from Harbin Military Engineering Institute, China, in 1970. He was a visiting scholar at the University of Massachusetts between 1982 and 1984. He is now a professor in the College of Computer, National University of Defense Technology, Changsha, China. His research interests include distributed computing, computer networks, parallel computing, network security, etc. He has served as a member of the editorial boards of several journals and co-chaired many professional conferences. He is the joint recipient of more than a dozen academic awards, including four First Class National Scientific and Technological Progress Prizes of China. He is an academician of the Chinese Academy of Engineering.

    The work was partially supported by China 973 Grant No. 2005CB321801, China 863 Grant No. 2009AA01Z432 and HK RGC PolyU 5311/06E.
