Comparison of feature selection and classification algorithms in identifying malicious executables

https://doi.org/10.1016/j.csda.2006.09.005

Abstract

Malicious executables, often spread as email attachments, pose serious security threats to computer systems and associated networks. We investigated the use of byte sequence frequencies as a way to automatically distinguish malicious from benign executables without actually executing them. In a series of experiments, we compared classification accuracies over seven feature selection methods, four classification algorithms, and variable byte sequence lengths. We found that single-byte patterns provided surprisingly reliable features for separating malicious executables from benign ones. The overall performance of the models depended more on the choice of classifier than on the method of feature selection. Support vector machine (SVM) classifiers were found to be superior in terms of prediction accuracy, training time, and resistance to overfitting.

Introduction

The exponential growth and development of the Internet have created unprecedented opportunities to access and share information. From confidential business operations to chocolate-chip cookie recipes, information exchange on the Internet is carried out continuously and ubiquitously. While originally conceived as a convenient tool for text messages, email has since evolved into the backbone of the Internet, and has become the primary medium not only for communicating ideas, opinions, and appointments, but also for unauthorized access and malicious attacks. For instance, a malicious executable program attached to an apparently benign email can easily be sent to thousands of recipients. With just a couple of mouse clicks, it can cause (and has caused) damage to computer systems and associated networks. These unfriendly activities include improperly gaining access privileges (Trapdoor), disclosing sensitive information (Covert Channel), exhausting system resources (Worm), and infecting normal programs (Virus). Some programs combine all of the aforementioned malicious actions (Trojan Horse, Time/Logic Bomb, etc.).

There are several ways to determine whether or not a program might perform malicious functions. A program being screened can be compared with a known “clean” copy of the program (Cohen, 1987). Known malicious code can be detected by virus scanners or compared against a set of verification rules serving as malicious code filters (Crocker and Pozzo, 1989). Dynamic analysis combines the concepts of testing and debugging to detect malicious activities by running a program in a clean-room environment (Lo et al., 1991, Crawford et al., 1991).

To identify signatures of malicious code, a computer security expert can screen and analyze a suspicious code with respect to its functions, data flow, and variable usage in order to find those signatures that indicate malicious intentions. Although a human can reason about a program in detail, it becomes difficult to examine code and data when the malicious components are embedded in different sections, their activities are triggered only when certain conditions are satisfied, or a large number of programs must be handled at one time. Lo and his colleagues investigated and developed a malicious code filter based on the use of “tell-tale signs” and program slicing verification (Lo et al., 1995). A “tell-tale sign” is a program signature that indicates whether the program is malicious or benign. For instance, copying a viral code to the end of the text segment would be considered a malicious signature. Program “slicing” produces a subset of the original program that behaves the same with respect to the realization of a specified property. The purpose of this method is to determine whether a program is malicious or benign by examining characteristics of a small subset of a large program.

However, a malicious program may elude detection because it does not match any known signature. Its signature may deviate from known signatures, or the program may contain new signatures that have not been previously seen. As virus-writers continually create and modify malicious programs, it is essential not only to identify those malicious programs that exactly match known signatures, but also to detect new ones with similar features.

In an attempt to address the problem, Kephart and his colleagues at IBM's High Integrity Computing Laboratory used statistical methods to automatically extract computer virus signatures (Kephart and Arnold, 1994). Their idea was to identify the presence of a virus in an executable file, a boot record, or memory using short identifiers, or signatures, which consist of sequences of bytes in the machine code. A good signature is one that is found in every object infected by the virus, but is unlikely to be found if the virus is not present. Later, researchers from the same group successfully developed a neural-network-based anti-virus scanner to detect boot sector viruses (Tesauro et al., 1996). Due to system limitations at that time, it was difficult to extend the neural network classifier to detect virus types other than boot sector viruses.

In several later studies, statistical analyses and pattern recognition techniques have been applied to identify malicious code. Researchers at Columbia University (Schultz et al., 2001a, Schultz et al., 2001b) applied data mining techniques (Lee et al., 1999) to analyze a large set of malicious executables instead of only boot sectors (Tesauro et al., 1996) or only Win32 binaries (Arnold and Tesauro, 2000). Schultz and his colleagues used system resource information, embedded text strings, and byte sequences as features extracted from executables. Three learning algorithms were used to classify the data: (1) an inductive rule-based learner that generates Boolean rules based on feature attributes; (2) a Naïve Bayes classifier that estimates the posterior probability that a test file belongs to a class given a set of features; and (3) a multi-classifier system that combines the outputs from several Naïve Bayes classifiers to generate an overall prediction. Their results showed that two classifiers, the Naïve Bayes classifier (using text strings as features) and the multi-Naïve Bayes classifier (using only byte sequences), outperformed all other methods in terms of overall performance as measured by detection rate and false positive rate.
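To make the byte-sequence Naïve Bayes idea concrete, the sketch below shows a minimal Bernoulli Naïve Bayes over binary presence/absence of byte patterns. It is an illustrative reconstruction, not the Columbia implementation; the feature encoding and variable names are assumptions.

```python
import math

def train_bernoulli_nb(X, y):
    """Fit class priors and per-feature conditionals.
    X: list of 0/1 vectors (1 = byte pattern present in the file);
    y: list of class labels (0 = benign, 1 = malicious)."""
    n, d = len(y), len(X[0])
    classes = sorted(set(y))
    prior = {c: sum(1 for label in y if label == c) / n for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Laplace-smoothed estimate of P(feature_j = 1 | class c)
        cond[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in range(d)]
    return prior, cond

def log_posterior(x, prior, cond):
    """Unnormalized class log-posteriors for a new feature vector x."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for j, xj in enumerate(x):
            p = cond[c][j]
            s += math.log(p if xj else 1.0 - p)
        scores[c] = s
    return scores  # predicted class = argmax over scores
```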

More recently, using the same data set created by the researchers at Columbia University, Wang and colleagues tried the Naïve Bayes classifier and compared its performance with a Decision Tree (Wang et al., 2003). The motivation was, instead of using binary sequence information, to use feature information at the assembly language level or even at higher language levels. To do so, a reverse-engineering step is needed to convert the machine code back to its original computer language, if possible. Interestingly, the overall accuracy of the Naïve Bayes classifier was significantly lower than that reported in Schultz's paper (Schultz et al., 2001a). It is not clear whether the discrepancy originated from the different approaches to feature selection. In a pilot study, Kolter and Maloof also tried different classification algorithms to detect malicious executables using byte sequence patterns. Of the four evaluated classifiers, boosted decision trees outperformed the other methods when each feature variable consisted of 4 bytes (4-grams) (Kolter and Maloof, 2004).

Detecting malicious executables using text strings or byte sequences carries some similarities to text categorization, which uses words and/or phrases to assign documents to pre-defined categories. In the text categorization problem, the native feature space consists of the unique terms (words or phrases) that occur in documents, which can be tens or hundreds of thousands of terms for even a moderate-sized text collection. To effectively pursue dimensionality reduction, Yang and Pedersen conducted a comprehensive study of five feature selection methods in statistical learning of text categorization (Yang and Pedersen, 1997). The feature selection method based on an information gain metric yielded improved performance while removing up to 98% of the terms. In addition to the fact that prediction accuracy depends on the data and the feature selection method, the right choice of classifier also plays a critical role. Besides the successful application of Naïve Bayes classifiers in text categorization (Lewis and Ringuette, 1994, McCallum and Nigam, 1998), significant improvements have been reported from using SVMs as classifiers (Joachims, 1998b, Dumais et al., 1998, Yang and Liu, 1999). As SVMs have gradually gained a reputation as “must-try” classifiers, more comparative studies have examined other aspects of applying machine learning to text classification, such as feature selection. For example, Forman presented an extensive empirical study of 12 feature selection metrics evaluated on a benchmark of 229 text classification problems. The results revealed a new feature selection metric called ‘bi-normal separation’ (BNS), which outperformed the others by a substantial margin in a range of situations (Forman, 2003).
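For reference, the information gain criterion mentioned above reduces to a few lines when each candidate term (or byte pattern) is treated as a binary present/absent feature. The sketch below is a generic illustration, not code from any of the cited studies; the ranking-and-truncation workflow in the final comment is an assumption.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """feature: 0/1 per document (term absent/present); labels: class per document."""
    gain = entropy(labels)
    for v in (0, 1):
        subset = [lab for f, lab in zip(feature, labels) if f == v]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Assumed workflow: score every candidate feature, sort by information gain,
# and keep only the top-ranked fraction (e.g. discard up to 98% of the terms).
```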

In this study, we first compared the performance of seven different feature selection methods against three types of classifiers in distinguishing benign from malicious binaries. Then, we examined whether extending the features to multiple-byte sequences could improve the classification accuracy. Finally, SVM classifiers with different kernel functions were applied and their performance was compared. The rationale for choosing byte sequences as candidate features is that those byte patterns are the most accessible and, at the same time, the most direct information about the machine code in an executable. Using embedded text strings as features, such as header information, program names, authors’ names, or comments, is not robust since they can easily be changed. Some malicious executables intentionally camouflage these signatures by randomly generating these fields to deceive virus scanners (Kephart and Arnold, 1994; Schultz et al., 2001a).
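As a concrete illustration of the byte-sequence features used throughout this study, the following sketch counts hexadecimal byte n-grams in an executable and normalizes them to relative frequencies. It is a plausible reading of the preprocessing, not the authors' exact pipeline; the file name and the normalization step are assumptions.

```python
from collections import Counter

def byte_ngram_frequencies(path, n=1):
    """Relative frequencies of hexadecimal byte n-grams in one executable.
    n=1 gives single-byte patterns such as "cd"; n=2 gives pairs such as "cd21"."""
    with open(path, "rb") as f:
        data = f.read()
    counts = Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: c / total for pattern, c in counts.items()}

# Example usage (hypothetical file name):
# freqs = byte_ngram_frequencies("sample.exe", n=1)
# freqs.get("cd", 0.0)
```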

Section snippets

Data description and preparation

Our experimental data were downloaded from the intrusion detection system site at Columbia University: http://www.cs.columbia.edu/ids/mef/. The data set consists of 4754 files in total, 1074 benign and 3680 malicious. There are no duplicate programs in the data set, and each file is labeled as either benign or malicious. A more detailed description of the data set can be found in Schultz et al. (2001a, 2001b).

The downloaded files had already been transformed from their binary form into

Graphical analysis of feature selection methods

In Fig. 3, the plot shows how the average frequency of a byte pattern F is associated with either benign or malicious files. The y-axis is the average frequency of each variable in the malicious files, i.e. P(F|M). The average occurrence of each variable in the benign files, i.e. P(F|B), is plotted on the x-axis. The dashed line with a slope of 45° divides the quadrant into two sections. When a variable occurs in the upper-left part of the quadrant, e.g. the byte pattern “cd”, the variable more
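The class-conditional averages plotted in Fig. 3 can be reproduced directly from per-file frequency tables such as those sketched earlier. The sketch below (illustrative only; variable names are assumptions) computes P(F|B) and P(F|M) for one byte pattern and reports which side of the 45° line it falls on.

```python
def class_average_frequencies(file_freqs, labels, pattern):
    """Average relative frequency of `pattern` in benign (label 0) and malicious (label 1) files.
    file_freqs: list of {byte pattern: relative frequency} dicts, one per file."""
    def avg(cls):
        vals = [f.get(pattern, 0.0) for f, lab in zip(file_freqs, labels) if lab == cls]
        return sum(vals) / len(vals) if vals else 0.0
    p_given_b, p_given_m = avg(0), avg(1)
    # Above the 45-degree line: P(F|M) > P(F|B), i.e. more associated with malicious files.
    side = "malicious" if p_given_m > p_given_b else "benign"
    return p_given_b, p_given_m, side

# e.g. class_average_frequencies(all_file_freqs, all_labels, "cd")
```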

Discussion

Using frequency-derived statistical measures of byte sequences to detect malicious executables is very different from a program analysis approach. Code analysis examines meaningful information about operation codes, address ranges, function calls, and data flow, either statically or by actually running and monitoring the code to infer potential malicious behavior. “Clean room” execution may lead to the identification of malicious code not detected by static analysis. Single bytes in binary

Acknowledgment

We would like to thank the anonymous reviewers for their time and very helpful feedback. We are also grateful to Dr. Salvatore Stolfo and his colleagues at Columbia University for letting us use their data. We also thank Drs. Chih-Jen Lin and Chih-Chung Chang for providing the LIBSVM package. This work was supported by the LDRD program at Los Alamos National Laboratory.

References (31)

  • F. Cohen, A cryptographic checksum for integrity protection, Computers and Security (1987)
  • R. Kohavi et al., Wrappers for feature subset selection, Artificial Intelligence (1997)
  • R. Lo et al., MCF: a malicious code filter, Comput. Security (1995)
  • Arnold, W., Tesauro, G., 2000. Automatically generated Win32 heuristic virus detection. Proceedings of the 2000...
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
  • Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a Library for Support Vector Machines. Software available at...
  • C. Cortes et al., Support-vector network, Machine Learning (1995)
  • Crawford, R., Lo, R., Crossley, J., Fink, G., Kerchen, P., Ho, W., Levitt, K., Olsson, R., Archer, M., 1991. A testbed...
  • Crocker, S., Pozzo, M.M., 1989. A proposal for a verification to malicious code detection. Proceedings of IEEE Computer...
  • Dumais, S., Platt, J., Heckerman, D., Sahami, M., 1998. Inductive learning algorithm and representations for text...
  • G. Forman, An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res. (2003)
  • T. Joachims, Making large-scale SVM learning practical
  • Joachims, T., 1998b. Text categorization with support vector machines: learning with many relevant features. European...
  • S.S. Keerthi et al., Improvements to Platt's SMO algorithm for SVM classifier design, Neural Comput. (2001)
  • Kephart, J.O., Arnold, W.C., 1994. Automatic extraction of computer virus signatures. Proceedings of the 4th...