Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification

https://doi.org/10.1016/j.jnca.2010.10.009

Abstract

This paper proposes a method to identify flooding attacks in real time, based on anomaly detection by genetic weighted KNN (K-nearest-neighbor) classifiers. A genetic algorithm is used to train an optimal weight vector for the features; meanwhile, an unsupervised clustering algorithm is applied to reduce the number of instances in the sampling dataset, both to shorten training and execution time and to improve the system’s overall accuracy. More precisely, the instances in the sampling dataset are replaced by fewer, but more significant, cluster centroids. A system was implemented according to the proposed method and evaluated against numerous Denial-of-Service (DoS) attacks. With an embedded weighted KNN classifier, the proposed system can identify a DoS attack in network traffic within a very short time; moreover, the experimental results show an overall accuracy of 95.8654% under 2-fold cross-validation and 96.25% in the evaluation on all known attacks. That is, the proposed system is both effective and efficient. Effectiveness is measured by overall accuracy, including detection rate and false alarm rate, and efficiency is measured by the response time during an attack.

Introduction

Network intrusion detection systems (NIDSs) are traditionally divided into two broad categories: misuse detection (Lekkas and Mikhailov, 2007; Caswell et al., 2003) and anomaly detection (Toosj and Kahani, 2007; Tsang et al., 2007; Auld et al., 2007). Misuse detection aims to detect known attacks by characterizing the rules that govern them. Rule updates are therefore particularly important, and NIDS vendors frequently release new definitions. However, the rapid emergence of new vulnerabilities and exploits is gradually making misuse detection difficult to trust. Anomaly detection, by contrast, is designed to capture any deviation from the profiles of normal behavior patterns. It is much more suitable than misuse detection for detecting unknown or novel attacks, but it may generate too many false alarms. This paper proposes a system for anomaly detection of DoS attacks using a genetic weighted KNN classifier, which is further enhanced by running an unsupervised clustering algorithm, named MLBG (Rosenberger and Chehdi, 2000), on the sampling instances in advance.
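The idea of replacing sampling instances with cluster centroids can be illustrated with a minimal sketch. It does not implement MLBG, whose details are not given here; plain Lloyd's k-means is swapped in as a stand-in. Clustering each class separately so that centroids inherit a label, the function names, and the toy data are all assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans_centroids(points, k, iterations=20):
    """Plain Lloyd's k-means (a stand-in for MLBG) with deterministic seeding."""
    k = min(k, len(points))
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]  # evenly spaced seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def reduce_to_centroids(samples, labels, k_per_class):
    """Replace each class's instances with at most k labelled centroids (assumed scheme)."""
    reduced = []
    for label in sorted(set(labels)):
        members = [s for s, l in zip(samples, labels) if l == label]
        for centroid in kmeans_centroids(members, k_per_class):
            reduced.append((centroid, label))
    return reduced

# Hypothetical two-feature instances, for illustration only.
samples = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
           (20.0, 20.0), (20.5, 19.5), (19.5, 20.5), (21.0, 21.0)]
labels = ["normal"] * 6 + ["attack"] * 4
reduced = reduce_to_centroids(samples, labels, k_per_class=2)
```

Here ten instances shrink to four labelled centroids, so a KNN classifier built on `reduced` compares each query against fewer reference points.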

Most NIDSs, especially anomaly-based ones, emphasize effectiveness but neglect efficiency. Usually, effectiveness is measured by detection rate, false alarm rate, etc., and efficiency is measured by the response time during an attack. Giving an anomaly-based NIDS too many features does not necessarily guarantee good performance, and it certainly delays the detection engine in making a decision. Determining how to select fewer but more significant features therefore becomes a vital concern. Furthermore, features should be weighted, because they do not all contribute equally to classification. This study applies a genetic algorithm to weight all candidate features and selects an optimal feature set to construct the proposed real-time NIDS for anomalous network traffic identification. System performance is measured by weighted KNN classification, in which the feature weights act on the distance measurements.
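A minimal sketch of how feature weights can act on a KNN distance measurement is given below. The weighted Euclidean form, the majority vote, and the two hypothetical features are assumptions for illustration; the paper's exact distance formula is not reproduced in this text.

```python
import math
from collections import Counter

def weighted_distance(a, b, weights):
    """Euclidean distance with each squared feature difference scaled by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def weighted_knn_predict(samples, labels, weights, query, k=3):
    """Return the majority label among the k samples nearest to `query`."""
    ranked = sorted(
        zip(samples, labels),
        key=lambda pair: weighted_distance(pair[0], query, weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical instances: (packets per time window, mean packet size in bytes).
samples = [(10.0, 500.0), (12.0, 480.0), (11.0, 510.0),
           (900.0, 60.0), (950.0, 64.0), (870.0, 58.0)]
labels = ["normal", "normal", "normal", "attack", "attack", "attack"]
weights = (1.0, 0.2)  # assumed weights; the paper obtains them with a genetic algorithm
prediction = weighted_knn_predict(samples, labels, weights, (880.0, 70.0), k=3)
```

A weight of zero removes a feature from the distance entirely, so feature selection can be viewed as a special case of feature weighting.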

Several earlier anomaly-based NIDSs focused on feature weighting and selection, such as Mukkamala and Sung (2002), Sung and Mukkamala (2003), Lee et al. (2006), Abbes et al. (2004), Stein et al. (2005), Hofman et al. (2004), Middlemiss and Dick (2003), and Liao and Vemuri (2002). Mukkamala and Sung (2002) applied the Support Vector Machine (SVM) technique to rank the 41 features provided by KDD CUP99 (The UCI KDD Archive). Sung and Mukkamala (2003) ranked the features by both SVM and neural networks. Lee et al. (2006) discussed feature selection based on a genetic algorithm combined with the Relief Tree, and on a genetic algorithm combined with the Naïve Bayesian Network; they also used KDD CUP99 as the experimental dataset. Abbes et al. (2004) and Stein et al. (2005) both applied decision trees to design their detection systems. Features for tree nodes were selected by a genetic algorithm in Abbes et al. (2004), and by information gain mixed with gain ratio and the Gini Index in Stein et al. (2005). Abbes et al. (2004) experimented on a self-created dataset, while Stein et al. (2005) used KDD CUP99. Hofman et al. (2004) applied genetic algorithms combined with a radial basis function (RBF) network to select features, and took 7 attacks out of the KDD CUP98 for the experiment. Finally, Middlemiss and Dick (2003) and Liao and Vemuri (2002) proposed a genetic algorithm combined with KNN for feature selection. Middlemiss and Dick (2003) experimented on the KDD CUP99 TCPDUMP data, while Liao and Vemuri (2002) experimented on the 1998 DARPA BSM audit data (DARPA, 1999). However, Middlemiss and Dick (2003) gave no details about the genetic algorithm and KNN, and Liao and Vemuri (2002) regarded the BSM audit data as documents and applied the document-classification weighting scheme TF–IDF (term frequency–inverse document frequency) to weight features.

Most of the above researchers evaluated their approaches on the KDD CUP99 TCPDUMP datasets. This means that their work was designed for off-line analysis or detection and thus could not meet the real-time demands of NIDSs, because the 41 features announced in KDD CUP99 were derived from connections, not packets. In fact, the 41 features presented in KDD CUP99 are complicated and varied (Middlemiss and Dick, 2003; The UCI KDD Archive). The first 9 of the 41 are intrinsic features; these describe the basic features of individual TCP connections and can be obtained from raw TCPDUMP files. Features 10 to 22 are content-based features, obtained by examining the data portion of a connection and suggested by domain knowledge. Features 23 to 31 are traffic-based features computed using a two-second time window (“time-based”), while Features 32 to 41 are also traffic-based features but computed using a window of 100 connections (“host-based”). Moreover, the collection of attacks that appeared in the KDD CUP99 is outdated; for instance, only a total of 12 DoS attacks appeared in it.
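The feature grouping described above can be written down as a small lookup table. This is only a restatement of the index ranges given in the paragraph; the dictionary keys and the helper name are hypothetical.

```python
# Index ranges (1-based, inclusive) of the 41 KDD CUP99 features, as grouped in the text.
KDD99_FEATURE_GROUPS = {
    "intrinsic": range(1, 10),            # features 1-9: basic TCP connection features
    "content": range(10, 23),             # features 10-22: from the data portion
    "traffic_time_based": range(23, 32),  # features 23-31: two-second time window
    "traffic_host_based": range(32, 42),  # features 32-41: window of 100 connections
}

def feature_group(index):
    """Return the group name for a 1-based KDD CUP99 feature index."""
    for name, indices in KDD99_FEATURE_GROUPS.items():
        if index in indices:
            return name
    raise ValueError(f"feature index out of range: {index}")
```

Every group except the first depends on connection-level or multi-connection state, which is why these features suit off-line analysis better than per-packet, real-time detection.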

One of the main goals of this study was to design an anomaly-based NIDS that is both effective and efficient, because real-time reaction is vital to NIDSs. All features used in this study were derived from packet headers and gathered using a short time window. Thus, the proposed method can be implemented to operate in real time, i.e., making one decision per time unit. The remainder of the paper is organized as follows. Section 2 briefly introduces the genetic algorithm, KNN classification, and the unsupervised clustering algorithm. Section 3 describes the proposed system in detail. Section 4 presents the experimental results. Section 5 gives concluding remarks.
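To make the notion of header-derived, per-window features concrete, here is a hedged sketch. The record layout `(timestamp, protocol, flags, length)`, the one-second window, and the five counters are hypothetical; the paper's actual feature list is not given in this text.

```python
from collections import defaultdict

def window_features(packets, window_seconds=1.0):
    """Group packet-header records into fixed time windows and summarise each window."""
    windows = defaultdict(
        lambda: {"packets": 0, "bytes": 0, "tcp_syn": 0, "udp": 0, "icmp": 0}
    )
    for timestamp, protocol, flags, length in packets:
        stats = windows[int(timestamp // window_seconds)]
        stats["packets"] += 1
        stats["bytes"] += length
        if protocol == "TCP" and "SYN" in flags and "ACK" not in flags:
            stats["tcp_syn"] += 1  # connection-opening SYN without ACK
        elif protocol == "UDP":
            stats["udp"] += 1
        elif protocol == "ICMP":
            stats["icmp"] += 1
    # One feature record per window, in time order.
    return [(index, windows[index]) for index in sorted(windows)]

# Hypothetical header records: (timestamp in seconds, protocol, TCP flags, length in bytes).
packets = [
    (0.10, "TCP", {"SYN"}, 60),
    (0.20, "TCP", {"SYN"}, 60),
    (0.35, "TCP", {"SYN", "ACK"}, 60),
    (0.90, "UDP", set(), 120),
    (1.05, "ICMP", set(), 84),
    (1.40, "TCP", {"ACK"}, 1500),
]
features = window_features(packets)
```

Because each window only needs counters updated per packet, a record is ready as soon as the window closes, which is what allows one classification decision per time unit.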

Background

This section briefly introduces K-nearest-neighbor (KNN) classification, Genetic Algorithm (GA), and an unsupervised clustering algorithm, MLBG (Rosenberger and Chehdi, 2000). One common classification scheme, based on the use of distance measures, is that of the K-nearest-neighbor. The KNN technique assumes that the entire sampling set includes not only the data in the set but also the desired classification for each item. When a classification is to be made for a new item, its distance to

Design of the proposed system

This section first presents all of the features that are initially considered in the design of the proposed system. It then states the encoding of a chromosome and the fitness function used to obtain an optimal weight vector for all applied features through GA; describes the details of selection, crossover and mutation in the GA; and finally, illustrates the system framework.
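The GA components named above (chromosome encoding, fitness function, selection, crossover, mutation) can be sketched as follows. The paper's actual operators are not shown in this text, so every choice here is an assumption: a real-valued chromosome with one weight per feature, leave-one-out 1-NN accuracy as fitness, tournament selection, single-point crossover, random-reset mutation, and elitism. The data are hypothetical.

```python
import random

def weighted_sq_dist(a, b, weights):
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

def fitness(weights, samples, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier under `weights`."""
    correct = 0
    for i, query in enumerate(samples):
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: weighted_sq_dist(samples[j], query, weights),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(samples)

def evolve_weights(samples, labels, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    n = len(samples[0])

    def score(chromosome):
        return fitness(chromosome, samples, labels)

    # Chromosome: one weight in [0, 1] per feature; uniform weights are included.
    population = [[1.0] * n] + [[rng.random() for _ in range(n)]
                                for _ in range(pop_size - 1)]
    for _ in range(generations):
        best_so_far = max(population, key=score)
        next_generation = [best_so_far[:]]  # elitism: the best chromosome survives
        while len(next_generation) < pop_size:
            # Tournament selection: the fitter of two random chromosomes is a parent.
            parent_a = max(rng.sample(population, 2), key=score)
            parent_b = max(rng.sample(population, 2), key=score)
            cut = rng.randrange(1, n) if n > 1 else 0  # single-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Random-reset mutation: occasionally redraw a gene.
            child = [rng.random() if rng.random() < mutation_rate else gene
                     for gene in child]
            next_generation.append(child)
        population = next_generation
    best = max(population, key=score)
    return best, score(best)

# Hypothetical data: feature 0 separates the classes, feature 1 is large-scale noise.
samples = [(0.10, 7.0, 0.20), (0.15, 1.0, 0.30), (0.20, 9.0, 0.10),
           (0.12, 4.0, 0.25), (0.18, 2.0, 0.15), (0.14, 8.0, 0.30),
           (0.90, 6.5, 0.80), (0.85, 1.5, 0.70), (0.95, 9.5, 0.90),
           (0.88, 3.5, 0.75), (0.92, 2.5, 0.85), (0.80, 8.5, 0.70)]
labels = ["normal"] * 6 + ["attack"] * 6
best_weights, best_fitness = evolve_weights(samples, labels)
```

Because the uniform weight vector is part of the initial population and the best chromosome is always carried over, the returned fitness can never be lower than that of unweighted 1-NN on the same data.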

Experiments and analyses

A commercial application, IP Traffic, which can generate any amount of TCP/UDP/ICMP/ARP packets, was used to produce background traffic. Two hosts running the IP Traffic application played the roles of sender and receiver, respectively. This study deployed the receiver in the LAN, with the sender transmitting packets through the Internet. Using IP Traffic, the user can choose protocols and set the contents of packets generated by mathematical laws (Pareto, Uniform, and Exponential), derived

Conclusions

This paper proposes genetic weighted KNN (K-nearest-neighbor) classifiers for anomaly detection of flooding attacks. In addition, an unsupervised clustering algorithm, MLBG, is applied to replace all instances in the sampling dataset with fewer, but more significant, centroids, thus greatly reducing the time spent on training and on online real-time anomaly identification, as well as improving the system’s performance in terms of overall accuracy. The proposed method was implemented to realize

Acknowledgments

This work was partially supported by the National Science Council under contracts NSC 95-2221-E-130-003, 96-2221-E-130-009, and 97-2221-E-130-014.

References (22)

  • Y. Liao et al. Use of K-nearest neighbor classifier for intrusion detection. Computers & Security (2002)
  • C.-H. Tsang et al. Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection. Pattern Recognition (2007)
  • Abbes T, Bouhoula A, Rusinowitch M. Protocol analysis in intrusion detection using decision tree. In: Proceedings of...
  • T. Auld et al. Bayesian neural networks for internet traffic classification. IEEE Transactions on Neural Networks (2007)
  • B. Caswell et al. Snort 2.0 intrusion detection (2003)
  • DARPA. Intrusion Detection Evaluation, 1999....
  • Heavens VX,...
  • Hofman A, Horeis T, Sick B. Feature selection for intrusion detection: an evolutionary wrapper approach. In:...
  • IP Traffic,...
  • John H. Holland. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence (1992)
  • Kaspersky Lab,...