Elsevier

Knowledge-Based Systems

Volume 78, April 2015, Pages 13-21

CANN: An intrusion detection system based on combining cluster centers and nearest neighbors

https://doi.org/10.1016/j.knosys.2015.01.009

Abstract

The aim of an intrusion detection system (IDS) is to detect various types of malicious network traffic and computer usage, which cannot be detected by a conventional firewall. Many IDS have been developed based on machine learning techniques. Specifically, advanced detection approaches created by combining or integrating multiple learning techniques have shown better detection performance than general single learning techniques. Feature representation is an important part of pattern classification that facilitates correct classification; however, there have been very few related studies focusing on how to extract more representative features for normal connections and effective detection of attacks. This paper proposes a novel feature representation approach, namely the cluster center and nearest neighbor (CANN) approach. In this approach, two distances are measured and summed: the first is the distance between each data sample and its cluster center, and the second is the distance between the data sample and its nearest neighbor in the same cluster. Then, this new one-dimensional distance-based feature is used to represent each data sample for intrusion detection by a k-Nearest Neighbor (k-NN) classifier. The experimental results based on the KDD-Cup 99 dataset show that the CANN classifier not only performs better than or similar to k-NN and support vector machines trained and tested on the original feature representation in terms of classification accuracy, detection rates, and false alarms, but also provides high computational efficiency in classifier training and testing (i.e., detection).

Introduction

Advancements in computing and network technology have made accessing the Internet an important part of our daily life. In addition, the number of people connected to the Internet is increasing rapidly. However, the high popularity of world-wide connections has led to security problems.

Traditionally, techniques such as user authentication, data encryption, and firewalls are used to protect computer security. Intrusion detection systems (IDS), which use specific analytical technique(s) to detect attacks, identify their sources, and alert network administrators, have recently been developed to monitor attempts to break security [3]. In general, IDS are developed for signature and/or anomaly detection. For signature detection, packets or audit logs are scanned to look for sequences of commands or events which were previously determined to be indicative of an attack. On the other hand, for anomaly detection, IDS use behavior patterns which could indicate malicious activities and analyze past activities to recognize whether the observed behaviors are normal. As early IDS largely used signature detection to detect all the attacks captured in their signature databases, they suffer from high false alarm rates. Recent innovative approaches, including behavior-based modeling, have been proposed to detect anomalies; these include data mining, statistical analysis, and artificial intelligence techniques [21], [28].

Much related work in the literature focuses on the task of anomaly detection based on various data mining and machine learning techniques. Many recent studies focus on combining or integrating different techniques in order to improve detection performance, measured by accuracy, detection rate, and/or false alarm rate (see Table 1 in Section 2.4).
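The three performance measures named above can be illustrated with a short sketch. It assumes the definitions commonly used in the intrusion detection literature (detection rate as the fraction of attacks that are flagged, false alarm rate as the fraction of normal connections that are flagged); the text here does not spell these out, and the function name `detection_metrics` and the label `"normal"` are hypothetical. Accuracy is computed on the binary attack-versus-normal decision, not per attack class.

```python
def detection_metrics(y_true, y_pred, normal_label="normal"):
    """Binary IDS metrics; any label other than normal_label counts as an attack.

    Assumed (standard) definitions, not taken from the source text:
      accuracy         = (TP + TN) / all samples
      detection rate   = TP / (TP + FN)
      false alarm rate = FP / (FP + TN)
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        is_attack = truth != normal_label
        flagged = pred != normal_label
        if is_attack and flagged:
            tp += 1      # attack correctly flagged
        elif is_attack:
            fn += 1      # attack missed
        elif flagged:
            fp += 1      # normal connection wrongly flagged
        else:
            tn += 1      # normal connection correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

A system can score well on accuracy while still missing rare attack types, which is why the three measures are usually reported together.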

However, there are two limitations to existing studies. First, although more advanced and sophisticated detection approaches and/or systems have been developed, very few have focused on feature representation for normal connections and attacks, which is an important issue in enhancing detection performance. A large number of related studies use either the KDD-Cup 99 or DARPA 1999 dataset for experiments; however, there is no exact answer to the question of which features of these datasets are more representative. Second, the time taken for training the systems and for the detection task is not considered in many evaluation methods. Recent systems that combine or integrate multiple techniques require much greater computational effort. As a result, this can degrade the efficiency of ‘on-line’ detection.

Therefore, in this study, we propose a novel feature representation method for effective and efficient intrusion detection that is based on combining cluster centers and nearest neighbors, which we call CANN. Specifically, given a dataset, the k-means clustering algorithm is used to extract the cluster centers of each pre-defined category. Then, the nearest neighbor of each data sample in the same cluster is identified. Next, two distances are summed: the distance between a specific data sample and the cluster centers, and the distance between this data sample and its nearest neighbor. This results in a new distance-based feature that represents the data in the given dataset. Consequently, a new dataset containing only one dimension (i.e., the distance-based feature representation) is used for k-Nearest Neighbor classification, which allows for effective and efficient intrusion detection.
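The transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes Euclidean distance, reads "the cluster centers" as the sum of distances to every center, uses a plain Lloyd's k-means with random initialization, and sets the neighbor term to zero for a cluster with a single member. The function names `kmeans_centers` and `cann_feature` are hypothetical.

```python
import numpy as np

def kmeans_centers(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centers, cluster index of each row)."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct samples (an assumption of this sketch).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members):              # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

def cann_feature(X, k):
    """Map each row of X to one distance-based value (sketch of the CANN idea)."""
    X = np.asarray(X, dtype=float)
    centers, assign = kmeans_centers(X, k)
    # Term 1: sum of distances from each sample to every cluster center.
    to_centers = np.linalg.norm(
        X[:, None, :] - centers[None, :, :], axis=2
    ).sum(axis=1)
    # Term 2: distance to the nearest other sample in the same cluster.
    to_neighbor = np.zeros(len(X))
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if len(idx) < 2:
            continue                      # singleton cluster: term stays 0 (assumption)
        sub = X[idx]
        pair = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
        np.fill_diagonal(pair, np.inf)    # a sample is not its own neighbor
        to_neighbor[idx] = pair.min(axis=1)
    return to_centers + to_neighbor
```

The dense pairwise distance matrices keep the sketch short but scale quadratically with cluster size, so a dataset with hundreds of thousands of connections would need an indexed nearest-neighbor search instead.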

The idea behind CANN is that the cluster centers or centroids for a given dataset offer discrimination capabilities for recognizing both similar and dissimilar classes [9], [10], [35]. Therefore, the distances between a data sample and these identified cluster centers are likely to provide some further information for recognition. Similarly, the distance between a specific data sample and its nearest data sample in the same class also has some discriminatory power.

The rest of this paper is organized as follows. Section 2 reviews related literature, including brief descriptions of supervised and unsupervised machine learning techniques. The techniques used in this paper are also described. Moreover, the techniques, datasets, and evaluation strategies considered in related work are compared. The proposed approach for intrusion detection is introduced in Section 3. Section 4 presents the experimental setup and results. Finally, some conclusions are provided in Section 5.

Section snippets

Machine learning

Machine learning requires a system capable of the autonomous acquisition and integration of knowledge. This capacity includes learning from experience, analytical observation, and so on, the result being a system that can continuously self-improve and thereby offers increased efficiency and effectiveness. The main goal of the study of machine learning is to design and develop algorithms and techniques that allow computers to learn. In general, there are two types of machine learning techniques,

The CANN process

The proposed approach is based on two distances which are used to determine the new feature: the distance between a specific data point and its cluster center, and the distance between that data point and its nearest neighbor. CANN comprises three steps, as shown in Fig. 2.

Given a training dataset T, the first step is to use a clustering technique to extract cluster centers. The number of clusters is based on the number of classes to be classified. Since intrusion detection is a classification problem, the chosen dataset has

The dataset

Since there is no standard dataset for intrusion detection, the dataset used in this paper is based on the KDD-Cup 99 dataset containing 494,020 samples, which is the most popular and widely used in related work (cf. Table 1). Specifically, each data sample is a network connection represented by a 41-dimensional feature vector, in which 9 features are of the intrinsic type, 13 features are of the content type, and the

Conclusion

This paper presents a novel feature representation approach, namely CANN, that combines cluster centers and nearest neighbors for effective and efficient intrusion detection. The CANN approach first transforms the original feature representation of a given dataset into a one-dimensional distance-based feature. Then, this new dataset is used to train and test a k-NN classifier.
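The second stage, k-NN classification over the one-dimensional feature, can be sketched as follows. This is a hedged illustration only: it assumes the distance-based feature has already been computed for both training and test samples, uses absolute difference as the distance, and breaks voting ties in favor of the closest neighbor. The function name `knn_predict_1d` and the default `k=1` are assumptions, not values taken from the text.

```python
import numpy as np
from collections import Counter

def knn_predict_1d(train_feature, train_labels, test_feature, k=1):
    """k-NN majority vote where every sample is a single scalar feature."""
    train_feature = np.asarray(train_feature, dtype=float)
    test_feature = np.asarray(test_feature, dtype=float)
    predictions = []
    for value in test_feature:
        # In one dimension the distance is just the absolute difference.
        order = np.argsort(np.abs(train_feature - value), kind="stable")
        nearest = order[:k]
        # Counter preserves first-seen order, so ties go to the closest neighbor.
        votes = Counter(train_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

Because each comparison involves one number per sample rather than a 41-dimensional vector, the per-query cost of the distance computation drops accordingly, which is consistent with the efficiency argument made for the one-dimensional representation.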

The experimental results show that CANN performs better than the k-NN and SVM classifiers over the

References (39)
