Elsevier

Knowledge-Based Systems

Volume 106, 15 August 2016, Pages 242-250

Human error tolerant anomaly detection based on time-periodic packet sampling

https://doi.org/10.1016/j.knosys.2016.05.050

Abstract

This paper focuses on an anomaly detection method that uses a baseline model describing the normal behavior of network traffic as the basis for comparison with the audit network traffic. In the anomaly detection method, an alarm is raised if a pattern in the current network traffic deviates from the baseline model. The baseline model is often trained using normal traffic data extracted from traffic data for which all instances (i.e., packets) are manually labeled by human experts in advance as either normal or anomalous. However, since humans are fallible, some errors are inevitable in labeling traffic data. Therefore, in this paper, we propose an anomaly detection method that is tolerant to human errors in labeling traffic data. The fundamental idea behind the proposed method is to take advantage of the lossy nature of packet sampling for the purpose of correcting/preventing human errors in labeling traffic data. By using real traffic traces, we show that the proposed method can better detect anomalies regarding TCP SYN packets than the method that relies only on human labeling.

Introduction

Anomaly detection is the process of finding patterns in current network traffic that do not conform to legitimate (normal) behavior. The nonconforming patterns are called anomalies. Anomalies such as worms, port scans, denial-of-service attacks, and spoofing seriously affect the operation and normal use of the network and may cause an enormous waste of network resources and economic loss. Consequently, anomaly detection has become an important issue in network monitoring and network security [2], [3], [4], [5], [6], [7].

The design of an anomaly detection method usually relies on a baseline model describing the normal behavior of network traffic. An alarm is raised if a pattern in the current network traffic deviates from the baseline model. The baseline model is often trained using normal traffic data extracted from traffic data for which all instances (i.e., packets) are manually labeled by human experts in advance as either normal or anomalous. However, since humans are fallible, some errors are inevitable in labeling traffic data. Therefore, to achieve an efficient anomaly detection system, a method must be developed that extracts normal traffic data required for training the baseline model in a manner that is tolerant to human error in labeling traffic data.

In this paper, we have developed two methods that employ time-periodic packet sampling in conjunction with human labeling to solve this problem. That is, the two proposed methods employ time-periodic packet sampling to assist human experts in extracting the normal traffic data required for training the baseline model. Since time-periodically sampled traffic contains a higher ratio of normal packets than the original traffic data [8], time-periodic packet sampling is a promising means of reducing the impact of human errors in labeling traffic data. Note that the two proposed methods are practically useful because they can reduce the effort human experts spend extracting normal traffic by using time-periodic packet sampling, which has very low processing complexity. The difference between the two proposed methods lies in the operational order of human labeling and time-periodic packet sampling. The first method employs packet sampling after human labeling to correct human errors in labeling traffic data. That is, the first method makes primary use of human cognition to label traffic data and then secondary use of time-periodic packet sampling to correct human errors in the labeling process. This method is called the ls-method (labeling-and-sampling method) in this paper. The second method employs time-periodic packet sampling before human labeling to prevent human errors in labeling traffic data. That is, the second method makes primary use of time-periodic packet sampling to produce cleaner traffic data that contains a higher ratio of normal packets than the original (unlabeled) traffic data and then secondary use of human cognition to label the sampled traffic data. This method is called the sl-method (sampling-and-labeling method) in this paper. In both the ls- and sl-methods, the extracted normal traffic data is used for training a baseline model.
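To make the sampling step concrete, the following is a minimal sketch of time-periodic packet sampling, assuming packets are represented as (timestamp, payload) pairs sorted by arrival time and that the sampler keeps the first packet arriving at or after each tick of the sampling period; the function name and data layout are illustrative, not taken from the paper.

```python
# Hedged sketch of time-periodic packet sampling.
# Assumption: `packets` is a time-sorted list of (timestamp, payload)
# tuples; we keep the first packet seen at or after each periodic tick.

def time_periodic_sample(packets, period):
    """Select the first packet arriving at or after each tick of `period`."""
    sampled = []
    next_tick = None
    for ts, pkt in packets:
        if next_tick is None:
            next_tick = ts  # anchor the tick grid at the first arrival
        if ts >= next_tick:
            sampled.append((ts, pkt))
            # advance to the next tick strictly after this packet
            while next_tick <= ts:
                next_tick += period
    return sampled

packets = [(0.0, "a"), (0.3, "b"), (1.1, "c"), (1.2, "d"), (2.5, "e")]
print(time_periodic_sample(packets, 1.0))  # [(0.0, 'a'), (1.1, 'c'), (2.5, 'e')]
```

Because at most one packet survives per period, a burst of anomalous packets (e.g., a SYN flood) is thinned far more aggressively than background traffic, which is the intuition behind the sampled data containing a higher ratio of normal packets.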

This paper is organized as follows. Section 2 briefly reviews related work on anomaly detection and packet sampling. Section 3 explains the fundamental idea behind the proposed method. Section 4 describes the experimental results obtained using actual traffic traces. Section 5 concludes the paper with a summary of the key points. The main differences between this paper and its original version [1] are the discussion about the ensemble-based anomaly detection given in Section 3.6 and the additional experimental results given in Section 4.4.

Section snippets

Intrusion detection

The process of securing a network infrastructure by scanning the network for suspicious activities is generically referred to as intrusion detection. The approaches to intrusion detection can be roughly classified into two categories: signature detection and anomaly detection.

Procedure of proposed methods

We propose two methods (the ls- and sl-methods) that use time-periodic packet sampling to reduce the impact of human errors in labeling traffic data. The difference between the two proposed methods lies in the operational order of human labeling and time-periodic packet sampling (see Fig. 1). In the ls-method, human labeling is performed on the original traffic data first, where the label information may include some errors. Then, time-periodic packet sampling is performed on the labeled traffic data.
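The two orderings can be sketched as follows; `label_fn` and `sample_fn` are hypothetical stand-ins for the (fallible) human labeling step and the time-periodic sampler, and both pipelines return the extracted normal traffic used to train the baseline model.

```python
# Hedged sketch of the ls- and sl-method pipelines. `label_fn` maps a
# packet to "normal" or "anomalous"; `sample_fn` maps a list to a
# sampled sublist (e.g., a time-periodic sampler). Both names are
# illustrative stand-ins, not APIs from the paper.

def ls_method(traffic, label_fn, sample_fn):
    """Label first (errors possible), then sample the labeled data."""
    labeled = [(pkt, label_fn(pkt)) for pkt in traffic]
    sampled = sample_fn(labeled)
    return [pkt for pkt, lab in sampled if lab == "normal"]

def sl_method(traffic, label_fn, sample_fn):
    """Sample first (cleaner data), then label the sampled data."""
    sampled = sample_fn(traffic)
    return [pkt for pkt in sampled if label_fn(pkt) == "normal"]

# Toy usage with a trivial every-other-element sampler.
traffic = list(range(10))
label_fn = lambda p: "normal" if p % 3 else "anomalous"
sample_fn = lambda xs: xs[::2]
print(ls_method(traffic, label_fn, sample_fn))  # [2, 4, 8]
print(sl_method(traffic, label_fn, sample_fn))  # [2, 4, 8]
```

In the ls-method the sampler acts on already-labeled data and so can only dilute labeling errors, whereas in the sl-method the sampler reduces the volume and anomaly ratio of the data the expert must label in the first place.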

Traffic data

We used two-way traffic traces provided by the UMass Trace Repository [31]. The traces were measured at the UMass Internet gateway router. The UMass campus was connected to the Internet through Verio, a commercial ISP, and Internet2. Both connections were Gigabit Ethernet links. In particular, we used the “Gateway Link 3 Trace,” the data of which was measured every morning from 9:30 to 10:30 from July 16 to 22, 2004. All the datasets were manually labeled. The labels may be manipulated in the

Conclusion

In this paper, we proposed two anomaly detection methods that are tolerant to human errors in labeling traffic data, where the labeled traffic data is used for training the baseline model. The two proposed methods, called the ls- and sl-methods, use time-periodic packet sampling to correct/prevent human errors in labeling traffic data. The difference between the two proposed methods lies in the operational order of human labeling and time-periodic packet sampling. In the ls-method, human labeling

Acknowledgments

This work was supported in part by the Japan Society for the Promotion of Science through Grants-in-Aid for Scientific Research (C) (26330112).

References (31)

  • R. Richardson, 2010/2011 CSI Computer Crime and Security Survey, 2011,...
  • Defending TCP Against Spoofing Attacks (RFC 4953), (http://tools.ietf.org/html/rfc4953), (Accessed: April 26,...
  • M. Uchida et al., Unsupervised ensemble anomaly detection using time-periodic packet sampling, IEICE Trans. Commun. (2012)
  • Snort, (http://www.snort.org), (Accessed: April 26,...
  • Bro, (http://bro-ids.org), (Accessed: April 26,...
A part of this paper appeared in the Proceedings of INCoS 2014 [1].
