High performance traffic classification based on message size sequence and distribution

https://doi.org/10.1016/j.jnca.2016.09.013

Abstract

Classifying network flows into applications is a fundamental requirement for network administrators. Administrators used to classify network applications by examining transport layer port numbers or application level signatures. However, emerging network applications often send encrypted traffic with randomized port numbers, which makes them challenging to detect and manage. In this paper, we propose two statistics-based solutions, the message size distribution classifier (MSDC) and the message size sequence classifier (MSSC), which target classification accuracy and real-time response, respectively. The former aims to identify network flows accurately, while the latter aims to provide a lightweight, real-time solution. The two classifiers can be further combined into a hybrid solution that achieves both good detection accuracy and short response latency. Our numerical results show that the MSDC can make a decision by inspecting fewer than 300 packets and achieves a high detection accuracy of 99.98%. In contrast, the MSSC can respond after looking at only the first 15 packets, with a slightly lower accuracy of 94.99%. Our implementations on a commodity personal computer show that running the MSDC, the MSSC, and the hybrid classifier in-line achieves throughputs of 400 Mbps, 800 Mbps, and 723 Mbps, respectively.

Introduction

Classifying a network flow into its source application is essential for application-aware network management. By associating network flows with source applications, network administrators can enforce various access control policies to better utilize the available network resources. However, it is not easy to classify network flows into the corresponding applications correctly because of obfuscation techniques such as port number randomization, payload encryption, and network tunneling. As a result, characterization of Internet traffic has become one of the major challenges in communication networks over the past few years (Azzouna and Guillemin, 2003).

A number of approaches have been proposed to classify network flows. The most primitive solution is port-based classification, which builds mappings from transport layer port numbers to applications. For example, port 53 is mapped to DNS flows, ports 20 and 21 to FTP flows, and port 25 to SMTP flows. The advantage of this solution is its simplicity. However, it has an obvious flaw: an application can bypass detection by using an unmapped port number or even by masquerading behind an unrelated well-known port number. One common case is HTTP tunneling, which carries non-HTTP network flows over regular HTTP network flows using port 80. Therefore, port-based classification often fails to provide an accurate and reliable solution.
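The port-based approach described above can be illustrated with a minimal Python sketch. The table contains only the port examples mentioned in the text, and the function name classify_by_port is hypothetical, chosen for illustration rather than taken from any tool.

```python
# Hypothetical port-to-application table built from the examples in the text.
PORT_TO_APP = {53: "DNS", 20: "FTP", 21: "FTP", 25: "SMTP", 80: "HTTP"}


def classify_by_port(src_port, dst_port):
    """Return the application mapped to either port, or "unknown"."""
    for port in (dst_port, src_port):
        if port in PORT_TO_APP:
            return PORT_TO_APP[port]
    return "unknown"


dns_label = classify_by_port(49152, 53)
# An application that picks an unmapped port is not recognized at all.
evasive_label = classify_by_port(49152, 6881)
# Any flow on port 80 is labeled HTTP, even if it tunnels another protocol.
tunneled_label = classify_by_port(49152, 80)
```

The last two calls show the two failure modes named in the text: an unmapped port yields no classification, and a tunneled flow on port 80 is labeled HTTP regardless of what it actually carries.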

To overcome the drawback of port-based classification, researchers have proposed detecting network flows by finding specific signatures in payloads (Sen et al., 2004). Signature-based classification is considered more reliable, but it does not solve all the issues. First, an application can employ encryption or encapsulation techniques to intentionally obfuscate packet contents; second, this solution requires precise and up-to-date signatures, which might not be available for proprietary applications; third, it is computation-intensive to compare the characters in each payload against all the available signatures. These unresolved issues pushed the research community to seek better solutions that do not inspect payloads.

Many recent approaches classify network flows based on statistical features. These solutions assume that an application has certain unique statistical properties that can be obtained from empirical data and then used to classify flows into the corresponding applications. Common statistical features include the volume, the duration, the burstiness, the payload size, and the jitter of network flows. Statistics-based traffic classification is a good alternative because it can classify encrypted or obfuscated network flows.
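As a rough illustration of such features, the following Python sketch derives a few of them from the packets of one flow. The input layout, the function name flow_features, and the definition of jitter as the standard deviation of packet inter-arrival times are all assumptions made for this example; the text does not define how each feature is computed.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple per-flow statistics.

    packets: non-empty list of (timestamp_seconds, payload_bytes) tuples
    for one flow, in arrival order (an assumed input layout).
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Gaps between consecutive packet arrivals.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "volume_bytes": sum(sizes),
        "duration_s": times[-1] - times[0],
        "mean_payload_bytes": mean(sizes),
        # Assumed jitter definition: spread of the inter-arrival gaps.
        "jitter_s": pstdev(gaps) if gaps else 0.0,
    }


features = flow_features([(0.00, 100), (0.10, 1400), (0.30, 1400), (0.40, 60)])
```

None of these values depends on payload contents, which is why features of this kind remain usable when the traffic is encrypted.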

Roughan et al. (2004) statistically abstracted application features based on application layer protocol attributes and used the features to classify network flows into a specific class-of-service, while Moore and Zuev (2005) combined statistical analysis with Bayes' theorem to classify network flows. Selected features for the classifiers include the transport layer port number, the flow duration, the packet inter-arrival time, the payload size, and the effective bandwidth. Bernaille et al. (2006) adopted unsupervised clustering techniques to identify an application by using the sizes of the first five data packets of each TCP flow. The solution can make a decision in a short time, but it is sensitive to packet loss and out-of-order delivery.

Other researchers have attempted to classify network flows based on observed application behaviors. They monitored and modeled application behavior profiles and then used the profiles to classify flows. Karagiannis et al. (2005) presented BLINC, which analyzes the communication patterns of transport layer host behavior at three levels of detail (social, functional, and application) and then uses these application features to classify network flows into groups.

However, the classification accuracy obtained directly from statistical features or observed behaviors is not satisfactory because application behaviors are sophisticated. The network behavior of one application may be similar to that of another application. For example, the behavior of an HTTP file transfer could be similar to that of an FTP transfer. Conversely, not all flows generated by an application behave similarly. A BitTorrent client may simultaneously establish flows to retrieve the list of servers, look up resources, check peer status, and exchange files. Making good use of this scattered information can also help classification. Thus, to improve classification accuracy, an approach called the message size distribution classifier (MSDC) (Lu et al., 2012) was proposed to classify network flows into sessions and thereby obtain a complete picture of application behaviors.

MSDC contains two phases: flow classification and flow grouping. The former classifies network flows into applications by packet size distribution (PSD), and the latter groups related flows into a session by port locality. A flow is identified by its five-tuple, which consists of the source IP, destination IP, source port, destination port, and transport layer protocol. Once the PSD of a flow is determined, it is compared against the representative of each pre-selected application to decide which application the flow belongs to. In addition, flows are grouped into a session by checking port locality because the underlying operating systems often allocate consecutive port numbers to the flows of an application. If the flows of a session are classified into different applications, an arbitration algorithm based on majority voting is invoked to make corrections. Evaluations and online benchmarks show that MSDC obtains accurate results, makes a decision by inspecting at most 300 packets, and achieves an overall throughput exceeding 400 Mbps on a mainstream computer. Although MSDC classifies network flows accurately, it is relatively slow to reach a decision. Therefore, we propose another lightweight and real-time solution called the message size sequence classifier (MSSC).
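The two phases and the arbitration step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the text does not specify how distributions are binned or compared, so the fixed-width histogram (BIN_WIDTH, NUM_BINS), the L1 distance, the port gap threshold max_gap, and all function names are hypothetical choices.

```python
from collections import Counter

BIN_WIDTH = 100  # assumed histogram bin width in bytes
NUM_BINS = 16    # sizes of 1500 bytes or more share the last bin


def size_distribution(sizes):
    """Normalized histogram of packet sizes (sizes must be non-empty)."""
    hist = [0] * NUM_BINS
    for size in sizes:
        hist[min(size // BIN_WIDTH, NUM_BINS - 1)] += 1
    return [count / len(sizes) for count in hist]


def classify_flow(sizes, representatives):
    """Pick the application whose representative is closest in L1 distance."""
    psd = size_distribution(sizes)

    def distance(app):
        return sum(abs(a - b) for a, b in zip(psd, representatives[app]))

    return min(representatives, key=distance)


def group_sessions(flows, max_gap=2):
    """Group (host, src_port, label) flows of one host with nearby ports."""
    sessions = []
    previous = None
    for host, port, label in sorted(flows):
        if previous and previous[0] == host and port - previous[1] <= max_gap:
            sessions[-1].append((host, port, label))
        else:
            sessions.append([(host, port, label)])
        previous = (host, port)
    return sessions


def arbitrate(session):
    """Relabel every flow in a session with the session's majority label."""
    winner = Counter(label for _, _, label in session).most_common(1)[0][0]
    return [(host, port, winner) for host, port, _ in session]


# Made-up representatives and flows, for illustration only.
representatives = {
    "web": size_distribution([1400] * 8 + [60] * 2),
    "chat": size_distribution([80] * 9 + [200]),
}
flow_label = classify_flow([1400, 1400, 60, 1400], representatives)

flows = [("10.0.0.1", 5000, "web"), ("10.0.0.1", 5001, "web"),
         ("10.0.0.1", 5002, "chat"), ("10.0.0.1", 6000, "chat")]
sessions = group_sessions(flows)
corrected = arbitrate(sessions[0])
```

In this toy input, the flow on port 5002 is initially labeled "chat", but it shares a session with two "web" flows on adjacent ports, so the majority vote relabels it. The flow on port 6000 falls outside the assumed gap and forms its own session.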

MSSC classifies network flows into applications by the message sequences observed during the activity between a pair of endpoints. The packets exchanged between the two endpoints can be used to derive a sequence based on packet directions and packet sizes. Data exchanged between two endpoints must follow the protocol state machine and the protocol messages defined by the network applications involved. MSSC compares the message size sequence (MSS) of a flow against the representatives of all pre-selected applications to decide which application it belongs to. We also built a hybrid classifier by combining MSDC and MSSC to provide a balanced solution in terms of classification accuracy and response latency. Based on our analysis and evaluation, MSSC is able to respond after looking at only the first 15 packets and achieves a better throughput of 800 Mbps, while the hybrid classifier achieves 723 Mbps.
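The idea of deriving a sequence from packet directions and sizes can be sketched in Python as follows. The text does not state how a sequence is compared against the representatives, so the position-wise match with a size tolerance used here is a stand-in assumption, not the paper's method. The signed-size encoding, the skipping of empty packets, the constants TOLERANCE and min_score, and all function names are likewise hypothetical.

```python
MAX_MESSAGES = 15  # the text reports decisions within the first 15 packets
TOLERANCE = 16     # assumed size tolerance in bytes


def message_size_sequence(packets, client):
    """Encode a flow as signed sizes: positive when sent by the client.

    packets: list of (sender, payload_bytes) in arrival order. Packets
    without payload (for example, bare acknowledgments) are skipped,
    which is an assumption made for this sketch.
    """
    seq = [size if sender == client else -size
           for sender, size in packets if size > 0]
    return seq[:MAX_MESSAGES]


def similarity(seq, rep):
    """Fraction of positions that agree in direction and approximate size."""
    if not seq or not rep:
        return 0.0
    matches = sum(
        1 for a, b in zip(seq, rep)
        if (a > 0) == (b > 0) and abs(a - b) <= TOLERANCE
    )
    return matches / max(len(seq), len(rep))


def classify_sequence(seq, representatives, min_score=0.6):
    """Return the best-matching application, or "unknown" below min_score."""
    best = max(representatives, key=lambda app: similarity(seq, representatives[app]))
    if similarity(seq, representatives[best]) < min_score:
        return "unknown"
    return best


# Made-up representatives and packets, for illustration only.
representatives = {
    "app_a": [-90, 30, -120, 40, -20],
    "app_b": [400, -1400, -1400, -800],
}
packets = [("c", 410), ("s", 1400), ("s", 0), ("s", 1390), ("s", 790)]
seq = message_size_sequence(packets, client="c")
label = classify_sequence(seq, representatives)
```

Because only the first few messages are needed, a classifier of this shape can respond early in a flow's lifetime. The cost is less evidence per decision, which is consistent with the lower accuracy reported for MSSC compared with MSDC.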

The rest of this paper is organized as follows. In Section 2, we survey related research on network flow classification. Section 3 describes the features that the proposed solutions use to classify network flows. The proposed MSDC and MSSC algorithms are then presented in Section 4. Section 5 provides an analysis of the proposed algorithms. The performance of the proposed solutions is discussed in Section 6. Finally, a conclusion is given in Section 7.

Section snippets

Related work

Various statistics-based network flow classification approaches have been proposed in recent years (Gomes et al., 2013). The advantage of these methods is the ability to classify an application without inspecting packet payloads. We divide the approaches into two classes: flow-level classification and session-level classification. The former classifies each flow independently while the latter attempts to group network flows as sessions and then classifies network

Features

The proposed solution classifies network flows based on message size features of network flows. We assume that network application protocols can be classified as control protocols, data protocols, and control-data mixed protocols. Each application protocol would have several types of protocol messages. The messages carried in a protocol message would have fixed, limited, or similar formats and sizes. In addition, the order of delivering protocol messages would follow the state machine defined

Classification approaches

With the selected features, we propose the MSDC and MSSC solutions. The design objectives of the two classifiers are different. The former aims to provide an accurate output while the latter aims to provide a lightweight and real-time solution. In addition, we combined the two classifiers to build a hybrid solution. Both MSDC and MSSC need to collect application network flows to develop application representatives. There are two ways that can be used: (1) capture all network flows

Modeling and analysis

We discuss the estimated complexity and accuracy of the proposed MSSC and MSDC solutions in this section.

Experimental study

We further evaluated the performance of the proposed solutions using real traces. Two different data sets were used. Both were captured from operational instances running in campus networks, not from a traffic generator or a lab. Each data set was split into two parts, one for training and the other for testing. The training data contained all selected application traffic and was used only to develop application representatives. The testing data was only used for the purpose of

Conclusion

We proposed MSDC, MSSC, and a hybrid solution to classify network flows into their corresponding applications. The solutions are built on features retrieved from packet sizes and port locality. The MSDC solution is able to make a decision by inspecting fewer than 300 packets and achieves a high session-level classification accuracy of 99.98%. However, it has a relatively low throughput of 400 Mbps on a commodity PC. To shorten the classification latency, we also explored the possibility of
