Elsevier

Computer Networks

Volume 176, 20 July 2020, 107258

CETAnalytics: Comprehensive effective traffic information analytics for encrypted traffic classification

https://doi.org/10.1016/j.comnet.2020.107258

Highlights

  • Specify the comprehensive effective traffic information and propose the CETAnalytics framework for application encrypted traffic classification, which consists of a preprocessing module, a payload analytics module, a payload statistical characteristics analytics module and a classification module.

  • Implement the CETAnalytics framework. The whole implementation is built on neural networks. In particular, a subnetwork named Attract is proposed as the content analytics module to analyze the encrypted content.

  • Conduct solid experiments on ISCX VPN2016 datasets. We conduct four experiments to evaluate the proposed framework and implementation. The results show that the proposed method outperforms the state of the art in both precision and generalization performance.

Abstract

Encrypted traffic classification is of great significance for advanced network services. Although encryption methods appear unbroken in protecting users’ privacy, existing studies have demonstrated that carefully designed approaches based on machine learning or deep learning can identify which application type, or even which specific application, generated the traffic. However, most previous approaches either lack generalization ability across different tasks or can hardly achieve precise performance. One of the reasons is that they perform the classification from an incomplete perspective. To the best of our knowledge, none of them considers combining payload content and payload statistics for encrypted traffic classification. Hence, in this paper, we propose the comprehensive effective traffic information analytics (CETAnalytics) framework to tackle the problem. First, the comprehensive effective traffic information is specified and the motivation for combining the two aspects of the traffic is introduced. Based on this specification, the CETAnalytics framework, which uses the consolidated information, and its implementation details are elaborated. Briefly, the implementation is built entirely on neural networks for their high flexibility and powerful functionality in integrating the analytics of the two dimensions. Among the challenges tackled in the implementation, a substructure network named Attract, designed to match the traffic structure, is proposed to realize the payload content analytics, which is one of the highlights of our work. For evaluation, several solid experiments are conducted using three designed tasks derived from the ISCXVPN2016 dataset. The experiment results show (i) the effectiveness of the framework design for encrypted traffic classification and (ii) that our implementation can achieve both high precision and robust generalization performance at the same time.

Introduction

Identifying traffic types is a fundamental and critical ability for network research. Many applications depend on the traffic type to provide more advanced services. However, with the recent widespread use of application traffic encryption, identifying application traffic types faces the challenge of unknown content, especially in security operations based on traditional methods.

Among all mature methods used in industry, traditional deep packet inspection (DPI) was once considered the most reliable method for the traffic identification task, because it analyzes the packet content rather than only making judgments based on the port number or behavior rule [30], [43], [44]. However, since the traditional DPI method is mainly implemented by searching for specific string patterns or matching regular expressions, it can hardly be adapted to encrypted content [52]. Snort [38] and Bro [36] are the two most popular mature open source intrusion detection tools deployed in current operational environments. With the widespread adoption of traffic encryption methods, these traditional DPI products become less effective in encrypted contexts because the payload cannot be understood without the session key.
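To illustrate why pattern-based DPI loses its footing once payloads are encrypted, the following minimal sketch matches a regular-expression signature against a plaintext HTTP request and against the same bytes after encryption. The signature, the `toy_encrypt` helper and its SHA-256-based keystream are illustrative assumptions; they are not taken from Snort, Bro or this paper, and the toy cipher merely stands in for a real algorithm such as AES.

```python
import hashlib
import re

# Hypothetical DPI signature: the request line of an HTTP/1.x request.
HTTP_SIGNATURE = re.compile(rb"^(?:GET|POST) \S+ HTTP/1\.[01]\r\n")


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream (toy stand-in for real encryption)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    # zip() stops at len(data), so surplus keystream bytes are ignored.
    return bytes(b ^ k for b, k in zip(data, stream))


def dpi_match(payload: bytes) -> bool:
    """Return True when the payload matches the plaintext signature."""
    return HTTP_SIGNATURE.search(payload) is not None


plaintext = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
ciphertext = toy_encrypt(plaintext, key=b"session-key")

print(dpi_match(plaintext))   # the signature is visible in the clear
print(dpi_match(ciphertext))  # the same request no longer exposes the pattern
```

Applying `toy_encrypt` again with the same key restores the plaintext, which mirrors the point made above: the content only becomes inspectable again for a party that holds the session key.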

To solve this challenge, existing studies on encrypted traffic classification mainly adopt machine learning and deep learning for their abstract and strong modeling ability. According to the information used as input, previous studies can be roughly divided into two main categories: payload based studies from the content dimension and statistics based studies from the time dimension [2], [45]. Generally, payload based studies first splice the traffic content and then apply different methods to classify the traffic types, among which deep learning is used most often. Statistics based studies, in contrast, first generate flow or session statistics and then use machine learning models to identify encrypted traffic types.

Although these previous studies can achieve relatively high performance, they still have apparent limitations: (i) most of the previous approaches can hardly achieve balanced performance in high precision and robust generalization. One of the reasons for this is that they do not integrate the statistics and the payload for encrypted traffic type identification. Thus, research in either category loses the other part of the information and does not make efficient use of the raw traffic [1], [4], [5], [7], [11], [17], [31], [46], [48], [49], [53], [54]. Integrating the information from the two dimensions is better than taking only one aspect, because different applications may behave similarly in one aspect. For example, email applications and chat applications behave similarly in statistics, as both perform small-quantity transmissions in a period; however, they behave differently in their content. Conversely, file transfer applications and torrent applications both transfer file content, but they behave differently in statistics, as their transmissions in the two directions differ. (ii) Leaked elements are taken into consideration for evaluation, which makes the methods less convincing [17], [21], [31], [45], [49], [54]. Especially in payload based studies, the packet headers under the transport layer are taken as part of the determinant [17], [31], [49], [54]. Although the underlying protocols are components of the traffic, they are designed for transmission control rather than application type identification. Besides, the messages contained within the protocol part are too few to provide enough differentiating information among applications. Thus, they can hardly provide differences for encrypted traffic identification. Nevertheless, some protocol fields, such as the Address, Port and Time to Live (TTL), can identify applications within a dataset because the traffic of each application was generated in the same network environment; these fields are informative enough to differentiate the traffic in datasets even though they are not real differences among applications. Therefore, it is not reliable to utilize the content of these underlying protocols for classification.
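One way to keep such leaked fields out of the classifier input is to discard the network and transport headers and keep only the bytes above them. The sketch below does this for a raw IPv4 packet carrying TCP or UDP. It is an illustration under stated assumptions (no link-layer header, IPv4 only, hypothetical function name `transport_payload`), not the preprocessing code used by the authors.

```python
import struct


def transport_payload(ip_packet: bytes) -> bytes:
    """Return the bytes above the TCP/UDP header of a raw IPv4 packet."""
    version_ihl = ip_packet[0]
    if version_ihl >> 4 != 4:
        raise ValueError("only IPv4 is handled in this sketch")
    ip_header_len = (version_ihl & 0x0F) * 4            # IHL is counted in 32-bit words
    total_len = struct.unpack("!H", ip_packet[2:4])[0]  # excludes link-layer padding
    protocol = ip_packet[9]
    segment = ip_packet[ip_header_len:total_len]
    if protocol == 6:                                   # TCP
        tcp_header_len = (segment[12] >> 4) * 4         # data offset, in 32-bit words
        return segment[tcp_header_len:]
    if protocol == 17:                                  # UDP has a fixed 8-byte header
        return segment[8:]
    raise ValueError("unsupported transport protocol")


# Build a small TCP/IPv4 packet by hand; addresses, ports and TTL live in the headers.
payload = b"\x17\x03\x03\x00\x05hello"
tcp = struct.pack("!HHIIBBHHH", 50000, 443, 0, 0, 5 << 4, 0x18, 1024, 0, 0)
ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(tcp) + len(payload),
                 0, 0, 64, 6, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
packet = ip + tcp + payload

print(transport_payload(packet) == payload)
```

After this step the Address, Port and TTL values are no longer visible to any later module, so a model cannot rely on them.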

To tackle the challenges mentioned above, the CETAnalytics framework is proposed, which adopts Comprehensive Effective Traffic Information (CETI) for encrypted application traffic classification. First, to make efficient use of the raw traffic, the integrated information generated from two dimensions, namely the combination of payload and statistics, is adopted as the comprehensive information of the traffic. Second, to achieve more convincing results, both dimensions are generated from the payload as the effective information of the traffic, which reduces the influence of the underlying protocols. Specifically, the traffic payload organized in a packet-session structure and the payload statistical characteristics (PSC) are adopted as the determinants of the application traffic from the content dimension and the time dimension, respectively. Based on the CETI specified, a brief overview of the classification framework CETAnalytics is provided.
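As a rough illustration of what payload statistical characteristics could look like, the sketch below derives a few per-direction statistics from payload lengths and timestamps only. The concrete feature set (packet count, byte count, mean payload length, mean inter-arrival gap) and the function name `payload_statistics` are assumptions made for this example; this section does not list the PSC actually used.

```python
from statistics import mean


def payload_statistics(packets):
    """Summarize a session from payload sizes and timestamps only.

    packets: list of (timestamp_seconds, direction, payload_length) tuples,
    sorted by timestamp, where direction is "up" or "down".
    """
    stats = {}
    for direction in ("up", "down"):
        # Packets without payload (e.g. bare ACKs) carry no application data.
        selected = [(t, n) for t, d, n in packets if d == direction and n > 0]
        lengths = [n for _, n in selected]
        times = [t for t, _ in selected]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        stats[direction] = {
            "packets": len(lengths),
            "bytes": sum(lengths),
            "mean_length": mean(lengths) if lengths else 0.0,
            "mean_gap": mean(gaps) if gaps else 0.0,
        }
    return stats


session = [
    (0.0, "up", 100),
    (0.5, "down", 1400),
    (1.0, "down", 1400),
    (2.0, "up", 0),      # empty payload, ignored
    (2.5, "up", 300),
]
print(payload_statistics(session))
```

Because the statistics are computed from payload-bearing packets only, header fields do not enter this dimension either.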

As shown in Fig. 1, the CETAnalytics framework mainly includes three procedures realized by four modules. The first procedure, consisting of one module, preprocesses the raw traffic and outputs the payload and PSC as formatted CETI. The analytics procedure follows, consisting of two parallel modules: the payload analytics module and the PSC analytics module. Since payload and PSC are uncorrelated, the two corresponding analytics modules are also independent. The last procedure, consisting of one module, is the classification procedure, which synthesizes the analytics results and makes predictions. Generally, the input and classification modules are routine parts of machine learning architectures without much innovation in their mechanisms. The highlights of the CETAnalytics framework are the two analytics modules that perform the encrypted payload analytics and the PSC analytics, whose implementations are described in Sections 3.4 and 3.5.
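The data flow through the three procedures and four modules can be sketched as plain function composition. Every function body below is a deliberately trivial placeholder: in the paper the analytics and classification modules are neural networks, and the feature choices, the `rule` callback and all names here are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def preprocess(session: Sequence[bytes]) -> Tuple[List[bytes], Vector]:
    """Procedure 1: turn raw per-packet payloads into (payload, PSC)."""
    payloads = [p for p in session if p]              # drop empty payloads
    lengths = [len(p) for p in payloads]
    psc = [float(len(lengths)), float(sum(lengths))]  # placeholder PSC: count, bytes
    return payloads, psc


def payload_analytics(payloads: List[bytes]) -> Vector:
    """Procedure 2a: placeholder content feature (mean byte value scaled to [0, 1])."""
    data = b"".join(payloads)
    return [sum(data) / (255.0 * len(data))] if data else [0.0]


def psc_analytics(psc: Vector) -> Vector:
    """Procedure 2b: placeholder statistical feature (mean payload length)."""
    count, total = psc
    return [total / count if count else 0.0]


def classify(content: Vector, stats: Vector, rule: Callable[[Vector], str]) -> str:
    """Procedure 3: synthesize both analytics outputs and predict a label."""
    return rule(content + stats)


def cet_pipeline(session: Sequence[bytes], rule: Callable[[Vector], str]) -> str:
    payloads, psc = preprocess(session)
    # The two analytics modules only share the preprocessing output.
    return classify(payload_analytics(payloads), psc_analytics(psc), rule)


label = cet_pipeline(
    [b"\x00\x00", b"", b"\xff\xff"],
    rule=lambda v: "bulk" if v[1] > 1000 else "interactive",
)
print(label)
```

The point of the sketch is the wiring: the two analytics functions never call each other, and only the final step sees both feature vectors.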

Although the proposed framework can achieve more accurate and more convincing results in theory, challenges remain for the framework implementation: (i) model training efficiency. Unlike methods that take only one aspect for classification, CETAnalytics separates the analytics procedure from the classification procedure, which in general requires two training procedures to train the total model. Thus, the implementation should be more integrated to avoid the unnecessary time cost of the gap between them; (ii) an effective method is required for encrypted payload processing. Specific features such as string patterns can no longer be captured in the encrypted context. Though acquiring the session key or the customized protocol format makes it possible to decrypt the payload, this is not practical for a large number of sessions.

In the framework implementation, these two challenges are tackled as follows: (i) neural networks are used for the whole framework implementation. The neural network is the basis of deep learning methods, and it is flexible for designing different functional parts. Hence, the modules can be designed separately but used as a whole model with the neural network as the base cell. Compared to a method that implements CETAnalytics with separate machine learning methods, this integrated method requires only one training episode for both the analytics procedure and the classification procedure. Thus, CETAnalytics can be implemented efficiently; (ii) a substructure network named Attract, which uses several deep learning technologies, is proposed for encrypted payload processing. Attract is designed based on the traffic structure. As the traffic is a hierarchical structure composed of sessions, packets and bytes, which is similar to natural language, Attract is also designed as a hierarchical structure to match the traffic structure, following [33], [41]. In addition, Attract can be used either as the payload analytics module of the total model or as a single independent model.
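The session, packet and byte hierarchy described above can be made concrete by shaping each session into a fixed-size grid of packets by bytes, which is the kind of input a hierarchical network can consume. The sketch below shows only this shaping step; it is not the Attract network, and the grid sizes, the zero padding and the name `to_packet_session` are assumptions chosen for illustration.

```python
from typing import List, Sequence


def to_packet_session(
    payloads: Sequence[bytes],
    max_packets: int = 4,
    max_bytes: int = 8,
    pad: int = 0,
) -> List[List[int]]:
    """Arrange a session as max_packets rows of max_bytes byte values.

    Long payloads and long sessions are truncated; short ones are padded,
    so every session yields a grid of the same shape.
    """
    rows = []
    for payload in payloads[:max_packets]:
        row = list(payload[:max_bytes])           # byte level: ints in 0..255
        row += [pad] * (max_bytes - len(row))
        rows.append(row)                          # packet level: one row per packet
    while len(rows) < max_packets:                # session level: fixed row count
        rows.append([pad] * max_bytes)
    return rows


grid = to_packet_session([b"\x01\x02\x03", b"\x04" * 10], max_packets=3, max_bytes=4)
print(grid)  # [[1, 2, 3, 0], [4, 4, 4, 4], [0, 0, 0, 0]]
```

A model that mirrors the hierarchy could then encode each row into a packet representation first and combine the packet representations into a session representation afterwards, much as words are combined into sentences in hierarchical text models.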

The main contributions of this paper are as follows:

  • Specify the CETI from two dimensions and propose the CETAnalytics framework for encrypted traffic classification.

  • Propose the Attract based on traffic structure as the payload analytics module implementation of the CETAnalytics framework to perform the encrypted payload analytics.

  • Conduct solid experiments on ISCX VPN2016 datasets. The experiment results demonstrate (i) the effectiveness of the CETAnalytics framework idea and (ii) the balanced performance achieved by our implementation in both high precision and robust generalization.

The rest of this paper is organized as follows. Section 2 summarizes the application traffic encryption technologies and related work on encrypted traffic classification. Section 3 describes the CETAnalytics framework and the implementation details. Section 4 describes the experiment setup. Section 5 reports the experiment evaluation results and analysis. Section 6 discusses the experiment results and the limitations of our approach. Section 7 provides the conclusion and a discussion of further studies that can be undertaken.

Section snippets

Overview of encryption methods

Application traffic encryption methods can be divided into two main categories [49]: (i) protocol encapsulation. Usually, protocol encapsulation technology integrates a standard encryption algorithm, like AES, to protect the communication content. One widely adopted protocol encapsulation technology is SSL/TLS. This technology works on the presentation layer according to the OSI/ISO model to protect especially the web applications. Another widely adopted technology

Design and implementation

In this section, the workflow of the CETAnalytics framework utilizing the CETI to achieve encrypted traffic classification is firstly introduced. Although the framework can theoretically perform better, it requires implementation for evaluation. Thus, the implementation overview is provided to specifically illustrate the function of each module, and the implementation details of each specific module within CETAnalytics framework are then elaborated to clearly show how we tackled the efficiency

Experiment preparation

In this section, the experiment preparation including the experiment environment, description of the dataset, and evaluation framework is provided.

Experimental evaluation

In this section, the investigation and evaluation comparing with other baselines are presented. First of all, the baselines are introduced including the payload based and flow based approaches. And then four experiments are carried out to comprehensively evaluate our approach. First, the model parameters selection experiment is conducted to achieve the due performance of our approach for further experiments based on task I. Second, the external comparison experiment is conducted to show the

Discussion

In this section, the experiment results and the limitations of our implementation in a real environment are discussed.

The experiment results show the advantages of the implementation of CETAnalytics; however, it remains to be explained why, given the uniform distribution of ciphertext bytes assured by standard cryptography [12], [16], such outstanding performance can be achieved on encrypted traffic at all. The results are explained from two perspectives: (i) not all applications adopt protocol

Conclusion

As traffic analytics research has been widely adopted in the past decade, the proposed framework CETAnalytics and its implementation for encrypted traffic classification are introduced in this paper. The proposed method integrates content dimensional analytics and time dimensional analytics into CETI analytics. In its implementation, the neural network is adopted as the base cell to implement the whole framework for its flexible modular combination and powerful representation learning abilities. The

CRediT authorship contribution statement

Cong Dong: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Chen Zhang: Software, Validation, Investigation. Zhigang Lu: Resources, Methodology. Baoxu Liu: Formal analysis, Writing - review & editing. Bo Jiang: Resources, Formal analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research is supported by National Key Research and Development Program of China (No. 2019QY1303, No. 2018YFB0803602), and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02040100), and National Natural Science Foundation of China (No. 61702508, No. 61802404). This work is also supported by the Program of Key Laboratory of Network Assessment Technology, the Chinese Academy of Sciences; Program of Beijing Key Laboratory of Network Security and Protection

Cong Dong received the B.S. degree from Tianjin University in 2017. He is currently pursuing the Ph.D. degree with the Institute of Information Engineering, University of Chinese Academy of Sciences, Beijing, China. His research interests include machine learning and network security.

References (54)

  • B. Anderson et al. Identifying encrypted malware traffic with contextual flow data. Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security (2016)
  • R. Bar-Yanai et al. Realtime classification for encrypted traffic. International Symposium on Experimental Algorithms (2010)
  • J. Bradbury, S. Merity, C. Xiong, R. Socher, Quasi-recurrent neural networks, arXiv:1611.01576...
  • J. Caballero et al. Dispatcher: enabling active botnet infiltration using automatic protocol reverse-engineering. Proceedings of the 16th ACM Conference on Computer and Communications Security (2009)
  • Z. Cai et al. isanon: flow-based anonymity network traffic identification using extreme gradient boosting. 2019 International Joint Conference on Neural Networks (IJCNN) (2019)
  • A. Canteaut et al. Stream ciphers: a practical solution for efficient homomorphic-ciphertext compression. J. Cryptol. (2018)
  • N.V. Chawla et al. Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002)
  • G. Cheng et al. Encrypted traffic identification based on n-gram entropy and cumulative sum test. Proceedings of the 13th International Conference on Future Internet Technologies (2018)
  • J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling,...
  • R. Cramer et al. A practical public key cryptosystem provably secure against adaptive chosen ciphertext attack. Annual International Cryptology Conference (1998)
  • S. Cui et al. A session-packets-based encrypted traffic classification using capsule neural networks. 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (2019)
  • W. Cui et al. Discoverer: Automatic protocol reverse engineering from network traces. USENIX Security Symposium (2007)
  • C.I. for Cybersecurity (CIC)., Cicflowmeter, (http://netflowmeter.ca/). Accessed November 30,...
  • J.V. Davis et al. Information-theoretic metric learning. Proceedings of the 24th international conference on Machine learning (2007)
  • W. De Donato et al. Traffic identification engine: an open platform for traffic classification. IEEE Netw. (2014)
  • M. Denil, A. Demiraj, N. Kalchbrenner, P. Blunsom, N. de Freitas, Modelling, visualising and summarising documents with...
  • G. Drapper-Gil et al. Characterization of encrypted and vpn traffic using time-related features. Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016) (2016)

    Chen Zhang received the M.S. degree in China University of Geosciences in 2016. He is an engineer at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network security and network situational awareness.

    Zhigang Lu received the Ph.D. degree in Chinese Academy of Sciences in 2010. He is an assistant professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network security and network situational awareness.

    Baoxu Liu received the Ph.D. degree in Chinese Academy of Sciences in 2002. He is a professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network security, network situational awareness and threat intelligence.

    Bo Jiang received the Ph.D. degree in Chinese Academy of Sciences in 2016. He is an assistant professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network situational awareness, knowledge graph and data mining.
