CETAnalytics: Comprehensive effective traffic information analytics for encrypted traffic classification
Introduction
Identifying traffic types is a fundamental capability in network research. Many applications depend on the traffic type to provide more advanced services. However, with the recent widespread adoption of application traffic encryption, traffic identification must now cope with unknown content, which is especially challenging for security operations built on traditional methods.
Among the mature methods used in industry, traditional deep packet inspection (DPI) was once considered the most reliable method for traffic identification because it analyzes the packet content rather than making judgments based only on port numbers or behavior rules [30], [43], [44]. However, since traditional DPI is mainly implemented as searches for specific string patterns or as regular expression matching, it can hardly be adapted to encrypted content [52]. Snort [38] and Bro [36] are the two most popular mature open source intrusion detection tools deployed in current operational environments. With the widespread adoption of traffic encryption, these traditional DPI products become less effective because the payload cannot be interpreted without the session key.
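The limitation of signature matching on encrypted content can be illustrated with a minimal Python sketch. The signature, the sample request, and the `toy_stream_cipher` helper are assumptions introduced for illustration only; the toy cipher merely stands in for real encryption, is not secure, and does not reflect how Snort or Bro are implemented.

```python
import hashlib
import re

# Hypothetical DPI-style signature: an HTTP/1.x GET request line.
SIGNATURE = re.compile(rb"GET /\S* HTTP/1\.[01]")


def toy_stream_cipher(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream (illustration only, not secure)."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    # zip() stops at len(data), so the surplus keystream bytes are ignored.
    return bytes(b ^ k for b, k in zip(data, stream))


def signature_matches(payload: bytes) -> bool:
    """Return True if the payload contains the signature pattern."""
    return SIGNATURE.search(payload) is not None


plaintext = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
ciphertext = toy_stream_cipher(plaintext, key=b"session-key")

print(signature_matches(plaintext))   # checks the cleartext payload
print(signature_matches(ciphertext))  # checks the encrypted payload
```

Because the cipher is a plain XOR, applying `toy_stream_cipher` again with the same key recovers the request, which mirrors the point that the pattern only becomes visible again to a party holding the session key.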
To address this challenge, existing studies on encrypted traffic classification mainly adopt machine learning and deep learning for their strong abstraction and modeling ability. According to the information used as input, previous studies can be roughly divided into two main categories: payload based studies, which work in the content dimension, and statistics based studies, which work in the time dimension [2], [45]. Generally, payload based studies first splice the traffic content and then apply different methods, most often deep learning, to classify the traffic types. Statistics based studies first generate flow or session statistics and then use machine learning models to identify encrypted traffic types.
Although these previous studies achieve relatively high performance, they have apparent limitations: (i) most previous approaches can hardly achieve balanced performance in both high precision and robust generalization. One reason is that they do not integrate statistics and payload for encrypted traffic type identification. Studies in either category therefore lose the other part of the information and do not make efficient use of the raw traffic [1], [4], [5], [7], [11], [17], [31], [46], [48], [49], [53], [54]. Integrating the information from both dimensions is better than taking only one aspect, because different applications may behave similarly in one aspect. For example, email applications and chat applications behave similarly in statistics, since both perform small-quantity transmissions within a period; however, they differ in their content. Conversely, file transfer applications and torrent applications both transfer file content, but they differ in statistics, because their transmissions in the two directions are different. (ii) Leaked elements are taken into consideration for evaluation, which makes the results of these methods less convincing [17], [21], [31], [45], [49], [54]. Especially in payload based studies, the packet headers under the transmission layer are taken as part of the determinant [17], [31], [49], [54]. Although the underlying protocols are components of the traffic, they are designed for transmission control rather than application type identification. Besides, the messages contained within the protocol part are too few to provide enough discriminative information among applications. Thus, they can hardly provide genuine differences for encrypted traffic identification.
Nevertheless, some protocol fields, such as the address, port and Time to Live (TTL), can appear to identify applications because the traffic of each application in a dataset was generated in the same network environment. These fields are informative enough to differentiate the traffic in datasets even though they do not reflect real differences among applications. Therefore, it is not reliable to use the content of these underlying protocols for classification.
To tackle the challenges mentioned above, the CETAnalytics framework is proposed, which adopts Comprehensive Effective Traffic Information (CETI) for encrypted application traffic classification. First, to make efficient use of the raw traffic, the integrated information generated from two dimensions, namely the combination of payload and statistics, is adopted as the comprehensive information of the traffic. Second, to achieve more convincing results, both dimensions are generated from the payload as the effective information of the traffic, which reduces the influence of the underlying protocols. Specifically, the traffic payload organized in a packet-session structure and the payload statistical characteristics (PSC) are adopted as the determinants of the application traffic from the content dimension and the time dimension, respectively. Based on the CETI specified, a brief overview of the classification framework CETAnalytics is provided.
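As a minimal sketch of what payload-only input could look like, the Python snippet below builds a fixed-size packet-session byte grid and a handful of payload statistics from one session. The `Packet` tuple layout, the function names, the grid sizes, and the particular statistics are hypothetical choices made for illustration; the actual PSC feature set and payload format are those defined later in the paper, not these.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Assumed layout: (timestamp in seconds, direction +1 or -1, payload bytes).
# No address, port, or TTL field is kept, so header information cannot leak.
Packet = Tuple[float, int, bytes]


def packet_session_matrix(packets: List[Packet], max_packets: int,
                          max_bytes: int) -> List[List[int]]:
    """Truncate or zero-pad payloads into a max_packets x max_bytes grid."""
    payloads = [p for _, _, p in packets if p][:max_packets]
    rows = [list(p[:max_bytes]) + [0] * max(0, max_bytes - len(p))
            for p in payloads]
    rows += [[0] * max_bytes for _ in range(max_packets - len(rows))]
    return rows


def payload_statistics(packets: List[Packet]) -> Dict[str, float]:
    """Illustrative statistics computed only from payload-carrying packets."""
    data = [(t, d, len(p)) for t, d, p in packets if p]
    if not data:
        return {"count": 0.0, "fwd_bytes": 0.0, "bwd_bytes": 0.0,
                "mean_len": 0.0, "mean_gap": 0.0}
    times = [t for t, _, _ in data]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "count": float(len(data)),
        "fwd_bytes": float(sum(n for _, d, n in data if d > 0)),
        "bwd_bytes": float(sum(n for _, d, n in data if d < 0)),
        "mean_len": float(mean(n for _, _, n in data)),
        "mean_gap": float(mean(gaps)) if gaps else 0.0,
    }


session = [(0.0, 1, b"abc"), (0.1, -1, b""), (0.5, -1, b"0123456789")]
print(packet_session_matrix(session, max_packets=3, max_bytes=4))
print(payload_statistics(session))
```

Packets with an empty payload, such as bare acknowledgements, are skipped in both outputs so that the two views describe the same payload-carrying packets.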
As shown in Fig. 1, the CETAnalytics framework mainly includes three procedures carried out by four modules. The first procedure, which consists of one module, preprocesses the raw traffic and outputs the payload and PSC as formatted CETI. The analytics procedure follows, which consists of two parallel modules: the payload analytics module and the PSC analytics module. Since payload and PSC are uncorrelated, the two analytics modules are also independent. The last procedure, which consists of one module, is the classification procedure that synthesizes the analytics results and makes predictions. Generally, the input and classification modules are routine parts of machine learning architectures without much innovation in their mechanisms. The highlights of the CETAnalytics framework are the two analytics modules that perform the encrypted payload analytics and the PSC analytics, whose implementations are described in Sections 3.4 and 3.5.
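The data flow through the four modules can be sketched in a few lines of Python. Everything below is a structural illustration under stated assumptions: the function names are hypothetical, the toy modules are trivial placeholders rather than the neural-network components of the paper, and fusing the two branch outputs by concatenation is an assumed way of "synthesizing" them.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def cetanalytics_forward(
    raw: List[bytes],
    preprocess: Callable[[List[bytes]], Tuple[List[bytes], Vector]],
    payload_module: Callable[[List[bytes]], Vector],
    psc_module: Callable[[Vector], Vector],
    classifier: Callable[[Vector], int],
) -> int:
    """Run preprocessing, two independent analytics branches, then classify."""
    payload, psc = preprocess(raw)          # procedure 1: formatted CETI
    payload_repr = payload_module(payload)  # procedure 2a: payload analytics
    psc_repr = psc_module(psc)              # procedure 2b: PSC analytics
    return classifier(payload_repr + psc_repr)  # procedure 3: assumed concat


# Toy stand-ins so the sketch runs end to end (not the paper's modules).
def toy_preprocess(raw: List[bytes]) -> Tuple[List[bytes], Vector]:
    payloads = [p for p in raw if p]
    psc = [float(len(payloads)), float(sum(len(p) for p in payloads))]
    return payloads, psc


def toy_payload_module(payloads: List[bytes]) -> Vector:
    data = b"".join(payloads)
    return [sum(data) / (255.0 * len(data))] if data else [0.0]


def toy_psc_module(psc: Vector) -> Vector:
    return [x / (1.0 + x) for x in psc]


def toy_classifier(features: Vector) -> int:
    return int(sum(features) > 1.5)


label = cetanalytics_forward(
    [b"\xff\xff", b"", b"\xff"],
    toy_preprocess, toy_payload_module, toy_psc_module, toy_classifier,
)
print(label)
```

Neither branch reads the other's output, which reflects the independence of the two analytics modules; only the classifier sees both representations.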
Although the proposed framework can achieve more accurate and more convincing results in theory, challenges remain for its implementation: (i) model training efficiency. Unlike methods that take only one aspect for classification, CETAnalytics separates the analytics procedure from the classification procedure, which in general requires two training procedures to train the whole model. Thus, the implementation should be more integrated to avoid the unnecessary time cost of this gap; (ii) an effective method for encrypted payload processing. Specific features such as string patterns can no longer be captured in the encrypted context. Although acquiring the session key or the customized protocol format would allow the payload to be decrypted, this is not practical for a large number of sessions.
In the framework implementation, these two challenges are tackled as follows: (i) a neural network is used to implement the whole framework. The neural network is the basis of deep learning and is flexible enough to build different functional parts. Hence, the modules can be designed separately yet combined into a single model that uses the neural network as its base cell. Compared with implementing CETAnalytics with separate machine learning methods, this integrated approach requires only one training episode for both the analytics procedure and the classification procedure, so CETAnalytics can be implemented efficiently; (ii) a substructure network named Attract, which uses several deep learning techniques, is proposed for encrypted payload processing. Attract is designed around the traffic structure. Since traffic is a hierarchical structure composed of sessions, packets and bytes, which is similar to natural language, Attract is also designed as a hierarchical structure that matches the traffic structure, following [33], [41]. In addition, Attract can be used either as the payload analytics module of the whole model or as a single independent model.
The main contributions of this paper are as follows:
- Specify the CETI from two dimensions and propose the CETAnalytics framework for encrypted traffic classification.
- Propose Attract, based on the traffic structure, as the implementation of the payload analytics module of the CETAnalytics framework to perform the encrypted payload analytics.
- Conduct thorough experiments on the ISCX VPN2016 datasets. The experimental results demonstrate (i) the effectiveness of the CETAnalytics framework idea; (ii) the balanced performance achieved by our implementation in both high precision and robust generalization.
The rest of this paper is organized as follows. Section 2 summarizes application traffic encryption technologies and related work on encrypted traffic classification. Section 3 describes the CETAnalytics framework and its implementation details. Section 4 describes the experiment setup. Section 5 reports the experimental evaluation results and analysis. Section 6 discusses the experimental results and the limitations of our approach. Section 7 concludes the paper and discusses further studies that can be undertaken.
Overview of encryption methods
Application traffic encryption methods can be divided into two main categories [49]: (i) protocol encapsulation. Usually, protocol encapsulation integrates a standard encryption algorithm, such as AES, to protect the communication content. One widely adopted protocol encapsulation technology is SSL/TLS. This technology works on the presentation layer of the OSI/ISO model and is used especially to protect web applications. Another widely adopted technology
Design and implementation
In this section, the workflow of the CETAnalytics framework, which uses the CETI to achieve encrypted traffic classification, is introduced first. Although the framework can theoretically perform better, it requires an implementation for evaluation. Thus, an implementation overview is provided to illustrate the function of each module, and the implementation details of each module within the CETAnalytics framework are then elaborated to clearly show how we tackled the efficiency
Experiment preparation
In this section, the experiment preparation is described, including the experiment environment, the dataset, and the evaluation framework.
Experimental evaluation
In this section, the investigation and the evaluation against other baselines are presented. First, the baselines are introduced, including the payload based and flow based approaches. Then four experiments are carried out to comprehensively evaluate our approach. First, the model parameter selection experiment is conducted, based on task I, so that our approach reaches its due performance in the further experiments. Second, the external comparison experiment is conducted to show the
Discussion
In this section, the experimental results and the limitations of our implementation in real environments are discussed.
The experimental results show the advantages of the implementation of CETAnalytics; however, it remains to be explained why such outstanding performance is achievable on encrypted traffic, given the uniform distribution of ciphertext bytes assured by standard cryptography [12], [16]. The results are explained from two perspectives: (i) not all applications adopt protocol
Conclusion
As traffic analytics research has been widely pursued over the past decade, the proposed framework CETAnalytics and its implementation for encrypted traffic classification are introduced in this paper. The proposed method integrates content dimensional analytics and time dimensional analytics into CETI analytics. In its implementation, the neural network is adopted as the base cell to implement the whole framework for its flexible modular combination and powerful representation learning abilities. The
CRediT authorship contribution statement
Cong Dong: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Chen Zhang: Software, Validation, Investigation. Zhigang Lu: Resources, Methodology. Baoxu Liu: Formal analysis, Writing - review & editing. Bo Jiang: Resources, Formal analysis.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This research is supported by the National Key Research and Development Program of China (No. 2019QY1303, No. 2018YFB0803602), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02040100), and the National Natural Science Foundation of China (No. 61702508, No. 61802404). This work is also supported by the Program of Key Laboratory of Network Assessment Technology, the Chinese Academy of Sciences; Program of Beijing Key Laboratory of Network Security and Protection
References (54)
- et al. Mimetic: mobile encrypted traffic classification using multimodal deep learning. Comput. Netw. (2019)
- et al. Can encrypted traffic be identified without port numbers, ip addresses and payload inspection? Comput. Netw. (2011)
- et al. Automatic protocol reverse-engineering: message format extraction and field semantics inference. Comput. Netw. (2013)
- et al. Ensemble network traffic classification: algorithm comparison and novel ensemble scheme proposal. Comput. Netw. (2017)
- et al. Ensemble learning for data stream analysis: a survey. Inf. Fusion (2017)
- et al. Intrusion detection system: a comprehensive review. J. Netw. Comput. Appl. (2013)
- Bro: a system for detecting network intruders in real-time. Comput. Netw. (1999)
- et al. Mobile encrypted traffic classification using deep learning: experimental evaluation, lessons learned, and challenges. IEEE Trans. Netw. Serv. Manage. (2019)
- et al. Deepdocclassifier: document classification with deep convolutional neural network. 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (2015)
- et al. How robust can a machine learning approach be for classifying encrypted voip? J. Netw. Syst. Manag. (2015)
- Identifying encrypted malware traffic with contextual flow data. Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security
- Realtime classification for encrypted traffic. International Symposium on Experimental Algorithms
- Dispatcher: enabling active botnet infiltration using automatic protocol reverse-engineering. Proceedings of the 16th ACM Conference on Computer and Communications Security
- isanon: flow-based anonymity network traffic identification using extreme gradient boosting. 2019 International Joint Conference on Neural Networks (IJCNN)
- Stream ciphers: a practical solution for efficient homomorphic-ciphertext compression. J. Cryptol.
- Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res.
- Encrypted traffic identification based on n-gram entropy and cumulative sum test. Proceedings of the 13th International Conference on Future Internet Technologies
- A practical public key cryptosystem provably secure against adaptive chosen ciphertext attack. Annual International Cryptology Conference
- A session-packets-based encrypted traffic classification using capsule neural networks. 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
- Discoverer: Automatic protocol reverse engineering from network traces. USENIX Security Symposium
- Information-theoretic metric learning. Proceedings of the 24th international conference on Machine learning
- Traffic identification engine: an open platform for traffic classification. IEEE Netw.
- Characterization of encrypted and vpn traffic using time-related features. Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016)
Cong Dong received the B.S. degree from Tianjin University in 2017. He is currently pursuing the Ph.D. degree with the Institute of Information Engineering, University of Chinese Academy of Sciences, Beijing, China. His research interests include machine learning and network security.
Chen Zhang received the M.S. degree from China University of Geosciences in 2016. He is an engineer at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network security and network situational awareness.
Zhigang Lu received the Ph.D. degree from the Chinese Academy of Sciences in 2010. He is an assistant professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network security and network situational awareness.
Baoxu Liu received the Ph.D. degree from the Chinese Academy of Sciences in 2002. He is a professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network security, network situational awareness and threat intelligence.
Bo Jiang received the Ph.D. degree from the Chinese Academy of Sciences in 2016. He is an assistant professor at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include network situational awareness, knowledge graph and data mining.