Extending labeled mobile network traffic data by three levels traffic identification fusion

doi:10.1016/j.future.2018.05.079

Future Generation Computer Systems

Volume 88, November 2018, Pages 453-466

https://doi.org/10.1016/j.future.2018.05.079

Highlights

• A method named ELD is devised to automatically extend the labeled mobile traffic data.
• Payload distribution inspection (PDI) is proposed to identify the label of unknown traffic on packet payload level.
• Bi-direction payload distribution pattern (BPDP) is presented to reflect the byte distribution pattern of a flow.
• ELD outperforms existing works (nDPI and Libprotoident) on labeling mobile network traffic.

Abstract

Mobile traffic classification is critically important for network management decisions such as traffic shaping and traffic pricing. Labeled traffic data are a prerequisite for evaluating classification performance. However, existing works mostly acquire labeled traffic in a simulated environment, for example by running one specific app at a time on mobile devices and collecting its traffic. This approach is slow and does not scale. This paper devises a scheme to automatically link the ground truth to mobile traffic. An initial set of labeled traffic data is first collected on the monitored mobile devices by our previously presented mobilegt (a system to collect mobile traffic and build the ground truth). However, this traffic is limited to the monitored nodes. Therefore, we present a method named ELD (Extending Labeled Data) that identifies the label of new, unknown mobile traffic, so as to extend the labeled mobile traffic data. ELD divides traffic identification into three levels: packet header, packet payload and flow statistics. The three levels are implemented by ServerTag, payload distribution inspection and Random Forest, respectively. ELD is able to identify mobile traffic with encrypted payload. The cross-validation results show that ELD achieves 99% flow accuracy and 95.4% byte accuracy on average when the flow and byte completeness are 86.5% and 65.5%, respectively. The results also show that ELD outperforms existing works, nDPI and Libprotoident, on labeling mobile network traffic.

Introduction

Recent years have witnessed the increasing popularity of mobile apps such as WhatsApp and WeChat. Mobile apps bring many conveniences, such as video chatting and web searching anywhere and at any time. With the widespread use of mobile apps, a large amount of mixed traffic is generated every day, which makes mobile networks difficult to manage. This mixed traffic also contains much valuable information. Mobile traffic classification is helpful for business and mobile network management. For example, it can help app providers understand which mobile apps are popular [1]; it can also help network managers learn which apps consume more network bandwidth, so that they can allocate mobile network bandwidth appropriately. However, a key task in classification algorithm research is to collect labeled mobile traffic data.

Methods for collecting labeled traffic data generally fall into the following three categories. The first is to collect raw traffic traces on a live cellular network [2]. The raw traffic can then be labeled by DPI-based and port-based techniques [3]. DPI (deep packet inspection) methods generally rely on a signature repository for matching protocols [4], [5], [6]. However, traditional signatures (e.g. the regular expressions of L7-filter [4]) may not be suitable for mobile traffic, and DPI-based techniques cannot handle encrypted traffic. The second is to capture the traffic of one app at a time by manually running it on mobile devices [7]. However, such traffic does not come from a real usage context and cannot be used to analyze the characteristics of mixed traffic, since only one app is running when the traffic is collected. This approach is also slow and does not scale. The third is to deploy a special app on monitored mobile devices to collect the socket information that records the association between user apps and active sessions [8]. The mobile traffic of the monitored devices is collected and then labeled according to the socket information. However, the collected traffic is limited to the monitored devices.

To overcome the above problems, this paper presents a method that extends the labeled data on the basis of the initial labeled traffic data collected in the third way mentioned above. The main contributions of this paper are as follows.

(1) Mobilegt [8] was deployed to acquire the initial labeled traffic data. It retrieves the app name for each flow from the socket information recorded on mobile devices. Mobilegt can obtain 100% accuracy on labeling mobile traffic, but the labeled mobile traffic is limited to the monitored devices.

(2) To extend the labeled traffic data, a cascade method named ELD (Extending Labeled Data) is presented to identify the traffic class (label) of unknown traffic. ELD divides traffic identification into three levels: packet header, packet payload and flow statistics. The three levels are implemented by ServerTag, payload distribution inspection (PDI) and a machine learning technique (Random Forest [9] is used here), respectively. At the learning stage, on the initial labeled traffic data, the association between traffic class and server IP address is recorded for ServerTag; the payload distribution patterns of each traffic class are automatically extracted for PDI; and samples characterized by flow statistical features are used to train a classification model. At the traffic identification stage, the flows with recorded servers are first predicted by ServerTag, and then the flows close to the extracted payload distribution patterns are identified by PDI. Finally, all remaining unknown traffic is fed into the classification model, and a flow with a high score is labeled with the predicted class.

(3) Experiments are carried out on 30 mobile traffic datasets, which include widely used mobile apps. ELD is compared against existing works on labeling mobile traffic, i.e., nDPI [10] and Libprotoident [11]. Results show that ELD outperforms the others in terms of flow accuracy and byte accuracy.
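The cascade described in contribution (2) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Flow` fields, the function names, the thresholds and the two toy scorers are all assumptions introduced here. In particular, the toy scorers only stand in for PDI and for a trained Random Forest, whose details are not given in this excerpt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass
class Flow:
    """Hypothetical flow record; the field names are illustrative only."""
    server_ip: str
    payload: bytes
    stats: Sequence[float]


# A stage returns a label, or None to pass the flow on to the next level.
Stage = Callable[[Flow], Optional[str]]
Scorer = Callable[[Flow], Tuple[str, float]]


def make_server_tag(ip_to_label: Dict[str, str]) -> Stage:
    """Header level: look up the server IP recorded at the learning stage."""
    def stage(flow: Flow) -> Optional[str]:
        return ip_to_label.get(flow.server_ip)
    return stage


def make_scored_stage(scorer: Scorer, threshold: float) -> Stage:
    """Wrap a scorer so that it abstains when its score is below threshold."""
    def stage(flow: Flow) -> Optional[str]:
        label, score = scorer(flow)
        return label if score >= threshold else None
    return stage


def cascade_label(
    flow: Flow, stages: List[Tuple[str, Stage]]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (label, name of the deciding stage), or (None, None)."""
    for name, stage in stages:
        label = stage(flow)
        if label is not None:
            return label, name
    return None, None


def toy_pdi_scorer(flow: Flow) -> Tuple[str, float]:
    # Assumption: a stand-in for payload distribution inspection.
    return ("Browser", 0.9 if flow.payload.startswith(b"\x16\x03") else 0.1)


def toy_stats_scorer(flow: Flow) -> Tuple[str, float]:
    # Assumption: a stand-in for a trained flow-statistics classifier.
    return ("Youku", 0.8 if flow.stats and flow.stats[0] > 1000 else 0.2)


stages: List[Tuple[str, Stage]] = [
    ("ServerTag", make_server_tag({"10.0.0.1": "WeChat"})),
    ("PDI", make_scored_stage(toy_pdi_scorer, threshold=0.5)),
    ("FlowStats", make_scored_stage(toy_stats_scorer, threshold=0.5)),
]
```

Calling `cascade_label(flow, stages)` tries each level in order and stops at the first one that returns a label. A flow that no level is confident about stays unlabeled, which is one plausible reason why completeness can fall below 100% while accuracy stays high.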

The rest of this paper is organized as follows. Section 2 overviews related work on mobile traffic classification and mobile traffic collection. Section 3 devises a scheme to acquire labeled mobile traffic data and presents a method named ELD to extend the labeled mobile traffic data. Section 4 describes experimental data and performance evaluation metrics. Section 5 carries out experiments to compare ELD against existing works and discusses experimental results. Section 6 concludes this paper.
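The flow and byte metrics quoted in the abstract can be illustrated with a short sketch. The definitions used here are assumptions, since this excerpt does not spell them out: completeness is taken as the share of flows (or bytes) that receive any label, and accuracy as the share of those labeled flows (or bytes) whose label matches the ground truth. The function name and the record layout are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# Each record: (ground-truth label, predicted label or None, byte count).
Record = Tuple[str, Optional[str], int]


def labeling_metrics(records: Sequence[Record]) -> Dict[str, float]:
    """Compute flow/byte completeness and accuracy under the assumed definitions."""
    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    total_flows = len(records)
    total_bytes = sum(size for _, _, size in records)

    # Flows that received some label from the labeling method.
    labeled = [r for r in records if r[1] is not None]
    labeled_bytes = sum(size for _, _, size in labeled)

    # Labeled flows whose predicted label agrees with the ground truth.
    correct = [r for r in labeled if r[0] == r[1]]
    correct_bytes = sum(size for _, _, size in correct)

    return {
        "flow_completeness": ratio(len(labeled), total_flows),
        "byte_completeness": ratio(labeled_bytes, total_bytes),
        "flow_accuracy": ratio(len(correct), len(labeled)),
        "byte_accuracy": ratio(correct_bytes, labeled_bytes),
    }
```

Under these definitions, a method can trade completeness for accuracy by leaving low-confidence flows unlabeled, and byte-level figures can differ markedly from flow-level figures when a few large flows dominate the traffic volume.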

Section snippets

Mobile traffic classification methods

Existing mobile traffic classification methods can be grouped into the following three levels: packet header, packet payload and flow statistics.

At the packet header level, the port number was used to discriminate network traffic in the early stage of network traffic classification, but port-number-based techniques have become ineffective owing to the widespread use of dynamic port numbers [12]. In mobile networks, HTTP traffic classification attracts great attention as HTTP is popularly used by mobile

Labeled mobile traffic

The initial labeled mobile traffic data were obtained by running our mobilegt [8] system.

Mobile traffic data

Our experimental mobile traffic was collected by deploying the mgtClient on volunteers’ smartphones and the mgtServer on a remote server. Volunteers used apps as usual, such as WeChat and Weibo. The socket information was collected on each volunteer’s smartphone while mgtClient was running, and the mobile traffic traces of the monitored nodes were captured and labeled by mgtServer. Six volunteers agreed to use mgtClient to share their data from June 2016 to October 2016. Among those

Payload distribution visualization

In ELD, the PDI predicts the label of mobile traffic at the packet payload level. This section visualizes the out-direction payload distribution of three popularly used apps in our traffic data: Browser, WeChat and Youku. The in-direction payload distribution gives similar results.

We randomly sample 300 flows from the training and testing sets for each app, and the visualization results on these flows are shown in Fig. 11. The payload data are broken down into 2-grams, and the value range of each 2-gram is
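As a rough illustration of breaking payload data into 2-grams, the sketch below computes a normalized 2-gram frequency distribution for one payload. It rests on assumptions not confirmed by this excerpt: the 2-grams are taken with a sliding window of step one byte, and each 2-gram is encoded as a single integer with the first byte as the high-order byte. The function name is hypothetical, and this is not the paper's PDI procedure.

```python
from collections import Counter
from typing import Dict


def two_gram_distribution(payload: bytes) -> Dict[int, float]:
    """Map each observed byte 2-gram to its relative frequency in payload."""
    if len(payload) < 2:
        return {}
    # Assumption: overlapping 2-grams, first byte as the high-order byte.
    counts = Counter(
        (payload[i] << 8) | payload[i + 1] for i in range(len(payload) - 1)
    )
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}
```

Distributions built this way for the out-direction and in-direction payloads of a flow could then be compared against per-class patterns with any histogram distance; the choice of distance is left open here.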

Conclusion

This paper addresses the problem of collecting labeled data for mobile network traffic classification. Conventional methods either perform poorly on labeling encrypted traffic or are limited to traffic from a simulated network environment. This paper presents a method named ELD that automatically enlarges the scale of labeled traffic on the basis of initial labeled data. It is able to label encrypted traffic and traffic without payload. ELD includes three models, respectively,

Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61501128, financial support from the China Scholarship Council, the Guangdong Provincial Natural fund project, China (Nos. 2017A030313345, 2016A030310300, 2014A03031358), the Specialized Fund for the Basic Research Operating expenses Program of Central College (No. x2rj/D2174870), and Guangdong Province Youth Innovation Talent Project (

Zhen Liu received the Ph.D. degree from the School of Computer Science and Technology of South China University of Technology, China, in 2013. She received her Bachelor’s degree from Department of Computer Science and Technology of South West University, China in 2008. She is now a Lecturer in the School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China. She is a member of CCF (China Computer Federation). She serves as a reviewer of Neurocomputing and

References (32)

Bujlow T. et al., Independent comparison of popular DPI tools for traffic classification, Comput. Netw. (2015)
Liu Z. et al., A class-oriented feature selection approach for multi-class imbalanced network traffic datasets based on local and global metrics fusion, Neurocomputing (2015)
Fu Y. et al., Service usage classification with encrypted Internet traffic in mobile messaging apps, IEEE Trans. Mob. Comput. (2016)
Naboulsi D. et al., Large-scale mobile traffic analysis: a survey, IEEE Commun. Surv. Tutor. (2016)
Yun X. et al., A semantics-aware approach to the automated network protocol identification, IEEE/ACM Trans. Netw. (2015)
l7-filter [Online], available:...
Tongaonkar A., A look at the mobile app identification landscape, IEEE Internet Comput. (2016)
Zhang J. et al., Robust network traffic classification, IEEE/ACM Trans. Netw. (2015)
Conti M. et al., Analyzing android encrypted network traffic to identify user actions, IEEE Trans. Inf. Forensics Secur. (2016)
Z. Liu, R.Y. Wang, D.Y. Tang, et al., A system for linking ground truth to mobile network traffic, in: Proc....
Breiman L., Random forests, Mach. Learn. (2001)
L. Deri, M. Martinelli, T. Bujlow, A. Cardigliano, nDPI: open-source high-speed deep packet inspection, in: Proceedings...
S. Alcock, R. Nelson, Libprotoident: traffic classification using lightweight packet inspection, Tech. rep., University...
Nguyen T.T.T. et al., A survey of techniques for internet traffic classification using machine learning, IEEE Commun. Surv. Tutor. (2008)
G. Ranjan, A. Tongaonkar, R. Torres, Approximate matching of persistent lexicon using search-engines for classifying...
P. Casas, P. Fiadino, A. Bar, IP mining: extracting knowledge from the dynamics of the Internet addressing space, in:...

Ruoyu Wang received the Ph.D. degree from the School of Computer Science and Engineering, South China University of Technology, China, in 2015. He is now an engineer at the Information and Network Engineering and Research Center, South China University of Technology, China. He is a member of CCF (China Computer Federation). He serves as a reviewer of Applied Soft Computing and ISA Transactions. His research interests are in the areas of machine learning and complex networks.

Deyu Tang received the Ph.D. degree from the School of Computer Science and Technology of South China University of Technology, China, in 2015. He is now an associate professor in the School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China. He serves as a reviewer of Information Science and Applied Soft Computing. His research interests are in the areas of swarm intelligence and machine learning.
