Extending labeled mobile network traffic data by three levels traffic identification fusion
Introduction
Recent years have witnessed the growing popularity of mobile apps such as WhatsApp and WeChat. Mobile apps bring much convenience, such as video chatting and web searching anywhere and at any time. With the widespread use of mobile apps, a large amount of mixed traffic is generated every day, which makes mobile networks difficult to manage. This mixed traffic also contains much valuable information. Mobile traffic classification is helpful for business and mobile network management. For example, it can help app providers understand which mobile apps are popular [1]; it can also help network managers know which apps consume the most network bandwidth, so that they can allocate mobile network bandwidth appropriately. However, a key prerequisite for research on classification algorithms is the collection of labeled mobile traffic data.
Methods for collecting labeled traffic data generally fall into three categories. The first is to collect raw traffic traces on a live cellular network [2]. The raw traffic can then be labeled by DPI-based and port-based techniques [3]. DPI (deep packet inspection) methods generally rely on a signature repository for matching protocols [4], [5], [6]. However, traditional signatures (e.g., the regular expressions of L7-filter [4]) may not be suitable for mobile traffic, and DPI-based techniques cannot handle encrypted traffic. The second is to capture the traffic of one app at a time by manually running it on mobile devices [7]. However, such traffic does not come from a real usage context and cannot be used to analyze the characteristics of mixed traffic, since only one app is running while the traffic is collected. This approach is also slow and does not scale. The third is to deploy a special app on monitored mobile devices to collect socket information that records the association between user apps and active sessions [8]. The mobile traffic of the monitored devices is collected and then labeled according to the socket information. However, the collected traffic is limited to the monitored devices.
To overcome the above problems, this paper presents a method that extends the labeled data on the basis of the initial labeled traffic data collected by the third approach described above. The main contributions of this paper are as follows.
(1) Mobilegt [8] was deployed to acquire the initial labeled traffic data. It retrieves the app name for each flow from the socket information recorded on mobile devices. Mobilegt can achieve 100% accuracy in labeling mobile traffic, but the labeled traffic is limited to the monitored devices.
(2) To extend the labeled traffic data, a cascade method named ELD (Extending Labeled Data) is presented to identify the traffic class (label) of unknown traffic. ELD performs traffic identification at three levels: packet header, packet payload, and flow statistics. The three levels are implemented by ServerTag, payload distribution inspection (PDI), and a machine learning technique (Random Forest [9] is used here), respectively. In the learning stage, using the initial labeled traffic data, the association between traffic class and server IP address is recorded for ServerTag; the payload distribution patterns of each traffic class are automatically extracted for PDI; and samples characterized by flow statistical features are used to train a classification model. In the traffic identification stage, flows with recorded servers are first predicted by ServerTag, and then flows close to the extracted payload distribution patterns are identified by PDI. Finally, all remaining unknown traffic is fed into the classification model, and a flow with a high score is labeled with the predicted class.
(3) Experiments are carried out on 30 mobile traffic datasets, which include popular mobile apps. ELD is compared against existing tools for labeling mobile traffic, i.e., nDPI [10] and Libprotoident [11]. Results show that ELD outperforms them in terms of flow accuracy and byte accuracy.
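The three-level cascade described in contribution (2) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Flow` fields, the function name `eld_label`, the callables `pdi_match` and `model_predict`, and the score threshold `min_score` are all hypothetical names and values chosen for illustration. The paper uses a Random Forest at the third level; here any callable returning a (label, score) pair stands in for it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Flow:
    server_ip: str         # used at the packet header level
    payload: bytes         # used at the packet payload level
    features: List[float]  # used at the flow statistics level


def eld_label(
    flow: Flow,
    server_table: Dict[str, str],
    pdi_match: Callable[[bytes], Optional[str]],
    model_predict: Callable[[List[float]], Tuple[str, float]],
    min_score: float = 0.9,  # assumed threshold, not taken from the paper
) -> Tuple[Optional[str], str]:
    """Return (label, level); label is None when no level is confident."""
    # Level 1 (ServerTag): the server IP was seen with a class during learning.
    label = server_table.get(flow.server_ip)
    if label is not None:
        return label, "ServerTag"
    # Level 2 (PDI): the payload is close to an extracted distribution pattern.
    label = pdi_match(flow.payload)
    if label is not None:
        return label, "PDI"
    # Level 3 (classifier): keep only predictions with a high score.
    label, score = model_predict(flow.features)
    if score >= min_score:
        return label, "ML"
    return None, "unlabeled"


# Toy stand-ins for the learned components (purely illustrative).
server_table = {"203.0.113.7": "WeChat"}


def pdi_match(payload: bytes) -> Optional[str]:
    return "Browser" if payload.startswith(b"GET ") else None


def model_predict(features: List[float]) -> Tuple[str, float]:
    return ("Youku", 0.95) if features[0] > 1000 else ("Youku", 0.4)


print(eld_label(Flow("203.0.113.7", b"", [10.0]),
                server_table, pdi_match, model_predict))
```

The ordering reflects the cascade in the text: cheaper, more specific evidence is consulted first, and flows that no level can label with confidence stay unlabeled rather than receiving a low-score guess.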
The rest of this paper is organized as follows. Section 2 overviews related work on mobile traffic classification and mobile traffic collection. Section 3 devises a scheme to acquire labeled mobile traffic data and presents a method named ELD to extend the labeled mobile traffic data. Section 4 describes experimental data and performance evaluation metrics. Section 5 carries out experiments to compare ELD against existing works and discusses experimental results. Section 6 concludes this paper.
Section snippets
Mobile traffic classification methods
Existing mobile traffic classification methods can be summarized at the following three levels: packet header, packet payload, and flow statistics.
At the packet header level, port numbers were used to discriminate network traffic in the early stage of network traffic classification, but port-number-based techniques have become ineffective because of the widespread use of dynamic port numbers [12]. In mobile networks, HTTP traffic classification attracts great attention, as HTTP is popularly used by mobile
Labeled mobile traffic
The initial labeled mobile traffic data were obtained by running our mobilegt [8] system.
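As a rough illustration of labeling from socket information, the sketch below joins captured flows with socket records that associate a connection with an app name. The record layout (`FlowKey` as a 5-tuple) and the function `label_flows` are assumptions made for this example; the actual record format used by mobilegt is not described in this text.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying one session."""
    proto: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def label_flows(
    flows: Iterable[FlowKey],
    socket_log: Dict[FlowKey, str],
) -> List[Tuple[FlowKey, Optional[str]]]:
    """Attach the app name recorded for each flow, or None if no record exists."""
    return [(flow, socket_log.get(flow)) for flow in flows]


# Socket records as they might be reported by a monitored device (illustrative).
socket_log = {
    FlowKey("tcp", "10.0.0.2", 40001, "203.0.113.7", 443): "WeChat",
}
captured = [
    FlowKey("tcp", "10.0.0.2", 40001, "203.0.113.7", 443),
    FlowKey("tcp", "10.0.0.2", 40002, "198.51.100.1", 80),
]
for flow, app in label_flows(captured, socket_log):
    print(flow.dst_ip, flow.dst_port, app)
```

Flows without a matching socket record remain unlabeled in this sketch.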
Mobile traffic data
Our experimental mobile traffic was collected by deploying mgtClient on volunteers’ smartphones and mgtServer on a remote server. Volunteers used apps such as WeChat and Weibo as usual. Socket information was collected on each volunteer’s smartphone while mgtClient was running, and the mobile traffic traces of the monitored nodes were captured and labeled by mgtServer. Six volunteers agreed to use mgtClient to share their data from June 2016 to October 2016. Among those
Payload distribution visualization
In ELD, PDI predicts the label of mobile traffic at the packet payload level. This section visualizes the out-direction payload distribution of three popular apps in our traffic data: Browser, WeChat and Youku. The in-direction payload distribution yields similar results.
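One plausible way to represent a payload distribution is a normalized histogram over byte n-grams. The sketch below assumes overlapping 2-grams of bytes, with each 2-gram encoded as a single integer (first byte in the high 8 bits). Both the encoding and the function name `two_gram_distribution` are illustrative assumptions, not details confirmed by this text.

```python
from collections import Counter
from typing import Dict


def two_gram_distribution(payload: bytes) -> Dict[int, float]:
    """Relative frequency of each overlapping byte 2-gram in the payload."""
    total = len(payload) - 1  # number of overlapping 2-grams
    if total <= 0:
        return {}  # fewer than two bytes: no 2-gram exists
    counts = Counter(
        (payload[i] << 8) | payload[i + 1] for i in range(total)
    )
    return {gram: n / total for gram, n in counts.items()}


dist = two_gram_distribution(b"GET / HTTP/1.1\r\n")
# Show the three most frequent 2-grams as hex values with their frequencies.
for gram, freq in sorted(dist.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{gram:#06x} {freq:.3f}")
```

Histograms of this kind can be averaged over the sampled flows of one app and compared across apps, which is the sort of comparison a payload distribution plot makes visible.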
We randomly sample 300 flows from the training and testing sets for each app. The visualization results for these flows are shown in Fig. 11. The payload data are broken down into 2-grams, and the value range of each 2-gram is
Conclusion
This paper addresses the problem of collecting labeled data for mobile network traffic classification. Conventional methods either perform poorly in labeling encrypted traffic or yield traffic that is limited to a simulated network environment. This paper presents a method named ELD to automatically enlarge the scale of labeled traffic on the basis of initial labeled data. It is able to label encrypted traffic and traffic without payload. ELD includes three models, respectively,
Acknowledgments
We thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61501128, financial support from the China Scholarship Council, the Guangdong Provincial Natural fund project, China (Nos. 2017A030313345, 2016A030310300, 2014A03031358), the Specialized Fund for the Basic Research Operating expenses Program of Central College (No. x2rj/D2174870), and Guangdong Province Youth Innovation Talent Project(
References (32)
- et al. Independent comparison of popular DPI tools for traffic classification. Comput. Netw. (2015)
- et al. A class-oriented feature selection approach for multi-class imbalanced network traffic datasets based on local and global metrics fusion. Neurocomputing (2015)
- et al. Service usage classification with encrypted Internet traffic in mobile messaging apps. IEEE Trans. Mob. Comput. (2016)
- et al. Large-scale mobile traffic analysis: a survey. IEEE Commun. Surv. Tutor. (2016)
- et al. A semantics-aware approach to the automated network protocol identification. IEEE/ACM Trans. Netw. (2015)
- l7-filter [Online], available:...
- A look at the mobile app identification landscape. IEEE Internet Comput. (2016)
- et al. Robust network traffic classification. IEEE/ACM Trans. Netw. (2015)
- et al. Analyzing android encrypted network traffic to identify user actions. IEEE Trans. Inf. Forensics Secur. (2016)
- Z. Liu, R.Y. Wang, D.Y. Tang, et al., A system for linking ground truth to mobile network traffic. in: Proc....
- Random forests. Mach. Learn.
- A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surv. Tutor.
Cited by (17)
Network traffic identification in packet sampling environment
2023, Digital Communications and Networks
Citation Excerpt: The network management system and the high-end routers in the flow of information statistics have all adopted the packet sampling strategy, such as Cisco's NetFlow [10], Huawei NetStream [11], Juniper's cflowd [12], as well as sflow supported by HP and Foundry companies [13] and so on. Many network traffic identification methods have been proposed [14–19], while few papers study traffic identification in sampling environments. Therefore, we can find out the packet sampling impact on the traffic identification.
Joint QoS and energy-efficient resource allocation and scheduling in 5G Network Slicing
2023, Computer Communications
A data skew-based unknown traffic classification approach for TLS applications
2023, Future Generation Computer Systems
Citation Excerpt: In the following, we will briefly introduce some representative methods on network traffic classification. ML-based methods primarily employ ML algorithms (e.g., K-Nearest Neighbor (KNN) [13], Random Forest (RF) [14], Hidden Markov Models (HMM) [15], etc.) to classify network traffic. Bar-Yanai et al. (2010) propose a real-time network traffic classification approach that combines the K-means and KNN algorithms [16].
Network traffic classification for data fusion: A survey
2021, Information Fusion
Citation Excerpt: Its granularity is at Level 2 and Level 3. Liu et al. [138] proposed a three-level classification scheme that incorporates multiple classification methods. First, the packet header is checked by the ServerTag method proposed in [139], and unknown traffic could be identified quickly.
A framework to classify heterogeneous Internet traffic with Machine Learning and Deep Learning techniques for satellite communications
2020, Computer Networks
Citation Excerpt: Statistical based features from normal and abnormal traffic are computed, and a classifier is trained for the analysis of the massive network users’ traffic behaviors. The work in [14] presents an approach to collect and label mobile IP network traces correctly. The work in [15] exposed a generic architecture of a cellular network, and the possible positions where traffic monitoring can be deployed, such as in a Packet Switched (PS) Core.
AE-DTI: An Efficient Darknet Traffic Identification Method Based on Autoencoder Improvement
2023, Applied Sciences (Switzerland)
Zhen Liu received the Ph.D. degree from the School of Computer Science and Technology of South China University of Technology, China, in 2013. She received her Bachelor’s degree from Department of Computer Science and Technology of South West University, China in 2008. She is now a Lecturer in the School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China. She is a member of CCF (China Computer Federation). She serves as a reviewer of Neurocomputing and International Journal of Communication Systems. Her research interests are in the areas of mobile network traffic classification and machine learning.
Ruoyu Wang received the Ph.D. degree from the school of Computer Science and Engineering, South China University of Technology, China in 2015. He is now an engineer at the Information and Network Engineering and Research Center, South China University of Technology, China. He is a member of CCF (China Computer Federation). He serves as a reviewer of Applied Soft Computing and ISA Transactions. His research interests are in the areas of machine learning and complex network.
Deyu Tang received the Ph.D. degree from the School of Computer Science and Technology of South China University of Technology, China, in 2015. He is now an associate professor in the School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China. He serves as a reviewer of Information Science and Applied Soft Computing. His research interests are in the areas of swarm intelligence and machine learning.