Elsevier

Future Generation Computer Systems

Volume 88, November 2018, Pages 453-466
Extending labeled mobile network traffic data by three levels traffic identification fusion

https://doi.org/10.1016/j.future.2018.05.079

Highlights

  • A method named ELD is devised to automatically extend the labeled mobile traffic data.

  • Payload distribution inspection (PDI) is proposed to identify the label of unknown traffic at the packet payload level.

  • Bi-direction payload distribution pattern (BPDP) is presented to reflect the byte distribution pattern of a flow.

  • ELD outperforms existing works (nDPI and Libprotoident) on labeling mobile network traffic.

Abstract

Mobile traffic classification is critically important for network management decisions such as traffic shaping and traffic pricing. Labeled traffic data are a prerequisite for evaluating classification performance. However, existing works mostly acquire labeled traffic in a simulated environment, for example by running a specific app individually on mobile devices to collect its traffic. This approach is slow and does not scale. This paper devises a scheme to automatically link the ground truth to mobile traffic. A set of labeled traffic data is first collected on the monitored mobile devices by our previously presented mobilegt (a system to collect mobile traffic and build the ground truth). But this traffic is limited to the monitored nodes. Therefore, we present a method named ELD (Extending Labeled Data) to identify the label of new unknown mobile traffic, so as to extend the labeled mobile traffic data. ELD divides traffic identification into three levels: packet header, packet payload and flow statistics. The identification tasks at the three levels are implemented by ServerTag, payload distribution inspection and Random Forest, respectively. ELD is able to identify mobile traffic with encrypted payload. The cross-validation results show that ELD achieves 99% flow accuracy and 95.4% byte accuracy on average when the flow and byte completeness are 86.5% and 65.5%, respectively. The results also show that ELD outperforms existing works, nDPI and Libprotoident, on labeling mobile network traffic.

Introduction

Recent years have witnessed the increasing popularity of mobile apps such as WhatsApp and WeChat. Mobile apps bring us many conveniences, such as video chatting and web searching anywhere and at any time. With the widespread use of mobile apps, a large amount of mixed traffic is generated every day, which makes mobile networks difficult to manage. This mixed traffic also contains much valuable information. Mobile traffic classification is helpful for business and mobile network management. For example, it can help app providers understand which mobile apps are popular [1]; it can also help network managers know which apps consume more network bandwidth, so that they can decide how to allocate mobile network bandwidth suitably. However, a key task in classification algorithm research is to collect labeled mobile traffic data.

Methods for collecting labeled traffic data generally fall into the following three categories. The first is to collect raw traffic traces on a live cellular network [2]. The raw traffic can be further labeled by DPI-based and port-based techniques [3]. DPI (deep packet inspection) methods generally rely on a signature repository for matching protocols [4], [5], [6]. However, traditional signatures (e.g. the regular expressions of L7-filter [4]) may not be suitable for mobile traffic, and DPI-based techniques cannot handle encrypted traffic. The second is to capture the traffic of one app at a time by manually running it on mobile devices [7]. However, such traffic does not come from a real usage context and cannot be used to analyze the characteristics of mixed traffic, since only one app is running when the traffic is collected. This approach is also slow and does not scale. The third is to deploy a special app on monitored mobile devices to collect the socket information that records the association between user apps and active sessions [8]. The mobile traffic of the monitored devices is collected and then labeled according to the socket information. But the collected traffic is limited to the monitored devices.

To overcome the above problems, this paper presents a method to extend the labeled data on the basis of the initial labeled traffic data collected by the third approach mentioned above. The main contributions of this paper are as follows.

(1) Mobilegt [8] was deployed to acquire the initial labeled traffic data. It retrieves the app name for each flow from the socket information recorded on mobile devices. Mobilegt can obtain 100% accuracy on labeling mobile traffic, but the labeled mobile traffic is limited to monitored devices.

(2) To extend the labeled traffic data, a cascade method named ELD (Extending Labeled Data) is presented to identify the traffic class (label) of unknown traffic. ELD divides traffic identification into three levels: packet header, packet payload and flow statistics. Identification at the three levels is implemented by ServerTag, payload distribution inspection (PDI) and a machine learning technique (Random Forest [9] is used here), respectively. At the learning stage, using the initial labeled traffic data, the association between traffic class and server IP address is recorded for ServerTag; the payload distribution patterns of each traffic class are automatically extracted for PDI; and samples characterized by flow statistical features are used to train a classification model. At the traffic identification stage, the flows with recorded servers are first predicted by ServerTag, and then the flows close to the extracted payload distribution patterns are identified by PDI. Finally, all remaining unknown traffic is fed into the classification model, and a flow with a high score is labeled with the predicted class.

(3) Experiments are carried out on 30 mobile traffic datasets, which include popularly used mobile apps. ELD is compared against existing works on labeling mobile traffic, i.e., nDPI [10] and Libprotoident [11]. Results show that ELD outperforms others in terms of flow accuracy and byte accuracy.
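The three-level cascade described in contribution (2) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Flow`, `learn_server_tags` and `eld_label`, the `min_score` threshold, and the choice to drop server IPs seen with more than one class are all hypothetical. The PDI matcher and the trained classifier (Random Forest in the paper) are passed in as plain callables so that the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record carrying one field per identification level."""
    server_ip: str              # packet header level
    payload: bytes              # packet payload level
    stats: Tuple[float, ...]    # flow statistic level


def learn_server_tags(labeled: Iterable[Tuple[Flow, str]]) -> Dict[str, str]:
    """Record the server IP -> traffic class association from labeled flows.

    Assumption (not from the paper): an IP observed with more than one
    class is treated as ambiguous and left out of the table.
    """
    seen: Dict[str, Set[str]] = {}
    for flow, label in labeled:
        seen.setdefault(flow.server_ip, set()).add(label)
    return {ip: next(iter(labels)) for ip, labels in seen.items() if len(labels) == 1}


def eld_label(
    flow: Flow,
    server_tags: Dict[str, str],
    pdi: Callable[[Flow], Optional[str]],
    classifier: Callable[[Flow], Tuple[str, float]],
    min_score: float = 0.9,     # hypothetical threshold for a "high score"
) -> Tuple[Optional[str], str]:
    """Return (label, level that produced it); label is None if no level is confident."""
    # Level 1: packet header, look up the recorded server.
    label = server_tags.get(flow.server_ip)
    if label is not None:
        return label, "ServerTag"
    # Level 2: packet payload, match against extracted distribution patterns.
    label = pdi(flow)
    if label is not None:
        return label, "PDI"
    # Level 3: flow statistics, accept the prediction only when the score is high.
    label, score = classifier(flow)
    if score >= min_score:
        return label, "classifier"
    return None, "unlabeled"
```

Flows that end up as `(None, "unlabeled")` are simply left without a label, which is one way a cascade like this can trade completeness for accuracy.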

The rest of this paper is organized as follows. Section 2 overviews related work on mobile traffic classification and mobile traffic collection. Section 3 devises a scheme to acquire labeled mobile traffic data and presents a method named ELD to extend the labeled mobile traffic data. Section 4 describes experimental data and performance evaluation metrics. Section 5 carries out experiments to compare ELD against existing works and discusses experimental results. Section 6 concludes this paper.

Section snippets

Mobile traffic classification methods

Existing mobile traffic classification methods can be grouped into the following three levels: packet header, packet payload and flow statistics.

At the packet header level, the port number was used to discriminate network traffic in the early stage of network traffic classification, but port-number-based techniques have become ineffective owing to the widespread use of dynamic port numbers [12]. In mobile networks, HTTP traffic classification attracts great attention as HTTP is popularly used by mobile

Labeled mobile traffic

The initial labeled mobile traffic data were obtained by running our mobilegt [8] system.
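As the Introduction notes, mobilegt retrieves the app name for each flow from socket information recorded on the device. A minimal sketch of that kind of join is shown below, assuming the socket records and the captured flows can both be keyed by a connection 5-tuple. The names `FiveTuple`, `build_socket_index` and `label_flows` are hypothetical, and the real system's record format and matching rules are not described in this text.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class FiveTuple(NamedTuple):
    """Hypothetical flow key shared by the socket log and the packet capture."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


def build_socket_index(socket_log: Iterable[Tuple[FiveTuple, str]]) -> Dict[FiveTuple, str]:
    """Index device-side socket records as connection 5-tuple -> app name.

    Assumption: a 5-tuple is not reused by another app within the capture window.
    """
    return {conn: app for conn, app in socket_log}


def label_flows(
    flows: Iterable[FiveTuple], index: Dict[FiveTuple, str]
) -> List[Tuple[FiveTuple, Optional[str]]]:
    """Attach an app name to each captured flow; None when no socket record matches."""
    return [(flow, index.get(flow)) for flow in flows]
```

Flows that come back with `None` here correspond to traffic with no matching socket record, which is the kind of unknown traffic ELD is meant to label.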

Mobile traffic data

Our experimental mobile traffic was collected by deploying the mgtClient on volunteers’ smartphones and the mgtServer on a remote server. Volunteers used apps such as WeChat and Weibo as usual. The socket information was collected on each volunteer’s smartphone while mgtClient was running, and the mobile traffic traces of the monitored nodes were captured and labeled by mgtServer. Six volunteers agreed to use mgtClient to share their data from June 2016 to October 2016. Among those

Payload distribution visualization

In ELD, PDI predicts the label of mobile traffic at the packet payload level. This section visualizes the out-direction payload distribution of three popularly used apps in our traffic data: Browser, WeChat and Youku. The in-direction payload distribution gives similar results.

We randomly sample 300 flows from the training and testing sets for each app. The visualization results for these flows are shown in Fig. 11. The payload data are broken down into 2-grams, and the value range of each 2-gram is
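To make the idea of a 2-gram payload distribution concrete, the sketch below computes the relative frequency of each 2-byte gram in a payload and compares two such distributions. It is an illustration only: the use of overlapping grams, the L1 distance as a closeness measure, and the function names `two_gram_distribution` and `l1_distance` are assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Dict


def two_gram_distribution(payload: bytes) -> Dict[bytes, float]:
    """Relative frequency of each overlapping 2-byte gram in one payload.

    Assumption: grams overlap (stride of one byte); payloads shorter than
    two bytes yield an empty distribution.
    """
    if len(payload) < 2:
        return {}
    counts = Counter(payload[i:i + 2] for i in range(len(payload) - 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def l1_distance(p: Dict[bytes, float], q: Dict[bytes, float]) -> float:
    """Sum of absolute frequency differences over all grams seen in either input."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))
```

Under these assumptions, a flow whose distribution has a small distance to the pattern extracted for a traffic class would be a candidate for that class, while two distributions with no grams in common are at the maximum distance of 2.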

Conclusion

This paper addresses the problem of collecting labeled data for mobile network traffic classification. Conventional methods suffer from poor performance on labeling encrypted traffic, or are limited to traffic from a simulated network environment. This paper presents a method named ELD to enlarge the scale of labeled traffic automatically on the basis of initial labeled data. It is able to label encrypted traffic and traffic without payload. ELD includes three models, respectively,

Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61501128, financial support from the China Scholarship Council, the Guangdong Provincial Natural fund project, China (Nos. 2017A030313345, 2016A030310300, 2014A03031358), the Specialized Fund for the Basic Research Operating expenses Program of Central College (No. x2rj/D2174870), and Guangdong Province Youth Innovation Talent Project(

References (32)

  • Bujlow T. et al., Independent comparison of popular DPI tools for traffic classification, Comput. Netw. (2015)
  • Liu Z. et al., A class-oriented feature selection approach for multi-class imbalanced network traffic datasets based on local and global metrics fusion, Neurocomputing (2015)
  • Fu Y. et al., Service usage classification with encrypted Internet traffic in mobile messaging apps, IEEE Trans. Mob. Comput. (2016)
  • Naboulsi D. et al., Large-scale mobile traffic analysis: a survey, IEEE Commun. Surv. Tutor. (2016)
  • Yun X. et al., A semantics-aware approach to the automated network protocol identification, IEEE/ACM Trans. Netw. (2015)
  • l7-filter [Online], available:...
  • Tongaonkar A., A look at the mobile app identification landscape, IEEE Internet Comput. (2016)
  • Zhang J. et al., Robust network traffic classification, IEEE/ACM Trans. Netw. (2015)
  • Conti M. et al., Analyzing android encrypted network traffic to identify user actions, IEEE Trans. Inf. Forensics Secur. (2016)
  • Z. Liu, R.Y. Wang, D.Y. Tang, et al., A system for linking ground truth to mobile network traffic, in: Proc....
  • Breiman L., Random forests, Mach. Learn. (2001)
  • L. Deri, M. Martinelli, T. Bujlow, A. Cardigliano, nDPI: open-source high-speed deep packet inspection, in: Proceedings...
  • S. Alcock, R. Nelson, Libprotoident: traffic classification using lightweight packet inspection, Tech. rep., University...
  • Nguyen T.T.T. et al., A survey of techniques for internet traffic classification using machine learning, IEEE Commun. Surv. Tutor. (2008)
  • G. Ranjan, A. Tongaonkar, R. Torres, Approximate matching of persistent lexicon using search-engines for classifying...
  • P. Casas, P. Fiadino, A. Bar, IP mining: extracting knowledge from the dynamics of the Internet addressing space, in:...

    Zhen Liu received the Ph.D. degree from the School of Computer Science and Technology of South China University of Technology, China, in 2013. She received her Bachelor’s degree from Department of Computer Science and Technology of South West University, China in 2008. She is now a Lecturer in the School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China. She is a member of CCF (China Computer Federation). She serves as a reviewer of Neurocomputing and International Journal of Communication Systems. Her research interests are in the areas of mobile network traffic classification and machine learning.

    Ruoyu Wang received the Ph.D. degree from the school of Computer Science and Engineering, South China University of Technology, China in 2015. He is now an engineer at the Information and Network Engineering and Research Center, South China University of Technology, China. He is a member of CCF (China Computer Federation). He serves as a reviewer of Applied Soft Computing and ISA Transactions. His research interests are in the areas of machine learning and complex network.

    Deyu Tang received the Ph.D. degree from the School of Computer Science and Technology of South China University of Technology, China, in 2015. He is now an associate professor in the School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China. He serves as a reviewer of Information Science and Applied Soft Computing. His research interests are in the areas of swarm intelligence and machine learning.
