Elsevier

Computer Networks

Volume 132, 26 February 2018, Pages 81-98
An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification

https://doi.org/10.1016/j.comnet.2018.01.007

Abstract

Substantial recent effort has gone into applying Machine Learning (ML) techniques to flow statistical features for traffic classification. However, the classification performance of ML techniques is severely degraded by the high dimensionality and redundancy of flow statistical features, the imbalance in the number of traffic flows, and the concept drift of Internet traffic. To address these problems comprehensively, this paper proposes a new feature optimization approach based on deep learning and Feature Selection (FS) techniques that provides optimal and robust features for traffic classification. First, symmetric uncertainty is used to remove the irrelevant features from network traffic data sets. Then, a feature generation model based on deep learning is applied to the remaining relevant features for dimensionality reduction and feature generation. Finally, Weighted Symmetric Uncertainty (WSU) is used to select the optimal features by removing the redundant ones. Experimental results on real traffic traces show that the proposed approach not only efficiently reduces the dimension of the feature space, but also overcomes the negative impacts of the multi-class imbalance and concept drift problems on ML techniques. Furthermore, compared with the approaches used in previous works, the proposed approach achieves the best classification performance and relatively higher runtime performance.

Introduction

Accurate classification of Internet traffic is the basis of many network management tasks [1], [2], including Quality of Service (QoS) control, intrusion detection and diagnostic monitoring. Traditional traffic classification approaches are based on examining the 16-bit port numbers in the transport layer header or inspecting the signature information in the packet payloads [3]. These approaches have proved to be inefficient because they encounter problems such as dynamic port numbers, data encryption and user privacy protection.

Due to the limitations of traditional traffic classification approaches, many research papers [4], [5], [6], [7], [8], [9], [10], [11] have been dedicated to classifying traffic by applying ML techniques to flow statistical features. Although they have made significant achievements, classifying Internet traffic with ML techniques is still a daunting task, as the high redundancy of flow statistical features greatly degrades the accuracy and efficiency of ML classifiers [12]. To solve this problem, FS techniques [13] can play an effective role in reducing the dimensionality of flow statistical features and removing irrelevant and redundant features. However, despite the vast number of FS methods proposed in the literature [1], [14], [15], [16], searching for the optimal features with FS methods remains a challenge because: (1) FS techniques search for an optimal subset using different evaluation criteria, so the selected subset may be only a local optimum; (2) most FS techniques have been developed to improve classification accuracy by removing redundant features, but neglect the stability of the optimal subset under variations in the traffic data; (3) FS techniques cannot capture the complex dependencies across all flow statistical features, which have a great impact on traffic classification. Thus, one of the key challenges is to provide optimal and robust features for traffic classification.

Another key challenge for traffic classification is the multi-class imbalance problem, which causes ML algorithms to suffer from low recall for the minority classes. Many approaches have been proposed to address this problem, and they fall mainly into two categories: resampling approaches and cost-sensitive approaches. Resampling approaches balance the class distribution of a data set by under-sampling the majority classes or by over-sampling the minority classes. Because these approaches change the original class distributions, they have been criticized in the literature [18], [19]. Cost-sensitive approaches address the multi-class imbalance problem by adjusting the costs associated with misclassification. However, obtaining accurate misclassification costs is difficult, and different misclassification costs may lead to different induction results. Furthermore, Chen and Wasikowski [20] showed that the multi-class imbalance problem can hardly be addressed well by resampling or cost-sensitive approaches if the feature space is high dimensional. In recent years, FS techniques have attracted attention for handling the multi-class imbalance problem [21], [22], [23]. Nevertheless, most of them do not consider the relation between features and class distributions.
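
As an illustration of the resampling category only, the following minimal sketch balances a toy labelled flow set by random over-sampling or random under-sampling. The function name `random_resample`, its parameters and the toy data are hypothetical and are not taken from the paper or from the compared approaches.

```python
import random
from collections import Counter, defaultdict


def random_resample(flows, labels, mode="over", seed=0):
    """Balance classes by random over- or under-sampling (illustrative only).

    mode="over":  duplicate randomly chosen minority flows until every class
                  is as large as the largest class.
    mode="under": keep a random subset of each class so that every class is
                  as small as the smallest class.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group flows by their class label.
    by_class = defaultdict(list)
    for flow, label in zip(flows, labels):
        by_class[label].append(flow)

    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_flows, out_labels = [], []
    for label, group in by_class.items():
        if mode == "over":
            extra = [rng.choice(group) for _ in range(target - len(group))]
            picked = group + extra
        else:
            picked = rng.sample(group, target)
        out_flows.extend(picked)
        out_labels.extend([label] * target)
    return out_flows, out_labels


# Toy data (assumed): 8 majority flows and 2 minority flows.
flows = [[i] for i in range(10)]
labels = ["WWW"] * 8 + ["P2P"] * 2

_, over_labels = random_resample(flows, labels, mode="over")
_, under_labels = random_resample(flows, labels, mode="under")
print(Counter(over_labels))   # both classes now have 8 flows
print(Counter(under_labels))  # both classes now have 2 flows
```

In both modes the class distribution seen by the classifier no longer matches the original one, which is the property the criticism above refers to.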

Concept drift of Internet traffic is the third key challenge for traffic classification. Due to the evolution of network technologies and changes in user activities and management strategies, Internet traffic and its underlying class distribution change dynamically over time. For example, the percentage of P2P traffic at night is always higher than during the day. Furthermore, the emergence of new P2P applications leads to changes in flow statistical features. To retain high traffic classification performance, the ML classifier should be periodically updated with the latest traffic data. This problem, known as concept drift, is inevitable for ML based traffic classification [25]. Although many methods [25], [35]–[37] have been proposed to handle the dynamic nature of Internet traffic, it is hard to precisely determine the update period or to promptly detect the occurrence of concept drift, especially when the traffic data are multi-class imbalanced. Furthermore, these methods increase the computational complexity and time cost of the classification system. Therefore, it is necessary to find robust features among the original flow statistical features to overcome concept drift. Unfortunately, most existing FS techniques neglect the insensitivity of the output to variations in the flow statistical features.

To comprehensively address the three challenges mentioned above, we propose an Efficient Feature Optimization Approach (EFOA) based on deep learning and FS techniques. Fig. 1 provides an overview of the implementation of EFOA for traffic classification in practice. The main contributions of this paper are as follows:

  • 1.

    A novel feature optimization approach called EFOA is proposed to provide optimal and robust features for Internet traffic classification. To this end, deep learning and FS techniques are used to generate robust and discriminative features and to search for the optimal ones, respectively. EFOA proceeds in three phases. The first phase evaluates the correlation between each original flow statistical feature and the class, using symmetric uncertainty as the correlation measure, and removes the irrelevant features. In the second phase, the retained relevant features are passed to a feature generation model, which generates robust and discriminative features of smaller dimension by capturing the dependency among the features. The model is based on Deep Belief Networks (DBNs) and is constructed by unsupervised learning followed by fine-tuning. The third phase searches for the optimal features by removing the redundant ones; WSU is used in this phase to select the features that are conducive to classifying the minority classes. Thus, the feature set output by EFOA not only has a smaller dimension but can also handle the multi-class imbalance and concept drift problems. To the best of our knowledge, this is the first time that a feature optimization approach has been successfully used to handle the high redundancy of flow statistical features, multi-class imbalance and concept drift comprehensively in traffic classification.

  • 2.

    A series of experiments is conducted on the Cambridge and UNIBS traffic data sets to evaluate the classification performance and runtime performance of the proposed approach. We compare the proposed approach with six approaches proposed in the recent literature: Weighted Symmetric Uncertainty Area Under ROC Curve (WSU_AUC) [19], Global Optimization Approach (GOA) [24], random over-sampling [26], random under-sampling [26], cost-sensitive learning based on MetaCost [26] and Per Concept Drift Detection (PCDD) [25]. Flow Overall Accuracy (OA), byte OA, flow g-mean and byte g-mean are used as metrics to evaluate the classification performance of each approach. Experimental results show that on the Cambridge data sets, EFOA achieves relatively high flow OA (very close to the best one) and the best byte OA and flow g-mean, and on the UNIBS data sets, EFOA achieves the best flow OA, byte OA and flow g-mean. However, the byte g-mean achieved by EFOA on both the Cambridge and UNIBS data sets is relatively low. These results demonstrate that, on the one hand, EFOA achieves better performance improvements for traffic classification than the other approaches, but on the other hand, EFOA does not adequately consider the byte information of traffic flows. In terms of runtime performance, EFOA spends much more time on preprocessing, but relatively less time on training and testing. This suggests that EFOA can effectively reduce the dimension of the original feature sets and thereby decrease the time consumed by training and testing. In conclusion, the experimental results demonstrate that EFOA achieves relatively high classification performance and runtime performance.
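
To make the relevance analysis of the first phase concrete, the sketch below computes symmetric uncertainty in its standard form, SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)), and keeps the features whose SU with the class exceeds a threshold. It assumes that the features have already been discretized; the helper names, the threshold value and the toy data are hypothetical, and the sketch does not reproduce EFOA's actual implementation or the WSU measure.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant: no information to share
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)


def relevant_features(columns, labels, threshold=0.1):
    """Names of features whose SU with the class label exceeds `threshold`.

    `threshold` is an assumed illustrative value, not one from the paper.
    """
    return [name for name, col in columns.items()
            if symmetric_uncertainty(col, labels) > threshold]


# Toy, already-discretized features (assumed data).
labels = ["WWW", "WWW", "P2P", "P2P"]
columns = {
    "pkt_size_bin": [0, 0, 1, 1],  # perfectly aligned with the class
    "noise": [0, 1, 0, 1],         # independent of the class
}
print(relevant_features(columns, labels))
```

Because SU normalizes mutual information by the two entropies, it compensates for the bias of information gain toward features with many distinct values, which makes a single threshold usable across features of different cardinality.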

This paper is organized as follows. Section 2 reviews the related work on traffic classification. In Section 3, we present our feature optimization approach. Section 4 details the traffic data sets and the evaluation metrics. Section 5 evaluates the performance of our approach by comparing it with existing approaches in traffic classification. Finally, Section 6 presents our concluding remarks.

Section snippets

Related work

Many methods, such as FS methods [11], [13], [24], [27], resampling methods [18], [30], [31], cost-sensitive methods [26], [32], [33] and concept drift detection methods [25], [36], [37], have been proposed to handle feature redundancy, multi-class imbalance and concept drift problems. To describe the role of these methods in traffic classification intuitively, Fig. 2 presents a flow diagram that illustrates the implementation of these methods in an ML based traffic classification system. The

Methodology

In this section, we propose a new feature optimization approach, called EFOA, which not only efficiently removes the irrelevant and redundant features but also provides optimal and robust features to overcome the multi-class imbalance and concept drift problems. Fig. 3 presents the schematic diagram of the proposed approach, which is composed of three phases: first, the relevance analysis phase selects the subset of relevant features using an FS method; second, the feature generation phase exploits a feature

Network traffic data sets

To provide a quantitative performance evaluation, two real world traffic traces (Cambridge and UNIBS) are used in our experiments. The Cambridge traffic traces are published by the Computer Laboratory at the University of Cambridge [42]. They are widely accepted traffic traces for the evaluation and comparison of traffic classification methods. The Cambridge traffic traces were collected on the Genome Campus network in August 2003. These traffic traces have ten separate data sets and each of

Experimental results

The main purpose of this section is to demonstrate the effectiveness of EFOA by comparing it with previous works. Our experiments proceed in two phases. Firstly, we evaluate the performance of EFOA in traffic classification on the Cambridge data sets. In particular, we systematically and comprehensively examine the ability of EFOA to handle feature redundancy, multi-class imbalance and concept drift problems. Secondly, we validate the effectiveness of EFOA on the UNIBS data sets and demonstrate the

Conclusion

In this study, we comprehensively examined the negative impacts of high redundancy of flow statistical features, multi-class imbalance and concept drift problems on ML based traffic classification and proposed a new feature optimization approach, called EFOA, based on deep learning and FS techniques to handle the three problems. The proposed approach involves (1) removing irrelevant features by using symmetric uncertainty to measure feature-class correlation, (2) applying RFGM to generate

References (46)

  • T. Auld et al.

    Bayesian neural networks for Internet traffic classification

    IEEE Trans. Neural Netw.

    (2007)
  • F. Hao et al.

    Fast dynamic multiple-set membership testing using combinatorial bloom filters

    IEEE/ACM Trans. Netw. (TON)

    (2012)
  • A. Moore et al.

    Toward the accurate identification of network applications

    Passive and Active Network Measurement

    (2005)
  • H. Kim et al.

    Internet traffic classification demystified: myths, caveats, and the best practices

  • T.T.T. Nguyen et al.

    A survey of techniques for internet traffic classification using machine learning

    IEEE Commun. Surv. Tutor.

    (2008)
  • Y. Jin et al.

    A modular machine learning system for flow-level traffic classification in large networks

    ACM Trans. Knowl. Discov. Data

    (2012)
  • J. Zhang et al.

    Internet traffic classification by aggregating correlated naive bayes predictions

    IEEE Trans. Inf. Forensics Secur.

    (2013)
  • J. Han et al.

    Concepts and Techniques

    (2006)
  • A.W. Moore et al.

    Internet traffic classification using Bayesian analysis techniques

  • N. Williams et al.

    A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification

    ACM SIGCOMM Comput. Commun. Rev.

    (2006)
  • R. Yuan et al.

    An SVM-based machine learning method for accurate internet traffic classification

    Inf. Syst. Front.

    (2010)
  • X. Chen et al.

    FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems

  • Y. Lim et al.

    Internet traffic classification demystified: on the sources of the discriminative power

    Hongtao Shi is currently a Ph.D. candidate in College of Information Science and Engineering at Ocean University of China. He received his B.S. in Computer Sciences from Chang'an University, and his M.S. in Computer Application from Ocean University of China. His research interests include feature selection, deep learning and network security.

    Hongping Li is currently a professor in College of Information Science and Engineering at Ocean University of China. He received his Ph.D. in Computer Sciences from University of Oklahoma. His research interests include machine learning, feature selection and traffic management.

    Dan Zhang is currently a graduate student in College of Information Science and Engineering at Ocean University of China. Her research interests include data mining, machine learning and big data.

    Chaqiu Cheng is currently a graduate student in College of Information Science and Engineering at Ocean University of China. His research interests include data mining, statistical analysis and machine learning.

    Xuanxuan Cao is currently a graduate student in College of Information Science and Engineering at Ocean University of China. His research interests include network engineering, statistical analysis and traffic classification.
