DOI: 10.1145/3229543.3229548

Efficient Distribution-Derived Features for High-Speed Encrypted Flow Classification

Published: 07 August 2018

Abstract

Flow classification is an important tool to enable efficient network resource usage, support traffic engineering, and aid QoS mechanisms. As traffic is increasingly encrypted by default, flow classification is turning towards machine learning methods that employ features which remain available for encrypted traffic. In this work we evaluate flow features that capture the distributional properties of in-flow per-packet metrics such as packet size and inter-arrival time. The characteristics of such distributions are often captured with general statistical measures such as standard deviation, variance, etc. We instead propose a Kolmogorov-Smirnov discretization (KSD) algorithm that constructs histogram bins based on the distributional properties observed in the data. This allows for a richer, histogram-based representation that also requires fewer resources for feature computation than higher-order statistical moments. A comprehensive evaluation using synthetic data from Gaussian and Beta mixtures shows that the KSD approach provides Jensen-Shannon distance results surpassing those of uniform binning and probabilistic binning. An empirical evaluation using live traffic traces from a cellular network further shows that, when coupled with a random forest classifier, the KSD-constructed features improve classification performance compared to general statistical features based on higher-order moments or to alternative bin placement approaches.
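
To make the idea of histogram-based flow features concrete, the following is a minimal Python sketch. It is not the KSD algorithm, whose details the abstract does not give. It only illustrates the two baseline bin placements the abstract names, under the assumption that "probabilistic binning" can be approximated by equal-frequency (quantile) bins, together with a Jensen-Shannon distance between two binned flows. All function names and the synthetic packet sizes are hypothetical and chosen for illustration.

```python
import bisect
import math
import random


def uniform_edges(lo, hi, n_bins):
    """Interior edges of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def quantile_edges(sample, n_bins):
    """Interior edges that put roughly equal sample mass in each bin.

    Assumption: used here as a stand-in for "probabilistic binning".
    """
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


def histogram_feature(values, edges):
    """Normalized bin counts: one feature per bin, summing to 1."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [c / len(values) for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def demo():
    rng = random.Random(0)

    def flow(n, large_mean):
        # Synthetic packet sizes (bytes) from a two-component Gaussian mixture.
        sizes = []
        for _ in range(n):
            mean, sd = (80, 10) if rng.random() < 0.5 else (large_mean, 40)
            sizes.append(min(1500.0, max(40.0, rng.gauss(mean, sd))))
        return sizes

    flow_a, flow_b = flow(2000, 1300), flow(2000, 1400)
    placements = (
        ("uniform", uniform_edges(40, 1500, 8)),
        ("quantile", quantile_edges(flow_a + flow_b, 8)),
    )
    for name, edges in placements:
        fa = histogram_feature(flow_a, edges)
        fb = histogram_feature(flow_b, edges)
        print(name, round(js_distance(fa, fb), 3))


if __name__ == "__main__":
    demo()
```

Once the bin edges are fixed, each packet costs one binary search and one counter increment, which is one plausible reading of why a histogram representation can be cheaper to compute than higher-order moments. How the edges are chosen determines how much of the difference between the two flows survives binning.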

Published In

NetAI'18: Proceedings of the 2018 Workshop on Network Meets AI & ML
August 2018
86 pages
ISBN: 9781450359115
DOI: 10.1145/3229543
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Discretization
  2. Machine learning
  3. Traffic classification

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGCOMM '18
Sponsor:
SIGCOMM '18: ACM SIGCOMM 2018 Conference
August 24, 2018
Budapest, Hungary

Acceptance Rates

Overall Acceptance Rate 13 of 38 submissions, 34%

Article Metrics

  • Downloads (Last 12 months): 72
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 19 Feb 2025.

Cited By

  • (2022) Software-Defined Networking (SDN) Traffic Analysis Using Big Data Analytic Approach. Proceedings of the International Joint Conference on Science and Engineering 2022 (IJCSE 2022), 243-250. DOI: 10.2991/978-94-6463-100-5_25. Online publication date: 25-Dec-2022.
  • (2022) A Complete Review on the Application of Statistical Methods for Evaluating Internet Traffic Usage. IEEE Access 10, 128433-128455. DOI: 10.1109/ACCESS.2022.3227073. Online publication date: 2022.
  • (2020) A surrogate-assisted GA enabling high-throughput ML by optimal feature and discretization selection. Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, 1632-1640. DOI: 10.1145/3377929.3398092. Online publication date: 8-Jul-2020.
  • (2020) DIOPT: Extremely Fast Classification Using Lookups and Optimal Feature Discretization. 2020 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN48605.2020.9207037. Online publication date: Jul-2020.
  • (2020) Deep Neural Network Based Video Source Recognition. 2020 IEEE 20th International Conference on Communication Technology (ICCT), 1306-1310. DOI: 10.1109/ICCT50939.2020.9295702. Online publication date: 28-Oct-2020.
  • (2020) Timely Classification and Verification of Network Traffic Using Gaussian Mixture Models. IEEE Access 8, 91287-91302. DOI: 10.1109/ACCESS.2020.2992556. Online publication date: 2020.
  • (2019) On Runtime and Classification Performance of the Discretize-Optimize (DISCO) Classification Approach. ACM SIGMETRICS Performance Evaluation Review 46:3, 167-170. DOI: 10.1145/3308897.3308965. Online publication date: 25-Jan-2019.
  • (2019) A Survey on Big Data for Network Traffic Monitoring and Analysis. IEEE Transactions on Network and Service Management 16:3, 800-813. DOI: 10.1109/TNSM.2019.2933358. Online publication date: Sep-2019.
