Abstract
Peer-to-peer (P2P) traffic classification based on flow statistics has been shown to detect P2P traffic accurately. However, the performance of a machine learning classifier depends on the quality and recency of its training dataset, so on-line P2P traffic classification requires these limitations to be removed. In this paper, automated training dataset generation for on-line P2P traffic classification is proposed to allow frequent classifier retraining. A two-stage training dataset generator (TSTDG) combines a 3-class heuristic classification and a 3-class statistical classification to generate a training dataset automatically. In the heuristic stage, traffic is classified as P2P, non-P2P, or unknown. In the statistical stage, a dual Decision Tree is built from the dataset generated in the heuristic stage to reduce the amount of traffic classified as unknown. The final training dataset is generated from all flows that are classified in these two stages. The proposed system has been evaluated on traces captured from a campus network. The overall results show that the TSTDG can generate an accurate training dataset by classifying around 94 % of total flows with high accuracy (98.59 %) and a low false positive rate (1.27 %).
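The three figures reported above (share of flows classified, accuracy, and false positive rate) can be illustrated with a minimal sketch. The abstract does not state the exact metric definitions, so the ones used here are assumptions: coverage is the fraction of flows not left as unknown, accuracy is computed over classified flows only, and the false positive rate is the fraction of classified non-P2P flows labelled as P2P. The function name `evaluate` and the label strings are hypothetical.

```python
def evaluate(pairs):
    """Compute (coverage, accuracy, false_positive_rate).

    pairs: iterable of (true_label, predicted_label), where true_label is
    "P2P" or "non-P2P" and predicted_label may also be "unknown".
    The metric definitions are assumptions, not taken from the paper.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no flows to evaluate")

    # Flows that the generator actually assigned to P2P or non-P2P.
    classified = [(t, p) for t, p in pairs if p != "unknown"]
    coverage = len(classified) / len(pairs)

    # Accuracy over classified flows only (assumed definition).
    correct = sum(1 for t, p in classified if t == p)
    accuracy = correct / len(classified) if classified else 0.0

    # False positive rate: non-P2P flows wrongly labelled as P2P.
    negatives = [(t, p) for t, p in classified if t == "non-P2P"]
    false_pos = sum(1 for t, p in negatives if p == "P2P")
    fpr = false_pos / len(negatives) if negatives else 0.0

    return coverage, accuracy, fpr


sample = [
    ("P2P", "P2P"),
    ("P2P", "P2P"),
    ("non-P2P", "non-P2P"),
    ("non-P2P", "P2P"),
    ("P2P", "unknown"),
]
coverage, accuracy, fpr = evaluate(sample)
```

On this toy input, four of five flows are classified, three of those four are correct, and one of the two classified non-P2P flows is a false positive.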




Notes
Flows are distinguished based on [Source IP, Destination IP, Source Port, Destination Port, Protocol].
J48 is the open source Java implementation of the C4.5 algorithm in Weka.
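The flow definition in the first note can be sketched as grouping packets by their 5-tuple. This is a minimal illustration, not the authors' implementation: the packet dictionaries, their field names, and the helper `group_into_flows` are hypothetical, and treating the two directions of a connection as separate flows is an assumption.

```python
from collections import defaultdict, namedtuple

# The 5-tuple that identifies a flow, as listed in the note above.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port protocol")


def group_into_flows(packets):
    """Group packet records (dicts with hypothetical field names) by 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = FlowKey(
            pkt["src_ip"],
            pkt["dst_ip"],
            pkt["src_port"],
            pkt["dst_port"],
            pkt["protocol"],
        )
        flows[key].append(pkt)
    return dict(flows)


packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 5000,
     "dst_port": 6881, "protocol": "TCP", "size": 60},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 5000,
     "dst_port": 6881, "protocol": "TCP", "size": 1500},
    # Reverse direction: a different key under the assumption stated above.
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 6881,
     "dst_port": 5000, "protocol": "TCP", "size": 52},
]
flows = group_into_flows(packets)
```

Per-flow statistics, such as packet counts or mean packet size, could then be computed from each list of packets.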
Acknowledgments
The work was done when the first author was with the Faculty of Electrical Engineering, Universiti Teknologi Malaysia.
Cite this article
Zarei, R., Monemi, A. & Marsono, M.N. Automated Dataset Generation for Training Peer-to-Peer Machine Learning Classifiers. J Netw Syst Manage 23, 89–110 (2015). https://doi.org/10.1007/s10922-013-9279-z
DOI: https://doi.org/10.1007/s10922-013-9279-z