A nonparametric approach to the automated protocol fingerprint inference

https://doi.org/10.1016/j.jnca.2017.10.009

Abstract

Protocol fingerprints are sets of byte subsequences within packet payloads that can distinguish individual application protocols. They play an important role in deep packet analysis for traffic normalization and network management. In this paper, we propose ProPrint, a network trace-based protocol fingerprint inference system. In ProPrint, we first build a protocol language model based on a modified nonparametric Bayesian statistical model. Second, we use this protocol language model to identify field boundaries in packet payloads, so that we can segment each payload into a set of protocol feature words according to its hidden structure. Third, we propose a ranking algorithm that selects true protocol fingerprints from the candidate protocol feature words. We evaluate ProPrint on real-world network traces and compare it to two existing state-of-the-art solutions, ProWord and Securitas. The experimental results show that ProPrint achieves a higher F-measure than ProWord and Securitas for online application classification.

Introduction

This paper concerns the automatic inference of protocol fingerprints from the packet payload of protocol traces. We define protocol fingerprints as a set of unique byte subsequences that can be used to distinguish the traces of a protocol from mixed Internet traffic. Protocol fingerprint inference is a fundamental problem for a variety of networking and security services, such as network management, network-based Intrusion Detection and Prevention Systems (IDSes/IPSes), traffic classification, and measurement (Luo and Yu, 2013, Najam et al., 2015, Reviriego et al., 2014, Rubio-Largo et al., 2014, Wang et al., 2012). For example, protocol fingerprints help Internet Service Providers (ISPs) provide a better service experience for Internet end users. In practice, with fine-grained protocol fingerprints, ISPs can gain an in-depth understanding of the protocol traffic passing through their networks, so that they can apply different strategies (e.g., higher or lower priority) and appropriate policies to the protocol traffic of interest.

Prior work on protocol fingerprint inference falls into two categories: (1) executable code-based approaches and (2) network trace-based approaches. In this paper, we focus on automated protocol fingerprint inference based on the packet payload of protocol traces, so executable code-based approaches are beyond the scope of this paper. Many network trace-based approaches have been proposed for the deep understanding of protocol traffic, such as Discoverer (Cui et al., 2007), KISS (La Mantia et al., 2010, Finamore et al., 2010), Veritas (Wang et al., 2011), ProDecoder (Wang et al., 2012), and ProWord (Zhang et al., 2014). ProWord, proposed by Zhang et al., is the most recent and relevant work, and it is an elegant system for network trace-based protocol fingerprint inference. To infer protocol fingerprints, ProWord identifies possible word boundaries in protocol traces using entropy and then selects the most likely boundary positions for word partitioning. With the recognized word boundaries, ProWord partitions the payload of protocol traces into a set of candidate protocol words. The extracted protocol words are regarded as the building rules for deep packet analysis. However, when reconstructing the hidden structure of packet payload, ProWord ignores the temporal coherence of candidate protocol words in protocol messages. Taking the message “250 OK” as an example, the protocol word “250” is usually followed by “OK” in network communications, and both words together form a protocol fingerprint of SMTP. ProWord, however, often breaks the fingerprint “250 OK” into two separate parts, “250” and “OK”, each of which is an incomplete protocol fingerprint for SMTP. The main reason is that ProWord depends only on the entropy information of protocol messages to construct protocol models, and thus misses the opportunity to exploit temporal information among protocol words.
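The temporal coherence in the “250 OK” example can be made concrete with a small sketch. The Python code below is an illustration only, not part of ProPrint or ProWord: it estimates how often one word is followed by another in a handful of invented SMTP-like replies. The function name `next_word_probabilities`, the whitespace tokenization, and the toy messages are all assumptions made for this example.

```python
from collections import Counter


def next_word_probabilities(messages):
    """Estimate P(next word | current word) from whitespace-tokenized messages."""
    pair_counts = Counter()   # how often (current, following) occurs
    first_counts = Counter()  # how often `current` is followed by any word
    for message in messages:
        words = message.split()
        for current, following in zip(words, words[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {
        pair: count / first_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical SMTP-like replies, invented for illustration only.
toy_messages = [
    "250 OK",
    "250 OK",
    "250 OK",
    "250 Accepted",
    "221 Bye",
]

probabilities = next_word_probabilities(toy_messages)
print(probabilities[("250", "OK")])
```

In this toy data, “OK” follows “250” in three of the four replies that start with “250”. A model that tracks such word-to-word dependencies has evidence for treating “250 OK” as one unit, whereas a purely per-position entropy criterion does not use this evidence.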

In this paper, we propose ProPrint, a nonparametric and unsupervised approach that performs automated protocol fingerprint inference from the network traces of application protocols. Prior work in this field can be roughly classified into two categories: 1) statistics-oriented methods and 2) string-oriented methods. ProPrint belongs to the second category, string-oriented methods. The input to ProPrint is the packet payload of a given protocol, and the output of ProPrint is the set of protocol fingerprints for that protocol. ProPrint is based on the key insight that application protocols can be regarded as a kind of formatted language for application software, so that we can leverage statistical language models for robust and accurate protocol fingerprint inference. The key novelty of our work is that, when reconstructing the hidden structure of packet payload, we exploit the temporal coherence (i.e., the Markov property) of protocol words, information that is often omitted in prior work.
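To illustrate how a language model can drive payload segmentation, the following Python sketch finds the most probable split of a byte string by dynamic programming (a Viterbi-style search). It is a simplified stand-in under stated assumptions, not ProPrint's algorithm: the word probabilities in `TOY_MODEL` are invented, the fallback penalty for unseen byte strings is arbitrary, and the model scores each word independently, so it omits the dependencies between neighboring words that the approach described above exploits. The names `segment`, `toy_logprob`, and `max_word_len` are hypothetical.

```python
import math


def segment(payload, word_logprob, max_word_len=8):
    """Return the most probable split of `payload` into words (Viterbi-style search)."""
    n = len(payload)
    best_score = [-math.inf] * (n + 1)  # best log-probability of payload[:end]
    best_start = [0] * (n + 1)          # start index of the last word in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best_score[start] + word_logprob(payload[start:end])
            if score > best_score[end]:
                best_score[end] = score
                best_start[end] = start
    # Walk the back-pointers from the end of the payload to recover the words.
    words = []
    end = n
    while end > 0:
        start = best_start[end]
        words.append(payload[start:end])
        end = start
    return words[::-1]


# Hypothetical word probabilities, invented for illustration only.
TOY_MODEL = {b"250": 0.3, b" ": 0.3, b"OK": 0.2, b"\r\n": 0.2}


def toy_logprob(word):
    """Log-probability of a word; unseen byte strings pay a fixed per-byte penalty."""
    if word in TOY_MODEL:
        return math.log(TOY_MODEL[word])
    return len(word) * math.log(1e-4)


print(segment(b"250 OK\r\n", toy_logprob))
```

The per-byte fallback plays the role that a character-level model plays in a real system: every byte string receives a nonzero probability, so the search never fails on input it has not seen before.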

In practice, ProPrint does not assume any prior knowledge of protocol specifications, such as word boundaries, and it can be effectively applied to both textual and binary protocols. Therefore, ProPrint is a more robust network trace-based system for automatic protocol fingerprint inference. To verify the effectiveness of ProPrint, we conduct extensive evaluations on six real-world protocol traces. Our experimental results show that ProPrint identifies protocol traces with an average precision of 99.01% and an average recall of 93.72%. Furthermore, we compare ProPrint to two state-of-the-art solutions, ProWord and Securitas, and find that ProPrint infers protocol fingerprints more effectively than both. The main contributions are as follows:

  • We introduce a nested hierarchical Pitman-Yor process to build protocol language models. The proposed model exploits the temporal coherence of protocol words and has no “unknown words” problem. In addition, it learns protocol language models in a completely unsupervised manner, directly from the byte sequences generated by an application protocol.

  • We design and implement a lightweight tool called ProPrint, which automatically conducts trace-driven protocol fingerprint inference from the packet payload of protocol traces. As a general solution for fingerprint extraction, our technique is independent of the type of the target application protocol.

  • We conduct extensive experimental evaluations on six stateful protocols: SMTP, FTP, SopCast, BitTorrent, PPStream and PPLive. In addition, we compare our approach to two existing state-of-the-art systems, ProWord and Securitas. Our evaluation results show that ProPrint achieves a higher F-measure than ProWord and Securitas.
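For readers unfamiliar with the model named in the first contribution, the predictive rule of a hierarchical Pitman-Yor language model, as commonly stated in the general literature, can be written as below. This is background given under assumed notation, not a formula quoted from this paper: $c(w \mid h)$ is the count of word $w$ after context $h$, $t_{hw}$ is the number of tables serving $w$ in the Chinese-restaurant representation of context $h$, $c(h) = \sum_{w} c(w \mid h)$, $t_{h\cdot} = \sum_{w} t_{hw}$, $d$ and $\theta$ are the discount and strength parameters, and $h'$ is the context shortened by one word.

```latex
p(w \mid h) \;=\; \frac{c(w \mid h) - d\, t_{hw}}{\theta + c(h)}
\;+\; \frac{\theta + d\, t_{h\cdot}}{\theta + c(h)} \; p(w \mid h')
```

The first term discounts the observed counts, and the second term redistributes the discounted mass to the shorter context $h'$. In a nested model of this kind, the recursion at the word level typically bottoms out in a character-level model of the same form (here, the characters would be bytes), which is what allows any byte string to receive a nonzero probability.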

The remainder of this paper is organized as follows. In Section 2, we introduce the related work on protocol fingerprint inference. Section 3 gives an overview of our proposed system, ProPrint. In Sections 4 (Protocol language modeling), 5 (Message segmentation), and 6 (Keyword ranking), we present the technical details of each module of ProPrint. Next, in Section 7, we evaluate the whole system with the packet traces of different application protocols. We compare the experimental results of ProPrint to state-of-the-art algorithms in Section 8. Finally, we conclude the whole work in Section 9.

Section snippets

Related work

Protocol fingerprint inference from packet payload is very important for application protocol identification, as the inferred fingerprints can be used as the building blocks of payload-oriented protocol classifiers. Prior work in this field can be roughly classified into two categories: 1) statistics-oriented methods and 2) string-oriented methods. Next, we give a brief introduction and discussion about the prior methods for the aforementioned two

General framework of ProPrint

In this paper, we propose ProPrint, a nonparametric and unsupervised approach that automatically infers application protocol fingerprints from the payload of protocol traces. Fig. 1 shows the architecture of ProPrint. ProPrint has three major modules: Language Modeling, Message Segmentation, and Keyword Ranking. The input to ProPrint is a set of packet traces that belong to the same application protocol, and the output of ProPrint is the protocol fingerprints identified by ProPrint for the

Basics

Protocol Language Modeling is the first module of ProPrint, and it builds robust language models for a given protocol. The input to this module is the packet payload of a given protocol, and the output of this module is the language models for the target application protocol. In this section, we first present a novel approach for protocol language modeling, called nested hierarchical Pitman-Yor process language models (abbr. NPYLM) (Walter et al., 2013), which is able to cope with an

Message segmentation

Given a collection D of unsegmented character sequences (i.e., byte sequences), we aim to segment each packet payload of the input into a set of protocol feature words according to the hidden structure information. The output of this message segmentation module is a set of candidate protocol feature words. The technical challenge in the module is to recognize protocol word boundaries. In order to alleviate this problem, a feasible way for the unsupervised message segmentation is to combine

Keyword ranking

We define protocol fingerprints as protocol features for deep packet analysis. Recall that the previous message segmentation module may extract hundreds of candidate protocol words with the NPYLM+Segmentation algorithm. However, some candidate protocol words are incomplete or are not true protocol fingerprints. For example, byte “0x53” may be a candidate protocol word for SMTP. However, byte “0x53” itself is incomplete and it cannot be

Experimental setup

In order to quantitatively measure the effectiveness of ProPrint in protocol fingerprint inference, we conduct extensive trace-driven evaluations with widely used, stateful application protocols. We collect our traffic traces from a backbone router of a major Internet Service Provider (ISP) that offers diverse network services on the Internet, and select six typical protocols: SMTP, FTP, PPLive, SopCast, BitTorrent and PPStream. Notice that the above-mentioned protocols can be

Comparison to existing algorithms

In this part of the evaluation, we compare the experimental results of ProPrint to existing state-of-the-art solutions, ProWord (Zhang et al., 2014, Zhang et al., 2014) and Securitas (Yun et al., 2016), on automated protocol fingerprint inference. Notice that ProPrint, ProWord and Securitas are all tools for extracting protocol fingerprints by deep packet analysis from the payload of protocol traces. In addition, as aforementioned, ProWord is a string-oriented approach, and Securitas is a

Conclusion

In this paper, we propose a network trace-based protocol fingerprint inference system, ProPrint, which takes network packet traces as input and automatically infers the fingerprints of the application traces. Our method builds on a modified nonparametric Bayesian statistical model that discovers the temporal structure of protocol words. We have implemented and extensively evaluated our protocol fingerprint inference system with both textual and binary protocols. We also compare ProPrint to two

References (24)

  • Mochihashi, D., Yamada, T., Ueda, N., Bayesian unsupervised word segmentation with nested pitman-yor language...
  • nDPI, 2014. available: 〈http://www.ntop.org/products/ndpi/〉(Open and Extensible GPLv3 Deep Packet Inspection...

Cited by (8)

    • Protocol Reverse-Engineering Methods and Tools: A Survey

      2022, Computer Communications
      Citation excerpt:

      In other words, fields with different types change differently over specific sub-collections. The tools that use n-grams include PRISMA [42], FieldHunter [6], Li et al. [7], ProHacker [45], ProPrint [8], Esoul et al. [9], and Luo et al. [10]. Li et al. [43] first establishes a hidden semi-Markov model (HSMM) [50] for optimal segmentation and estimates the model parameters by using a sample set of message sequences transmitted by unknown application layer protocols during a network session.

    • Khaos: An Adversarial Neural Network DGA with High Anti-Detection Ability

      2020, IEEE Transactions on Information Forensics and Security
    • CCGA: Clustering and capturing group activities for DGA-based botnets detection

      2019, Proceedings - 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering, TrustCom/BigDataSE 2019
    • Detecting Domain Generation Algorithms with Convolutional Neural Language Models

      2018, Proceedings - 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications and 12th IEEE International Conference on Big Data Science and Engineering, Trustcom/BigDataSE 2018

    Yipeng Wang is an Associate Professor with the Institute of Information Engineering, Chinese Academy of Sciences (CAS), China. He received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences (CAS). His research interests are in networking, security and machine learning, in particular network protocol inference. He has published more than 25 research papers in refereed international journals and conferences, such as IEEE/ACM Transactions on Networking (ToN), IEEE International Conference on Network Protocols (ICNP), International Conference on Applied Cryptography and Network Security (ACNS). Dr. Wang won the Best Paper Award at ICNP 2012.

    Xiaochun Yun is a Professor with the Institute of Information Engineering, Chinese Academy of Sciences (CAS), China. He received the B.S. and Ph.D. degrees from the Harbin Institute of Technology, China, in 1993 and 1998, respectively. His research interests include network and information security. He is a member of the IEEE.

    Yongzheng Zhang received the B.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2001 and 2006, respectively. He is a Professor and Ph.D. Supervisor with the Institute of Information Engineering, Chinese Academy of Sciences (CAS), Beijing, China. His research interests include network security, particularly cyberspace security situational awareness. Prof. Zhang was honored with the first prize of the Chinese National Award for Science and Technology Progress in 2011.

    Liwei Chen received the Ph.D. degree in computer science from the Chinese Academy of Sciences (CAS), Beijing, China, in 2014, and the B.S. degree from the Department of Automation, Tsinghua University, Beijing, China, in 2008. He is an Assistant Professor with the Institute of Information Engineering, CAS. He has published several research papers in refereed international journals and conferences, such as Multimedia Tools and Applications, the IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC) and the International Conference on Networking, Architecture and Storage (NAS). His research interests include computer architecture, information security, video coding and VLSI design.

    Guangjun Wu received the master's and doctoral degrees in computer science from the Harbin Institute of Technology, China, in 2006 and 2010, respectively. He is currently a senior engineer at the Institute of Information Engineering, Chinese Academy of Sciences, China. His research interests include big data analysis, distributed storage, and information security.

    This work was supported by the National Natural Science Foundation of China under Grant No. 61402472.
