1 Introduction

Fuzzing is one of the most effective techniques for finding security vulnerabilities in an application by repeatedly testing it with modified, or fuzzed, inputs. State-of-the-art fuzzing techniques fall into two main types: (1) black-box fuzzing [1] and (2) white-box fuzzing [2]. Black-box fuzzing is used to find security vulnerabilities in closed-source applications, while white-box fuzzing targets open-source applications. For proprietary protocols, whose specification and implementation code are both unavailable, black-box fuzzing is the only applicable method. Black-box fuzzing itself comes in two kinds: (1) mutation-based fuzzing and (2) generation-based fuzzing. Mutation-based fuzzing requires no knowledge of the protocol under test; it modifies an existing corpus of seed inputs to generate test cases. In contrast, generation-based fuzzing requires an input model specifying the message format of the protocol in order to generate test cases. Generation-based fuzzing has been shown to perform much better than mutation-based fuzzing [3]. However, the input model for generation-based fuzzing cannot be provided if neither the specification nor the implementation code of the protocol is available. In that case, protocol reverse engineering is required to figure out the message format of the protocol.

There have been many approaches to finding security vulnerabilities in protocol implementations. If the source code of the protocol is available, static code analysis [4, 5], white-box fuzzing [2, 6], symbolic execution [7, 8], and dynamic taint analysis [9] can help spot vulnerabilities. If the specification of the protocol is known, modern fuzzers such as Sulley [10], Peach [11] and SPIKE [12] can be used. However, if the specification and implementation code of the protocol are both unavailable, the situation is completely different: only a few methods [13, 14] can be applied for protocol vulnerability discovery. These methods provide a first solution for automatically fuzzing proprietary protocols when program analysis is impossible or hard to carry out, but they rely on variants of traditional clustering algorithms and n-gram-based approaches, which are limited to contexts of finite length.

In contrast with previous work, we make the first attempt at applying neural-network-based machine-learning techniques to black-box fuzzing of proprietary network protocols. Our method combines concepts from fuzzing with techniques from natural language processing. Specifically, we capture sufficient network traffic of an unknown protocol, then use a seq2seq model with LSTM cells to learn a generative input model that can be used to generate test cases. Finally, we use the generative model to communicate with the implementation of the unknown protocol.

The rest of the paper is organized as follows: Sect. 2 gives a brief introduction to neural-network-based machine-learning techniques. We introduce our method for black-box fuzzing of proprietary protocols in Sect. 3. Section 4 presents results of fuzzing experiments with our method. Related work is discussed in Sect. 5. We conclude in Sect. 6.

2 Preliminaries

We now give a brief introduction to neural-network-based machine-learning techniques.

2.1 Recurrent Neural Networks

Recurrent neural networks (RNNs) address the problem of information persistence, which traditional feed-forward neural networks cannot handle. They are networks with loops, operating on a variable-length input sequence (\(x_1,x_2,...,x_T\)) and maintaining a hidden state \(h_t\) and an output y.

Fig. 1. Recurrent neural network with loops

As Fig. 1 shows, a block of neural network, A, looks at some input \(x_t\) and outputs a value \(h_t\). The loop passes information from one step of the network to the next. An RNN can thus be thought of as multiple copies of the same network, each passing a message to its successor, as shown in Fig. 2.

Fig. 2. Unrolled recurrent neural network

The RNN processes the input sequence in a series of time steps. For a particular time step t, the hidden state \(h_t\) and the output \(y_t\) are given by Eqs. 1 and 2.

$$\begin{aligned} h_t = f(h_{t-1},x_t) \end{aligned}$$
(1)
$$\begin{aligned} y_t = \phi (h_t) \end{aligned}$$
(2)
Fig. 3. RNN long-term dependencies

In Eq. 1, f is a non-linear activation function, such as sigmoid or tanh, which introduces non-linearity into the network. In Eq. 2, \(\phi \) is a function such as softmax that computes the output probability distribution over a given vocabulary conditioned on the current hidden state. RNNs can learn a probability distribution over a character sequence (\(x_1,x_2,...,x_{t-1}\)) by being trained to predict the next character \(x_t\) in the sequence.
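To make Eqs. 1 and 2 concrete, here is a minimal NumPy sketch of one RNN time step; the tanh update, softmax output and all dimensions are illustrative choices, not details taken from our implementation.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy, b_h, b_y):
    """One RNN time step: Eq. 1 (hidden-state update) and Eq. 2 (output)."""
    # Eq. 1: new hidden state from previous state and current input
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    # Eq. 2: softmax probability distribution over the character vocabulary
    logits = W_hy @ h_t + b_y
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()
    return h_t, y_t

# Illustrative dimensions: vocabulary of 96 characters, 128 hidden units
V, H = 96, 128
rng = np.random.default_rng(0)
W_hh, W_xh = rng.normal(0, 0.01, (H, H)), rng.normal(0, 0.01, (H, V))
W_hy = rng.normal(0, 0.01, (V, H))
b_h, b_y = np.zeros(H), np.zeros(V)
h, x = np.zeros(H), np.eye(V)[0]   # one-hot encoding of the first character
h, y = rnn_step(h, x, W_hh, W_xh, W_hy, b_h, b_y)
```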

In theory, RNNs are perfectly capable of handling long-term dependencies, where predictions need more context. In practice, however, RNNs fail to learn such connections in cases like the one shown in Fig. 3, where the distance between the relevant information and the point where it is needed becomes very large.

2.2 Long Short-Term Memory Networks

Long short-term memory networks (LSTMs) are a special kind of RNN, explicitly designed to avoid the long-term dependency problem. They also take the form of a chain of repeating neural-network modules, but instead of a single neural-network layer, each repeating module has the structure shown in Fig. 4.

Fig. 4. LSTM repeating module with four interacting layers

The horizontal line running across the top of Fig. 4 is the cell state, which is the key to LSTMs. LSTMs are able to remove information from, or add information to, the cell state through structures called gates, each composed of a sigmoid neural-net layer and a pointwise multiplication operation.
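As an illustration of the gate mechanism, the following NumPy sketch implements one step of an LSTM cell; the weight layout (a single stacked matrix for all four layers) is an illustrative convention, not a detail from our implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: three sigmoid gates plus a tanh candidate layer."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])        # forget gate: what to erase from the cell state
    i = sigmoid(z[H:2*H])      # input gate: what to write to the cell state
    o = sigmoid(z[2*H:3*H])    # output gate: what to expose as hidden state
    g = np.tanh(z[3*H:4*H])    # candidate values
    c_t = f * c_prev + i * g   # pointwise multiplications update the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Illustrative usage with 128 hidden units and a 96-character vocabulary
H, V = 128, 96
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.01, (4 * H, H + V)), np.zeros(4 * H)
h, c = lstm_step(np.eye(V)[0], np.zeros(H), np.zeros(H), W, b)
```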

Fig. 5. Basic sequence-to-sequence architecture

2.3 Sequence to Sequence

A basic sequence-to-sequence (seq2seq) model, as introduced by Cho et al. [15], consists of two recurrent neural networks: an encoder RNN that processes a variable-length input sequence into a fixed-size state vector, and a decoder RNN that takes the fixed-size state vector and generates a variable-length output sequence. The basic architecture is depicted in Fig. 5.

Each box in Fig. 5 represents a cell of the RNN, in our method an LSTM cell. The encoder and decoder can share weights or, as is more common, use different sets of parameters. We train the seq2seq model on a corpus of network recordings, treating each message as a sequence of characters. Before training, we concatenate all the messages into a single file.
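As an illustration, a compact PyTorch sketch of such an encoder-decoder pair follows; the layer sizes match the configuration we use later (2 layers of 128 hidden states), but the remaining details are assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM compresses the input message into a fixed-size state;
    the decoder LSTM generates the output sequence from that state."""
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, src, tgt):
        # Encode the variable-length input into a fixed-size (h, c) state
        _, state = self.encoder(self.embed(src))
        # Decode, conditioned on the encoder state
        out, _ = self.decoder(self.embed(tgt), state)
        return self.proj(out)   # logits over the character vocabulary

model = Seq2Seq(vocab_size=96)
src = torch.randint(0, 96, (1, 20))   # a message as character indices
tgt = torch.randint(0, 96, (1, 20))
logits = model(src, tgt)
```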

3 Methodology

The main idea of our method is to learn a generative input model over the set of network protocol messages. We use a seq2seq model, which has proven very successful at automatic tasks such as speech recognition and machine translation. Traditional n-gram-based approaches are limited to contexts of finite length, whereas the seq2seq model can learn from arbitrary-length contexts to predict the next sequence of characters. The seq2seq model can be trained in an unsupervised mode to learn a generative input model, which is then used to generate test cases.

3.1 Training the Model

Before training the seq2seq model, we need to preprocess the corpus. First, we count the distinct characters in the corpus and sort them into a list by frequency of occurrence. Then we build a dictionary that maps each character (key) to its position in the list (value). Finally, we create a tensor file in which every character is replaced by its dictionary value. The main purpose of preprocessing is to calculate the number of batches \(N_{b}\),

$$\begin{aligned} N_{b} = \frac{S_{t}}{S_{b} * L_{s}} \end{aligned}$$
(3)

where \(S_{t}\) is the size of the tensor file, \(S_{b}\) is the size of one batch (50 by default), and \(L_{s}\) is the length of each sequence in a batch.
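A minimal sketch of this preprocessing step, assuming the messages have already been concatenated into a single corpus file; the function and variable names are illustrative.

```python
import collections
import numpy as np

def preprocess(corpus_path, batch_size=50, seq_length=50):
    """Build the frequency-ordered character vocabulary, encode the corpus
    as a tensor of indices, and compute the number of batches (Eq. 3)."""
    with open(corpus_path, "rb") as f:
        data = f.read()
    # Characters sorted by descending frequency; the rank is the dictionary value
    counts = collections.Counter(data)
    vocab = {c: i for i, (c, _) in enumerate(counts.most_common())}
    tensor = np.array([vocab[c] for c in data], dtype=np.int64)
    # Eq. 3: N_b = S_t / (S_b * L_s)
    num_batches = tensor.size // (batch_size * seq_length)
    return vocab, tensor, num_batches
```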

After preprocessing, we train the seq2seq model in an unsupervised learning mode. In the absence of training labels, we cannot accurately determine how well the trained models perform, so we instead train several models with different numbers of epochs, where an epoch is one complete pass of the learning algorithm over the training dataset. We train the seq2seq models \(M_{s}\) as shown in Algorithm 1 with five different numbers of epochs \(N_{e}\): 10, 20, 30, 40 and 50. We use an LSTM model with 2 hidden layers, each consisting of 128 hidden states.

Algorithm 1. Training the seq2seq models \(M_{s}\)

\(I_{p}\) is the initial path where the checkpoint files are stored. \(N_{s}\) is the number of training steps between saves of intermediate results; the default setting is 1000.
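A hedged sketch of this training procedure (Algorithm 1) in PyTorch, checkpointing every \(N_{s}\) steps; the optimizer, learning rate and batch format are illustrative assumptions.

```python
import torch

def train(model, batches, num_epochs, save_every=1000, ckpt_dir="checkpoints"):
    """Train the seq2seq model for a fixed number of epochs, saving
    intermediate checkpoints every `save_every` steps (N_s in the text)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    step = 0
    for epoch in range(num_epochs):   # one epoch = one full pass over the corpus
        for src, tgt in batches:
            # Teacher forcing: predict each target character from its prefix
            logits = model(src, tgt[:, :-1])
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tgt[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % save_every == 0:
                torch.save(model.state_dict(), f"{ckpt_dir}/model-{step}.pt")

# Five models with different epoch counts, as in Sect. 3.1:
# for n_e in (10, 20, 30, 40, 50): train(Seq2Seq(96), batches, n_e)
```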

3.2 Test Case Generation

We use the trained seq2seq model to generate new protocol messages. At the beginning of fuzzing, we always connect to the server and use the received message as the initial sequence \(I_{s}\). We then ask the seq2seq model to generate a sequence until it outputs a protocol message terminator, such as CRLF in FTP. We experiment with three different sampling strategies for message generation, detailed below.

Max at Each Step: In this sampling strategy, we pick the most probable character from the predicted probability distribution. This strategy generates protocol messages that are most likely to be well-formed, but that very property makes it unsuitable for fuzzing, since fuzzing needs test cases that deviate from well-formed messages.

Sample at Each Step: In this sampling strategy, we sample the next character from the predicted probability distribution rather than always picking the most probable one. As a result, this strategy generates multifarious new protocol messages that combine the various templates the seq2seq model has learnt from the protocol messages. Due to the sampling, the generated protocol messages will not always be well-formed, which is of great use for fuzzing.

Sample on Spaces: This sampling strategy combines the two strategies described above. It uses the most probable character in the probability distribution when the last character of the input sequence is not a space, and samples from the distribution, as in the second strategy, when the input sequence ends with a space. This strategy generates more well-formed protocol messages than the second strategy.

Algorithm 2. Test case generation

N is the number of characters in the generated sequence, and we set it randomly to generate messages of arbitrary length.
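The three strategies can be sketched together as one generation routine; `next_distribution` is an assumed helper that queries the trained model for the next-character distribution, and the vocabulary table is an illustrative placeholder.

```python
import random
import numpy as np

def generate_message(next_distribution, seed, terminator="\r\n", strategy="sample"):
    """Generate one protocol message character by character, stopping at the
    protocol terminator (CRLF for FTP) or after N characters, chosen randomly.
    `next_distribution(text)` returns the model's probability distribution
    over the vocabulary given the text generated so far (assumed helper)."""
    out = seed
    max_len = len(seed) + random.randint(8, 512)   # N is chosen randomly
    while not out.endswith(terminator) and len(out) < max_len:
        probs = next_distribution(out)
        if strategy == "max":
            idx = int(np.argmax(probs))            # best character at each step
        elif strategy == "spaces" and not out.endswith(" "):
            idx = int(np.argmax(probs))            # greedy inside tokens ...
        else:
            idx = int(np.random.choice(len(probs), p=probs))  # ... sample otherwise
        out += VOCAB_CHARS[idx]
    return out

# Maps indices back to characters (built during preprocessing); placeholder here
VOCAB_CHARS = [chr(i) for i in range(128)]
```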

4 Experimental Evaluation

4.1 Experiment Setup

In this section, we present the results of fuzzing experiments with two FTP applications, WarFTPD 1.65 and Serv-U build 4.0.0.4. We deploy the two FTP applications on two servers running Windows Server 2003. The seq2seq models are trained on a personal computer running Ubuntu 16.04. We implement a client program that communicates with the FTP server, using the test cases generated by the trained seq2seq model as input. If the program detects any error reports from the FTP server, it records the error messages in an error log, and we can then validate whether the recorded error messages are indeed able to trigger vulnerabilities. It is likewise feasible to implement a server program of the protocol in order to fuzz client applications.
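A minimal sketch of such a client harness is shown below, reusing the `generate_message` routine sketched earlier; the host, port, timeout and log format are illustrative assumptions rather than our exact implementation.

```python
import socket

def fuzz_ftp_server(host, port=21, rounds=1000, log_path="error.log"):
    """Connect to the FTP server, seed the model with the server banner, send
    a generated message, and log error replies (4xx/5xx) or missing replies."""
    with open(log_path, "a") as log:
        for _ in range(rounds):
            with socket.create_connection((host, port), timeout=10) as sock:
                # The received banner serves as the initial sequence I_s
                banner = sock.recv(4096).decode(errors="replace")
                msg = generate_message(next_distribution, banner)
                sock.sendall(msg.encode(errors="replace"))
                try:
                    reply = sock.recv(4096).decode(errors="replace")
                except socket.timeout:
                    reply = ""   # no reply may indicate a hang or crash
                if not reply or reply[:1] in ("4", "5"):
                    log.write(f"{msg!r} -> {reply!r}\n")
```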

We use three working standards to evaluate fuzzing effectiveness:

Coverage: A basic demand shared by random and more advanced grammar-based fuzzers is that instruction coverage should be as high as possible. Our method can fuzz both ends of the communication, but its coverage depends heavily on the network recordings.

Bugs: During the fuzzing process, we take advantage of the tool AppVerifier to monitor the running FTP server. AppVerifier is a free runtime monitoring tool that can catch memory corruption bugs such as buffer overflows, and it is widely used for fuzzing on Windows.

Performance Comparison: For the performance comparison, we record statistics while our fuzzer and the existing fuzzers Sulley and SPIKE run against Serv-U build 4.0.0.4. The statistics comprise Times, Time and Speed: Times is the number of test cases sent, Time is the number of minutes taken to find the bug, and Speed is the number of test cases sent per second.

4.2 Corpus

We extracted about 10,000 messages for WarFTPD and 36,000 messages for Serv-U from network recordings. Most of the network recordings were generated by normal access to the FTP servers, and part of the traffic was generated by Sulley. We used Sulley to improve instruction coverage, because normal access may not exercise less commonly used commands such as MDTM, which retrieves the modification time of a remote file.

These 10,000 messages for WarFTPD and 36,000 messages for Serv-U, which contain both client-side and server-side data, form the training corpus for the seq2seq model used in this work. We generate protocol messages with the trained seq2seq model, but the input data must reach the FTP server over the network; we therefore implement a client program that sends the generated messages to the FTP server.
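As an illustration of how such a corpus might be extracted, the following sketch uses scapy to collect FTP control-channel payloads from a pcap recording; the file names and port filter are assumptions, not details from the paper.

```python
from scapy.all import rdpcap, TCP, Raw

def extract_ftp_messages(pcap_path, port=21):
    """Collect FTP control-channel payloads (both directions) from a recording."""
    messages = []
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(TCP) and pkt.haslayer(Raw):
            tcp = pkt[TCP]
            if port in (tcp.sport, tcp.dport):   # keep client and server data
                messages.append(bytes(pkt[Raw].load))
    return messages

# Concatenate all messages into the single training file described in Sect. 2.3
with open("corpus.txt", "wb") as f:
    for m in extract_ftp_messages("warftpd.pcap"):
        f.write(m)
```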

4.3 Result

To obtain a reasonable interpretation of the coverage results, we select the network recordings of normal access to the FTP server and measure their coverage of the FTP application, to be used as a baseline for the following experiments. An important parameter when training the seq2seq model is the number of epochs; we report the results obtained after training the seq2seq model with 10, 20, 30, 40 and 50 epochs.

Coverage. Figure 6(a) and (b) show the instruction coverage obtained with sample at each step and sample on spaces from 10 to 50 epochs for WarFTPD and Serv-U. The figures also show the coverage obtained with the corresponding baseline.

Fig. 6. Coverage for WarFTPD and Serv-U from 10 to 50 epochs.

We observe the following:

  • The coverage for sample at each step and sample on spaces is above the baseline coverage for most epoch settings.

  • The trend in the coverage of WarFTPD and Serv-U from 10 to 50 epochs is quite unstable and unpredictable.

  • The best coverage with sample at each step and with sample on spaces is in both cases obtained at 40 epochs.

Bugs. Another working standard is, of course, the number of bugs found. We tested our method on the two FTP applications WarFTPD and Serv-U, and after a nearly four-day experiment, we found almost all of the already known vulnerabilities in these two applications, as Table 1 shows.

Table 1. Bugs found by fuzzing
Table 2. Performance comparison

One SMNT buffer overflow vulnerability in Serv-U was not found, owing to the incompleteness of the network traffic we used to train the seq2seq model.

Performance Comparison. Besides coverage and bugs, a third working standard of interest is the performance of our method. We compared our fuzzer with the existing fuzzers Sulley [10] and SPIKE [12]. As Table 2 shows, the efficiency of our method is slightly lower than that of the existing methods, because test case generation with the seq2seq model is time-consuming. However, Sulley and SPIKE can only be used when the specification of the protocol is available, whereas our method can fuzz proprietary network protocols whose specification and implementation code are both unavailable.

5 Related Work

Protocol Reverse Engineering. Over a decade ago, reverse engineering a network protocol was a tedious, time-consuming and manual task. Nowadays, plenty of methods have been proposed for automating protocol reverse engineering. They can be divided into two branches: on the one hand, methods that utilize the protocol implementation [16, 17], and on the other hand, methods that extract the protocol specification from network recordings only. The Protocol Informatics Project [18] uses a bioinformatics method to perform byte-sequence alignment of similar message formats. The Discoverer tool [19] presents a recursive clustering approach over tokenized messages. Biprominer [20] and ProDecoder [21], presented by Wang et al., focus on binary protocols and retrieve statistically relevant keywords and their sequencing. Based on data-mining techniques, AutoReEngine [22] reveals keywords and their positions within messages. Extracting a protocol specification is particularly difficult when the protocol implementation code is unavailable to network security staff and only network recordings are at hand. These approaches provide first means to automatically identify message field boundaries and formats, but unfortunately they are unable to relate variable fields across temporal states.

Protocol Fuzzing. Fuzzing is one of the most effective techniques for uncovering security flaws in applications by generating test cases in an automated way. Two types of fuzzing can be distinguished here: (1) black-box fuzzing [1], in which a tester can only observe the inputs and outputs of an application, and (2) white-box fuzzing [2], which allows the tester to inspect the implementation code (either binary or source code) and, for instance, take advantage of static code analysis and symbolic execution. This classification obviously applies to protocol fuzzing as well. Most well-known black-box random fuzzers today support generation-based fuzzing; for example, Peach [11] and SPIKE [12] can be used to fuzz a protocol implementation when the specification of the protocol is available, but can do no more when the protocol is unknown. Only a few approaches can fuzz a protocol when the specification and implementation code are both unavailable: AutoFuzz [13] and PULSAR [14], which both infer the protocol state machine and message formats from network traffic alone.

6 Conclusion

Finding vulnerabilities in the implementations of proprietary protocols is a challenging problem in computer security. To the best of our knowledge, this is the first attempt at black-box protocol fuzzing using neural-network learning algorithms, which makes it possible to find vulnerabilities in protocol implementations even when neither the code nor the specification is available. We presented and evaluated algorithms with different sampling strategies to automatically learn a generative model of protocol messages.

Although we have applied our method to very common network protocols, it can also find vulnerabilities in unusual implementations, such as those in embedded devices and industrial control systems. Moreover, in future work we are considering adding some form of reinforcement learning to guide the fuzzing process with coverage feedback from the application.