1 Introduction

Sign language is a form of gesture, expressed through movements of the hands and body, that deaf and mute people use to exchange information. Recently, sign language recognition (SLR) has drawn increasing attention in the computer vision community, as it provides an efficient mechanism to facilitate communication between hearing and hearing-impaired people.

SLR has been widely studied over the years. Previous work on SLR can be divided into two categories: traditional methods and deep learning based methods. Traditional methods usually adopt hand-crafted features, such as Histogram of Oriented Gradients (HOG) [1] and Histograms of Oriented Optical Flow (HOF) [2]. However, hand-crafted features are difficult to design and are usually too simple to represent sign language well. Besides, the Hidden Markov Model (HMM) is a classical method for modeling temporal information [3, 4], but it relies on the Markov assumption, which does not always hold for sign language in practice.

Recently, deep learning has achieved great success in computer vision. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two typical deep learning architectures. CNNs have a strong ability to learn effective representations of images and have led to a series of breakthroughs in many computer vision tasks, including image classification [5] and object detection [6]. RNNs are strong at modeling temporal information, so they are widely adopted in sequence modeling tasks such as speech recognition [7] and machine translation [8]. Inspired by the success of deep learning, researchers have started to apply learning-based methods to SLR, replacing hand-crafted features with learned ones. For example, in [9], a multichannel CNN is developed to enhance the features for hand posture classification. Later on, LSTM is adopted in [10, 11] to model the temporal information. All these methods regard SLR as a classification problem and neglect the inherent context relationship between sign language words. With the development of deep learning, encoder-decoder networks have been proposed for many sequence tasks. In [12], Venugopalan et al. employed a CNN for feature extraction and fed the features into an encoder-decoder network for video captioning. In [13], a unified network consisting of convolutional layers, recurrent layers and transcription layers was proposed for scene text recognition.

Fig. 1.

The proposed sequence to sequence learning framework, which consists of three modules: feature extraction, encoder-decoder modeling and model fusion.

In this paper, we propose a sequence to sequence framework for SLR. The aim of SLR is to translate a gesture into its corresponding words, which is essentially a sequence to sequence mapping problem similar to video captioning. The overall framework is illustrated in Fig. 1 and consists of three modules: feature extraction, encoder-decoder modeling and model fusion. The inputs of the framework are captured by Kinect2.0 and include RGB image sequences and skeletal coordinates. First, the image sequences are fed into a CNN to extract spatial features, which are then used as the input of an encoder-decoder module for temporal modeling. The encoder-decoder module consists of two LSTM layers: the encoder LSTM maps the input features to a fixed-length vector, while the decoder LSTM learns the context relationship of the sign language. Second, the skeletal coordinates are used as auxiliary features and are fed directly into another encoder-decoder module. Finally, in the model fusion module, a probability combination method is proposed to fuse the two models and obtain the final prediction.

The remainder of this paper is organized as follows. Sect. 2 gives a brief review of related work. The proposed sequence to sequence framework is introduced in detail in Sect. 3. The experimental results are presented in Sect. 4, and Sect. 5 concludes the paper.

2 Related Work

Sign language recognition methods can be divided into two categories: traditional methods and deep learning based methods. Traditional SLR methods usually use hand-crafted features. In [1], the HOG feature is adopted for hand shape representation. Several other researchers focus on trajectory features to realize more robust SLR [14] or gesture recognition [15]. For temporal modeling, HMM has been widely used. Starner et al. [16] utilized HMM to model American Sign Language acquired by tracking the user's unadorned hands with a single camera. In [17], Gao et al. proposed a method using self-organizing feature maps (SOFM) as a feature extractor for different signers, transforming input signs into significant, low-dimensional representations that can be well modeled by HMM. Besides HMM, other methods have also been explored. Sminchisescu et al. [18] proposed a framework for human motion recognition based on conditional random fields (CRFs) and maximum entropy Markov models (MEMMs). Lichtenauer et al. [19] presented a hybrid approach for SLR using statistical Dynamic Time Warping (DTW), assuming that time warping and classification should be separated because of conflicting likelihood modeling demands.

As for deep learning based SLR methods, Barros et al. [9] developed a multichannel CNN to enhance the features for hand posture classification. Wu et al. [20] employed dynamic neural networks to process skeleton data and depth and RGB (RGB-D) images of sign language, and then combined the network with HMM to realize gesture recognition. Later on, Pigou et al. [11] proposed an end-to-end architecture incorporating temporal convolutions and bidirectional recurrence to acquire better temporal features. More recently, Liu et al. [10] presented an end-to-end LSTM-based method for SLR that uses trajectories acquired with a Microsoft Kinect as the input of the network.

3 Proposed Method

In this section, we introduce the proposed sequence to sequence framework in detail. As shown in Fig. 1, the whole framework can be divided into three modules: feature extraction, encoder-decoder modeling and model fusion. We describe the three modules in the following subsections.

3.1 Feature Extraction

The sign language data used in this work are captured by a Microsoft Kinect2.0 and include image sequences and 3D skeletal points. The image sequences and skeletal coordinates are used jointly, since multi-modal data are beneficial for recognition. As shown in Fig. 1, we utilize a CNN to extract features from the image sequences rather than designing hand-crafted features. In this work, we use the pre-trained VGG-16 [21], a well-established model trained on the ImageNet dataset [22], as the feature extractor. The image sequences are fed into VGG-16 and the activations of the fc7 layer, a 4096-dimensional vector, are taken as the image feature.

Kinect2.0 provides skeletal tracking by recording the space coordinates (x, y, z) of 25 skeleton points of the human body, such as the hands, wrists and elbows. The 25 three-dimensional coordinates are concatenated into a 75-dimensional vector, so each frame of the image sequence corresponds to a 75-dimensional vector of space coordinates. We take this kind of data as the trajectory feature, considering that the movements of the hands and body also play an important role in recognition.
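To make the feature extraction step concrete, the sketch below extracts per-frame fc7 features and builds the 75-dimensional trajectory vectors. It is only an illustrative re-implementation: the original pipeline uses Caffe, whereas this sketch assumes torchvision's pre-trained VGG-16, and the function names (`rgb_features`, `trajectory_features`) are hypothetical.

```python
import numpy as np
import torch
from torchvision import models, transforms

vgg = models.vgg16(pretrained=True).eval()
# fc7 = output of the second 4096-unit fully connected layer (after its ReLU):
# classifier indices 0..4 cover fc6 -> ReLU -> Dropout -> fc7 -> ReLU.
fc7_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), *vgg.classifier[:5]
)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def rgb_features(frames):
    """frames: list of PIL RGB images -> (T, 4096) spatial features."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames])
        return fc7_extractor(batch).numpy()

def trajectory_features(skeletons):
    """skeletons: (T, 25, 3) Kinect joint coordinates -> (T, 75) trajectory features."""
    return np.asarray(skeletons).reshape(len(skeletons), 75)
```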

3.2 Encoder-Decoder Modeling

The Background of LSTM. RNNs are the most popular choice for sequence to sequence mapping problems, as they can model the temporal relationships within an input sequence through recurrent feedback. However, the simple RNN may suffer from the vanishing and exploding gradient problems [23], so new architectures have been proposed to alleviate them. The long short-term memory (LSTM) [24] model is an enhanced RNN architecture that implements the recurrent feedback with several learnable gates and has been widely adopted for temporal modeling. Given the input \(x_t\), the forget gate \(f_t\), input gate \(i_t\), output gate \(o_t\), input modulation gate \(\tilde{C_t}\) and memory cell \(C_t\) are computed by the following equations:

$$\begin{aligned} f_t=sigm(W_{xf}x_t+W_{hf}h_{t-1}+b_f) \end{aligned}$$
(1)
$$\begin{aligned} i_t=sigm(W_{xi}x_t+W_{hi}h_{t-1}+b_i) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{C_t}=tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) \end{aligned}$$
(3)
$$\begin{aligned} C_t= f_t \odot C_{t-1}+i_t \odot \tilde{C_t} \end{aligned}$$
(4)
$$\begin{aligned} o_t=sigm(W_{xo}x_t+W_{ho}h_{t-1}+b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t=o_t\odot tanh(C_t) \end{aligned}$$
(6)

where sigm is the sigmoid function, tanh is the hyperbolic tangent non-linearity, \(\odot \) denotes the element-wise product, and the weight matrices \(W_{ij}\) and biases \(b_j\) are the network parameters to be optimized during training.
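As a compact illustration of Eqs. (1)-(6), the following sketch computes one LSTM step in NumPy; the weight matrices and biases are assumed to be given (for example, after training), and the dictionary keys are only a naming convention for this sketch.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W and b are dicts keyed by gate name, e.g. W['xf'], W['hf'], b['f']."""
    f_t = sigm(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])          # Eq. (1): forget gate
    i_t = sigm(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])          # Eq. (2): input gate
    C_tilde = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])   # Eq. (3): input modulation
    C_t = f_t * C_prev + i_t * C_tilde                             # Eq. (4): memory cell update
    o_t = sigm(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])          # Eq. (5): output gate
    h_t = o_t * np.tanh(C_t)                                       # Eq. (6): hidden state
    return h_t, C_t
```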

Encoder-Decoder Network for SLR. Typical methods regard SLR as a classification task; in this paper we instead formulate SLR as a sequence to sequence problem, based on the following two observations:

  1. (i)

    Some isolated Chinese sign language words consist of several separate gestures, and each gesture corresponds to specific Chinese characters.

  2. (ii)

    Other Chinese sign language words cannot be further separated, but there is still an inherent context relationship among their Chinese characters.

Figure 2 provides two typical examples of these observations. Figure 2(a) shows the Chinese sign language word ‘post office’, which can be divided into two isolated gestures corresponding to the two Chinese characters ‘post’ and ‘house’: the first five frames correspond to ‘post’ and the remaining five frames to ‘house’. Figure 2(b) illustrates the sign language word ‘train’, which consists of two characters in Chinese. Although this word cannot be divided into isolated gestures, there is still a context relationship between the two characters of ‘train’ in Chinese. Considering these two characteristics of Chinese sign language, we formulate Chinese SLR as a sequence to sequence problem and, inspired by the application of LSTM to video captioning, design an encoder-decoder network to handle it.

Fig. 2.

Two typical Chinese sign language words widely used in daily life. (a) ‘post office’, (b) ‘train’.

Fig. 3.

The encoder-decoder framework. The top LSTM is the encoder, which maps the inputs to a fixed-length temporal vector, and the bottom LSTM is the decoder, which builds a language model and outputs a sequence of characters. EOS indicates the end of decoding. Zero padding is used when there is no input at a time step. (Color figure online)

The encoder-decoder network is shown in Fig. 3 and consists of two LSTMs (blue and yellow blocks) and a softmax layer. The first LSTM (blue blocks) acts as the encoder and the second LSTM (yellow blocks) acts as the decoder. At the encoding stage, the LSTM maps the input features \( (x_1,x_2, ... , x_n) \) to a fixed-length vector that contains the temporal information. After all input features of a sign language sample have been consumed, a marker indicates the beginning of decoding. At the decoding stage, the output of the encoder is fed into the second LSTM to learn the context model of sign words. We establish a vocabulary whose basic elements are Chinese characters, and the final output is a string of Chinese characters from this vocabulary. The probability \(p(z_t)\) of every character in the vocabulary at each time step t is estimated by a softmax layer:

$$\begin{aligned} p(z_t|h_t)=\frac{exp(W_{z_t}h_t)}{\sum _{z'\in V} exp(W_{z'}h_t)} \end{aligned}$$
(7)

where \(h_t\) is the output vector of the decoder LSTM at time t, \(z_t\) is a basic element (character) of the vocabulary, \(W_{z}\) denotes the softmax parameters associated with character z, and V is the whole vocabulary. The final output is estimated by the conditional probability \(p(z_1,z_2,...,z_m|x_1,x_2,...,x_n)\), which is computed by the following equation:

$$\begin{aligned} p(z_1,z_2,...,z_m|x_1,x_2,...,x_n)=\prod _{t=1}^m p(z_t|h_t,z_1,z_2,...,z_{t-1}) \end{aligned}$$
(8)
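A minimal sketch of Eqs. (7) and (8): the softmax over the character vocabulary at each decoding step, and the factorized probability of a character sequence. Here `W` is assumed to be the full softmax weight matrix (one row per character), and the max-subtraction for numerical stability is added for illustration.

```python
import numpy as np

def vocab_distribution(h_t, W):
    """Eq. (7): p(z_t | h_t) over the vocabulary; W has one row per character."""
    scores = W @ h_t
    scores -= scores.max()             # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def sequence_probability(step_distributions, character_ids):
    """Eq. (8): product over t of the probability assigned to the chosen character."""
    return float(np.prod([p[z] for p, z in zip(step_distributions, character_ids)]))
```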

Image and Trajectory Sequence Modeling. In the feature extraction stage, a CNN is utilized to extract the spatial features of the image sequences. The output of the CNN is a 4096-dimensional vector, which is first embedded into 500 dimensions by a fully connected layer trained jointly with the encoder-decoder network; this embedding reduces the model complexity. After embedding, the image feature is fed into the encoder-decoder network frame by frame. During encoding, the image feature sequence is mapped into a fixed-length representation; during decoding, the character sequence \( (z_1,z_2,...,z_m) \) is generated. In this way, we establish the image sequence model, termed \(model_{rgb}\).

For the trajectory feature, we directly use the 75-dimensional coordinate vector as the input of another encoder-decoder, which yields the trajectory model (denoted \(model_{tra}\)). This model serves as a complement to the image sequence model, because the skeletal coordinates provide important motion information of the human body.
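The sketch below is one reasonable reading of the encoder-decoder in Fig. 3, written in PyTorch rather than the authors' Caffe implementation. Layer sizes follow the paper (500-dimensional embedding, 1000 LSTM units), zero padding drives the decoder as in the figure, and the previous-character feedback of Eq. (8) is omitted for brevity; the class name `Seq2SeqSLR` is hypothetical. For the trajectory model, the 75-dimensional input could be fed to the encoder directly instead of through the embedding layer.

```python
import torch
import torch.nn as nn

class Seq2SeqSLR(nn.Module):
    """Two-LSTM encoder-decoder over per-frame features (RGB fc7 or trajectory)."""

    def __init__(self, feat_dim, vocab_size, hidden=1000, embed=500):
        super().__init__()
        self.embed_feat = nn.Linear(feat_dim, embed)     # 4096 -> 500 for model_rgb
        self.encoder = nn.LSTM(embed, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)  # softmax layer of Eq. (7)

    def forward(self, feats, max_out_len):
        # Encoding: map the feature sequence to a fixed-length state (h, c).
        enc_out, state = self.encoder(self.embed_feat(feats))
        # Decoding: run for max_out_len steps with zero padding as input,
        # initialized from the encoder state; character feedback is omitted here.
        pad = feats.new_zeros(feats.size(0), max_out_len, enc_out.size(-1))
        dec_out, _ = self.decoder(pad, state)
        return self.classifier(dec_out)                  # per-step character logits
```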

3.3 Probability Combination

Unlike data-level and feature-level fusion, where the data or features are usually heterogeneous and hard to combine, decision-level fusion is more flexible and natural. Therefore, in this paper we propose a probability combination method to fuse the image model and the trajectory model. In the recognition step, the probabilities of all characters in the vocabulary are computed by each model at time step t. The final prediction is determined by combining the two probabilities through the following equation, and the decoding output at time t is the character with the maximum combined probability.

$$\begin{aligned} p(z_t=w)=\alpha \times p_{rgb}(z_t=w)+(1-\alpha ) \times p_{tra}(z_t=w) \end{aligned}$$
(9)

where w is a character in the vocabulary, \(\alpha \) is a weighting parameter tuned on the test set, and \(p_{rgb}\) and \(p_{tra}\) are the probabilities estimated by \(model_{rgb}\) and \(model_{tra}\), respectively.
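A small sketch of the decision-level fusion of Eq. (9), assuming both models expose their per-step probability distributions over the same vocabulary; the helper name `fuse_and_decode` is illustrative.

```python
import numpy as np

def fuse_and_decode(p_rgb, p_tra, alpha=0.6, vocab=None):
    """p_rgb, p_tra: (T, |V|) per-step character probabilities from the two models."""
    p = alpha * np.asarray(p_rgb) + (1.0 - alpha) * np.asarray(p_tra)   # Eq. (9)
    ids = p.argmax(axis=1)               # most probable character at each time step
    return [vocab[i] for i in ids] if vocab is not None else ids
```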

4 Experiments

In this section, we conduct a series of experiments to evaluate the effectiveness of the proposed approach on our self-built dataset captured with Kinect2.0. First, we introduce the dataset and the experimental settings; then we present the experimental results and analysis.

4.1 Dataset and Experiment Settings

The dataset adopted in the experiments consists of 90 Chinese sign language words widely used in daily life, such as ‘airplane’ and ‘train’. Unlike a simple gesture, a sign language word is usually more complex and consists of several simple gestures; a word sample typically spans 80–120 frames. There are 100 samples for each word, all captured with Kinect2.0, so the dataset contains 9000 samples in total. For each sample, both RGB frames and skeletal coordinates are recorded: the RGB frames are of size 1920 \(\times \) 1080, and the skeletal coordinates are the (x, y, z) positions of 25 skeleton points of the human body. We divide the dataset into two parts, using 70% of the data for training and the rest for testing. We use the pre-trained VGG-16 as the spatial feature extractor: for each input image, we take the 4096-unit fc7 layer of the VGG network as the image feature and embed it into a 500-dimensional vector. For the encoder-decoder, the number of units of each LSTM is set to 1000. We train the models with the deep learning framework Caffe [25], optimizing the network parameters with stochastic gradient descent (SGD) and using the cross-entropy loss.
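The snippet below sketches the stated training setup (SGD with cross-entropy) on a toy batch, again in PyTorch for illustration rather than the Caffe configuration actually used; the vocabulary size, learning rate and momentum are assumed values, and `Seq2SeqSLR` refers to the sketch in Sect. 3.2.

```python
import torch
import torch.nn as nn

vocab_size = 200                                # assumed; characters plus EOS
model = Seq2SeqSLR(feat_dim=4096, vocab_size=vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed values

feats = torch.randn(4, 100, 4096)               # toy batch: 4 samples, 100 frames each
targets = torch.randint(0, vocab_size, (4, 3))  # toy targets: 3 characters per sample

logits = model(feats, max_out_len=targets.size(1))
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```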

Fig. 4.

Recognition accuracy with different \(\alpha \) values

4.2 Results and Analysis

With the experimental settings and data prepared, we evaluate the performance of our method on the self-built dataset. We first study the impact of different \(\alpha \) values on the final recognition accuracy, varying the parameter from 0.4 to 1; the results are illustrated in Fig. 4. The best recognition accuracy is obtained when \(\alpha \) is set to 0.6, which is the value used in the following experiments.

For comparison, two typical methods [10, 26] are selected as representative algorithms. The method in [26] uses hand shape features (HOG features) and trajectory features to represent the sign videos and employs HMM for recognition. We also implement the algorithm proposed in [10], where an LSTM is employed to learn the temporal information of the trajectory. The recognition accuracies of these methods are shown in Table 1.

Table 1. Average accuracy of the proposed method

As Table 1 shows, our method outperforms the compared methods even when only \(model_{rgb}\) is used. When the RGB model and the trajectory model are fused, the accuracy is further improved by about 1.2%. From the experimental results, we can draw the following conclusions:

  1. (i)

    The proposed sequence to sequence learning method is able to learn the inherent context relationship in sign language, which makes it more reliable for the Chinese sign language recognition task.

  2. (ii)

    Since the trajectory data provide effective and precise position information by recording the space coordinates, they can be used as an enhancement for recognition.

4.3 Results of Using Hand Region Images

We also conduct an experiment using hand region images instead of the full images as the input of the framework. The method in [27] is adopted to extract the hand region. The experimental results are shown in Table 2.

Fig. 5.

Two Chinese sign language words after hand region extraction. (a) ‘train’, (b) ‘berth’.

As can be seen from Table 2, the recognition accuracy is further improved by 3.1% compared with using the full gesture images. This is mainly because the most meaningful movements are centered on the hand region, as shown in Fig. 5. Moreover, some movements of different gestures differ only slightly, in which case the body and clothes may interfere, so features extracted by the CNN from the full images cannot be discriminated effectively during recognition. Therefore, using the hand region instead of the full images is more beneficial for recognition accuracy.

Table 2. Results using hand region images and trajectory

5 Conclusion

In this paper, a sequence to sequence framework based on CNN and LSTM is proposed for Chinese sign language recognition. The proposed framework consists of three modules: feature extraction, encoder-decoder modeling and model fusion. Specifically, a CNN is used to extract image sequence features, an encoder-decoder network builds the end-to-end temporal model, and finally the image model and the trajectory model are fused to obtain the recognition result. This framework learns not only the spatial features of the input but also the temporal information and context relationships in Chinese sign language. A series of experiments shows that the proposed method outperforms the compared methods on our dataset. In the future, we will investigate how to further improve the recognition accuracy and apply this method to continuous sign language recognition.