1 Introduction

Sign language is a form of gesture, expressed through movements of the hands and body, that deaf and mute people use to exchange information. Recently, sign language recognition (SLR) has drawn increasing attention in the computer vision community, as it provides an efficient mechanism to facilitate communication between hearing and hearing-impaired people.

SLR has been widely studied over the years. Previous work on SLR can be divided into two categories: traditional methods and deep learning based methods. Traditional methods usually adopt hand-crafted features, such as Histogram of Oriented Gradients (HOG) [1] and Histograms of Oriented Optical Flow (HOF) [2]. However, hand-crafted features are difficult to design and are usually too simple to represent sign language well. Besides, the Hidden Markov Model (HMM) is a classical method for modeling temporal information [3, 4], but it relies on the Markov assumption, which does not always hold for sign language in practice.

Recently, deep learning has achieved great success in computer vision. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two typical deep learning architectures. CNNs have a strong ability to learn effective representations of images and have led to a series of breakthroughs in many computer vision tasks, including image classification [5] and object detection [6]. RNNs are strong at modeling temporal information, so they are widely adopted in sequence modeling tasks such as speech recognition [7] and machine translation [8]. Inspired by the success of deep learning, researchers have started to apply learning-based methods to SLR, replacing hand-crafted features with learned ones. For example, in [9], a multichannel CNN is developed to enhance the features for hand posture classification. Later on, LSTM is adopted in [10, 11] to model the temporal information. All these methods regard SLR as a classification problem and neglect the inherent context relationship between sign language words. With the development of deep learning, encoder-decoder networks have been proposed for many sequence tasks. In [12], Venugopalan et al. employed a CNN for feature extraction and fed the features into an encoder-decoder network for video captioning. In [13], a unified network consisting of convolutional layers, recurrent layers and transcription layers was proposed for scene text recognition.

Fig. 1.

The proposed sequence to sequence learning framework, which consists of three modules: feature extraction, encoder-decoder modeling and model fusion.

In this paper, we propose a sequence to sequence framework for SLR. The aim of SLR is to translate a gesture into its corresponding words, which is essentially a sequence to sequence mapping problem similar to video captioning. The overall framework is illustrated in Fig. 1 and consists of three modules: feature extraction, encoder-decoder modeling and model fusion. The inputs of the framework are captured by Kinect2.0 and include RGB image sequences and skeletal coordinates. First, the image sequences are fed into a CNN to extract spatial features, which are then used as the input of an encoder-decoder module for temporal modeling. The encoder-decoder module consists of two LSTM layers: the encoder LSTM maps the input features to a fixed-length vector, while the decoder LSTM learns the context relationship of the sign language. Second, the skeletal coordinates are used as auxiliary features and are fed directly into another encoder-decoder module. Finally, in the model fusion module, a probability combination method is proposed to fuse the two models and obtain the final prediction.

The remainder of this paper is organized as follows. Sect. 2 gives a brief review of related work. The proposed sequence to sequence framework is introduced in detail in Sect. 3. The experimental results are presented in Sect. 4, and Sect. 5 concludes the paper.

2 Related Work

Sign language recognition methods can be divided into two categories: traditional methods and deep learning based methods. Traditional SLR methods usually use hand-crafted features. In [1], the HOG feature is adopted for hand shape representation. Several other researchers focus on trajectory features to realize more robust SLR [14] or gesture recognition [15]. For temporal modeling, HMM has been widely used. Starner et al. [16] utilized HMM to model American Sign Language acquired by tracking the user's unadorned hands with a single camera. In [17], Gao et al. proposed a method using self-organizing feature maps (SOFM) as a feature extractor for different signers, transforming input signs into significant, low-dimensional representations that can be well modeled by HMM. Besides HMM, other methods have also been explored. Sminchisescu et al. [18] proposed a framework for human motion recognition based on conditional random fields (CRFs) and maximum entropy Markov models (MEMMs). Lichtenauer et al. [19] presented a hybrid approach for SLR using statistical Dynamic Time Warping (DTW), assuming that time warping and classification should be separated because of conflicting likelihood modeling demands.

As for deep learning based SLR methods, Barros et al. [9] developed a multichannel CNN to enhance the features for hand posture classification. Wu et al. [20] employed dynamic neural networks to process skeleton data and depth and RGB (RGB-D) images of sign language, and then combined the network with HMM to realize gesture recognition. Later on, Pigou et al. [11] proposed an end-to-end architecture incorporating temporal convolutions and bidirectional recurrence to acquire better temporal features. More recently, Liu et al. [10] presented an end-to-end LSTM-based method for SLR that uses trajectories acquired with a Microsoft Kinect as the input of the network.

3 Proposed Method

In this section, we introduce the proposed sequence to sequence framework in detail. As shown in Fig. 1, the whole framework can be divided into three modules: feature extraction, encoder-decoder modeling and model fusion. We describe the three modules in the following subsections.

3.1 Feature Extraction

The sign language data used in this work are captured by a Microsoft Kinect2.0 and include image sequences and 3D skeletal points. The image sequences and skeletal coordinates are used jointly, since multi-modal data are beneficial for recognition. As shown in Fig. 1, we utilize a CNN to extract features from the image sequences rather than designing hand-crafted features. In this work, we use the pre-trained VGG-16 [21], a well-established model trained on the ImageNet dataset [22], as the feature extractor. The image sequences are fed into VGG-16 and the activations of the fc7 layer, a 4096-dimensional vector, are taken as the image feature.

Kinect2.0 provides skeletal tracking by recording the space coordinates (x, y, z) of 25 skeleton points of the human body, such as the hands, wrists and elbows. The 25 three-dimensional coordinates are concatenated into a 75-dimensional vector, so each frame of the image sequence corresponds to a 75-dimensional vector of space coordinates. We take this kind of data as the trajectory feature, considering that the movements of the hands and body also play an important role in recognition.
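To make the feature extraction step concrete, the sketch below extracts per-frame fc7 features and builds the 75-dimensional trajectory vectors. It is only an illustrative re-implementation: the original pipeline uses Caffe, whereas this sketch assumes torchvision's pre-trained VGG-16, and the function names (`rgb_features`, `trajectory_features`) are hypothetical.

```python
import numpy as np
import torch
from torchvision import models, transforms

vgg = models.vgg16(pretrained=True).eval()
# fc7 = output of the second 4096-unit fully connected layer (after its ReLU):
# classifier indices 0..4 cover fc6 -> ReLU -> Dropout -> fc7 -> ReLU.
fc7_extractor = torch.nn.Sequential(
    vgg.features, vgg.avgpool, torch.nn.Flatten(), *vgg.classifier[:5]
)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def rgb_features(frames):
    """frames: list of PIL RGB images -> (T, 4096) spatial features."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in frames])
        return fc7_extractor(batch).numpy()

def trajectory_features(skeletons):
    """skeletons: (T, 25, 3) Kinect joint coordinates -> (T, 75) trajectory features."""
    return np.asarray(skeletons).reshape(len(skeletons), 75)
```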

3.2 Encoder-Decoder Modeling

The Background of LSTM. RNNs are the most popular choice for sequence to sequence mapping problems, as they can model the temporal relationships within an input sequence through recurrent feedback. However, the simple RNN may suffer from the vanishing and exploding gradient problems [23], so new architectures have been proposed to alleviate them. The long short-term memory (LSTM) [24] model is an enhanced RNN architecture that implements the recurrent feedback with several learnable gates and has been widely adopted for temporal modeling. Given the input \(x_t\), the forget gate \(f_t\), input gate \(i_t\), output gate \(o_t\), input modulation gate \(\tilde{C_t}\) and memory cell \(C_t\) are computed by the following equations:

$$\begin{aligned} f_t=sigm(W_{xf}x_t+W_{hf}h_{t-1}+b_f) \end{aligned}$$
(1)
$$\begin{aligned} i_t=sigm(W_{xi}x_t+W_{hi}h_{t-1}+b_i) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{C_t}=tanh(W_{xc}x_t+W_{hc}h_{t-1}+b_c) \end{aligned}$$
(3)
$$\begin{aligned} C_t= f_t \odot C_{t-1}+i_t \odot \tilde{C_t} \end{aligned}$$
(4)
$$\begin{aligned} o_t=sigm(W_{xo}x_t+W_{ho}h_{t-1}+b_o) \end{aligned}$$
(5)
$$\begin{aligned} h_t=o_t\odot tanh(C_t) \end{aligned}$$
(6)

where sigm is the sigmoid function, tanh is the hyperbolic tangent non-linearity, \(\odot \) denotes the element-wise product, and the weight matrices \(W_{ij}\) and biases \(b_j\) are the network parameters to be optimized during training.
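As a compact illustration of Eqs. (1)-(6), the following sketch computes one LSTM step in NumPy; the weight matrices and biases are assumed to be given (for example, after training), and the dictionary keys are only a naming convention for this sketch.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W and b are dicts keyed by gate name, e.g. W['xf'], W['hf'], b['f']."""
    f_t = sigm(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])          # Eq. (1): forget gate
    i_t = sigm(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])          # Eq. (2): input gate
    C_tilde = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])   # Eq. (3): input modulation
    C_t = f_t * C_prev + i_t * C_tilde                             # Eq. (4): memory cell update
    o_t = sigm(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])          # Eq. (5): output gate
    h_t = o_t * np.tanh(C_t)                                       # Eq. (6): hidden state
    return h_t, C_t
```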

Encoder-Decoder Network for SLR. Typical methods regard SLR as a classification task; in this paper we instead formulate SLR as a sequence to sequence problem, based on the following two observations:

  1. (i)

    Some isolated Chinese sign language words consist of several separate gestures, and each gesture corresponds to specific Chinese characters.

  2. (ii)

    Other Chinese sign language words cannot be further separated, but there is still an inherent context relationship among their Chinese characters.

Figure 2 provides two typical examples of these observations. Figure 2(a) shows the Chinese sign language word ‘post office’, which can be divided into two isolated gestures corresponding to the two Chinese characters ‘post’ and ‘house’: the first five frames correspond to ‘post’ and the remaining five frames to ‘house’. Figure 2(b) illustrates the sign language word ‘train’, which consists of two characters in Chinese. Although this word cannot be divided into isolated gestures, there is still a context relationship between the two characters of ‘train’ in Chinese. Considering these two characteristics of Chinese sign language, we formulate Chinese SLR as a sequence to sequence problem and, inspired by the application of LSTM to video captioning, design an encoder-decoder network to handle it.

Fig. 2.

Two typical Chinese sign language words widely used in daily life. (a) ‘post office’, (b) ‘train’.

Fig. 3.

The encoder-decoder framework. The top LSTM is the encoder, which maps the inputs to a fixed-length temporal vector, and the bottom LSTM is the decoder, which builds a language model and outputs a sequence of characters. EOS indicates the end of decoding. Zero padding is used when there is no input at a time step. (Color figure online)

The encoder-decoder network is shown in Fig. 3 and consists of two LSTMs (blue and yellow blocks) and a softmax layer. The first LSTM (blue blocks) acts as the encoder and the second LSTM (yellow blocks) acts as the decoder. At the encoding stage, the LSTM maps the input features \( (x_1,x_2, ... , x_n) \) to a fixed-length vector that contains the temporal information. After all input features of a sign language sample have been consumed, a marker indicates the beginning of decoding. At the decoding stage, the output of the encoder is fed into the second LSTM to learn the context model of sign words. We establish a vocabulary whose basic elements are Chinese characters, and the final output is a string of Chinese characters from this vocabulary. The probability \(p(z_t)\) of every character in the vocabulary at each time step t is estimated by a softmax layer:

$$\begin{aligned} p(z_t|h_t)=\frac{exp(W_{z_t}h_t)}{\sum _{z'\in V} exp(W_{z'}h_t)} \end{aligned}$$
(7)

where \(h_t\) is the output vector of the decoder LSTM at time t, \(z_t\) is a basic element (character) of the vocabulary, \(W_{z}\) denotes the softmax parameters associated with character z, and V is the whole vocabulary. The final output is estimated by the conditional probability \(p(z_1,z_2,...,z_m|x_1,x_2,...,x_n)\), which is computed by the following equation:

$$\begin{aligned} p(z_1,z_2,...,z_m|x_1,x_2,...,x_n)=\prod _{t=1}^m p(z_t|h_t,z_1,z_2,...,z_{t-1}) \end{aligned}$$
(8)
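A minimal sketch of Eqs. (7) and (8): the softmax over the character vocabulary at each decoding step, and the factorized probability of a character sequence. Here `W` is assumed to be the full softmax weight matrix (one row per character), and the max-subtraction for numerical stability is added for illustration.

```python
import numpy as np

def vocab_distribution(h_t, W):
    """Eq. (7): p(z_t | h_t) over the vocabulary; W has one row per character."""
    scores = W @ h_t
    scores -= scores.max()             # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def sequence_probability(step_distributions, character_ids):
    """Eq. (8): product over t of the probability assigned to the chosen character."""
    return float(np.prod([p[z] for p, z in zip(step_distributions, character_ids)]))
```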

Image and Trajectory Sequence Modeling. In the feature extraction stage, a CNN is utilized to extract the spatial features of the image sequences. The output of the CNN is a 4096-dimensional vector, which is first embedded into 500 dimensions by a fully connected layer trained jointly with the encoder-decoder network; this embedding reduces the model complexity. After embedding, the image feature is fed into the encoder-decoder network frame by frame. During encoding, the image feature sequence is mapped into a fixed-length representation; during decoding, the character sequence \( (z_1,z_2,...,z_m) \) is generated. In this way, we establish the image sequence model, termed \(model_{rgb}\).

For the trajectory feature, we directly use the 75-dimensional coordinate vector as the input of another encoder-decoder, which yields the trajectory model (denoted \(model_{tra}\)). This model serves as a complement to the image sequence model, because the skeletal coordinates provide important motion information of the human body.
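The sketch below is one reasonable reading of the encoder-decoder in Fig. 3, written in PyTorch rather than the authors' Caffe implementation. Layer sizes follow the paper (500-dimensional embedding, 1000 LSTM units), zero padding drives the decoder as in the figure, and the previous-character feedback of Eq. (8) is omitted for brevity; the class name `Seq2SeqSLR` is hypothetical. For the trajectory model, the 75-dimensional input could be fed to the encoder directly instead of through the embedding layer.

```python
import torch
import torch.nn as nn

class Seq2SeqSLR(nn.Module):
    """Two-LSTM encoder-decoder over per-frame features (RGB fc7 or trajectory)."""

    def __init__(self, feat_dim, vocab_size, hidden=1000, embed=500):
        super().__init__()
        self.embed_feat = nn.Linear(feat_dim, embed)     # 4096 -> 500 for model_rgb
        self.encoder = nn.LSTM(embed, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)  # softmax layer of Eq. (7)

    def forward(self, feats, max_out_len):
        # Encoding: map the feature sequence to a fixed-length state (h, c).
        enc_out, state = self.encoder(self.embed_feat(feats))
        # Decoding: run for max_out_len steps with zero padding as input,
        # initialized from the encoder state; character feedback is omitted here.
        pad = feats.new_zeros(feats.size(0), max_out_len, enc_out.size(-1))
        dec_out, _ = self.decoder(pad, state)
        return self.classifier(dec_out)                  # per-step character logits
```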

3.3 Probability Combination

Unlike data-level and feature-level fusion, where the data or features are usually heterogeneous and hard to combine, decision-level fusion is more flexible and natural. Therefore, in this paper we propose a probability combination method to fuse the image model and the trajectory model. In the recognition step, the probabilities of all characters in the vocabulary are computed by each model at time step t. The final prediction is determined by combining the two probabilities through the following equation, and the decoding output at time t is the character with the maximum combined probability.

$$\begin{aligned} p(z_t=w)=\alpha \times p_{rgb}(z_t=w)+(1-\alpha ) \times p_{tra}(z_t=w) \end{aligned}$$
(9)

where w is a character in the vocabulary, \(\alpha \) is a weighting parameter tuned on the test set, and \(p_{rgb}\) and \(p_{tra}\) are the probabilities estimated by \(model_{rgb}\) and \(model_{tra}\), respectively.
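A small sketch of the decision-level fusion of Eq. (9), assuming both models expose their per-step probability distributions over the same vocabulary; the helper name `fuse_and_decode` is illustrative.

```python
import numpy as np

def fuse_and_decode(p_rgb, p_tra, alpha=0.6, vocab=None):
    """p_rgb, p_tra: (T, |V|) per-step character probabilities from the two models."""
    p = alpha * np.asarray(p_rgb) + (1.0 - alpha) * np.asarray(p_tra)   # Eq. (9)
    ids = p.argmax(axis=1)               # most probable character at each time step
    return [vocab[i] for i in ids] if vocab is not None else ids
```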

4 Experiments

In this section, we conduct a series of experiments to evaluate the effectiveness of the proposed approach on our self-built dataset captured with Kinect2.0. First, we introduce the dataset and the experimental settings; then we present the experimental results and analysis.

4.1 Dataset and Experiment Settings

The dataset adopted in the experiments consists of 90 Chinese sign language words widely used in daily life, such as ‘airplane’ and ‘train’. Unlike a simple gesture, a sign language word is usually more complex and consists of several simple gestures; a word sample typically spans 80–120 frames. There are 100 samples for each word, all captured with Kinect2.0, so the dataset contains 9000 samples in total. For each sample, both RGB frames and skeletal coordinates are recorded: the RGB frames are of size 1920 \(\times \) 1080, and the skeletal coordinates are the (x, y, z) positions of 25 skeleton points of the human body. We divide the dataset into two parts, using 70% of the data for training and the rest for testing. We use the pre-trained VGG-16 as the spatial feature extractor: for each input image, we take the 4096-unit fc7 layer of the VGG network as the image feature and embed it into a 500-dimensional vector. For the encoder-decoder, the number of units of each LSTM is set to 1000. We train the models with the deep learning framework Caffe [25], optimizing the network parameters with stochastic gradient descent (SGD) and using the cross-entropy loss.
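The snippet below sketches the stated training setup (SGD with cross-entropy) on a toy batch, again in PyTorch for illustration rather than the Caffe configuration actually used; the vocabulary size, learning rate and momentum are assumed values, and `Seq2SeqSLR` refers to the sketch in Sect. 3.2.

```python
import torch
import torch.nn as nn

vocab_size = 200                                # assumed; characters plus EOS
model = Seq2SeqSLR(feat_dim=4096, vocab_size=vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed values

feats = torch.randn(4, 100, 4096)               # toy batch: 4 samples, 100 frames each
targets = torch.randint(0, vocab_size, (4, 3))  # toy targets: 3 characters per sample

logits = model(feats, max_out_len=targets.size(1))
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```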

Fig. 4.

Recognition accuracy with different \(\alpha \) values

4.2 Results and Analysis

With the experimental settings and data prepared, we evaluate the performance of our method on the self-built dataset. We first study the impact of different \(\alpha \) values on the final recognition accuracy, varying the parameter from 0.4 to 1; the results are illustrated in Fig. 4. The best recognition accuracy is obtained when \(\alpha \) is set to 0.6, which is the value used in the following experiments.

For comparison, two typical methods [10, 26] are selected as representative algorithms. The method in [26] uses hand shape features (HOG features) and trajectory features to represent the sign videos and employs HMM for recognition. We also implement the algorithm proposed in [10], where an LSTM is employed to learn the temporal information of the trajectory. The recognition accuracies of these methods are shown in Table 1.

Table 1. Average accuracy of the proposed method

As Table 1 shows, our method outperforms the compared methods even when only \(model_{rgb}\) is used. When the RGB model and the trajectory model are fused, the accuracy is further improved by about 1.2%. From the experimental results, we can draw the following conclusions:

  1. (i)

    The proposed sequence to sequence learning method is able to learn the inherent context relationship in sign language, which makes it more reliable for the Chinese sign language recognition task.

  2. (ii)

    Since the trajectory data provide effective and precise position information by recording the space coordinates, they can be used as an enhancement for recognition.

4.3 Results of Using Hand Region Images

We also conduct an experiment using hand region images instead of the full images as the input of the framework. The method in [27] is adopted to extract the hand region. The experimental results are shown in Table 2.

Fig. 5.

Two Chinese sign language words after hand region extraction. (a) ‘train’, (b) ‘berth’.

As can be seen from Table 2, the recognition accuracy is further improved by 3.1% compared with using the full gesture images. This is mainly because the most meaningful movements are centered on the hand region, as shown in Fig. 5. Moreover, some movements of different gestures differ only slightly, in which case the body and clothes may interfere, so features extracted by the CNN from the full images cannot be discriminated effectively during recognition. Therefore, using the hand region instead of the full images is more beneficial for recognition accuracy.

Table 2. Results using hand region images and trajectory

5 Conclusion

In this paper, a sequence to sequence framework based on CNN and LSTM is proposed for Chinese sign language recognition. The proposed framework consists of three modules: feature extraction, encoder-decoder modeling and model fusion. Specifically, a CNN is used to extract image sequence features, an encoder-decoder network builds the end-to-end temporal model, and finally the image model and the trajectory model are fused to obtain the recognition result. This framework learns not only the spatial features of the input but also the temporal information and context relationships in Chinese sign language. A series of experiments shows that the proposed method outperforms the compared methods on our dataset. In the future, we will investigate how to further improve the recognition accuracy and apply this method to continuous sign language recognition.