Abstract
This paper addresses isolated gesture recognition from RGB-D videos. Our main idea is to design an algorithm that extracts both global and local information from multi-modality inputs. To this end, we propose a novel attention-based method built on a 3D convolutional neural network (CNN) for isolated gesture recognition. The method has two parts. The first is a global and local spatial-attention network (GLSANet), which extracts efficient features from multi-modality inputs simultaneously by combining global information, which captures the context of the frame, with local information, which captures the hand/arm actions of the person. The second is an adaptive model fusion strategy that fuses the probabilities predicted from the multi-modality inputs. Experiments demonstrate that the proposed method achieves state-of-the-art performance on the IsoGD dataset.
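The fusion step can be illustrated with a minimal sketch. The abstract does not specify how the adaptive weights are obtained, so the code below assumes the simplest possible form: a normalized weighted average of per-modality class probabilities, with weights supplied by the caller (for example, derived from validation accuracy). The function names `fuse_probabilities` and `predict`, the modality keys, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def fuse_probabilities(
    probs: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-modality class probabilities.

    Assumption: the weights are given by the caller. How the paper's
    adaptive strategy chooses them is not stated in the abstract.
    """
    if not probs:
        raise ValueError("need at least one modality")
    total = sum(weights[m] for m in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(next(iter(probs.values())))
    fused = [0.0] * num_classes
    for modality, p in probs.items():
        if len(p) != num_classes:
            raise ValueError("all modalities must score the same classes")
        w = weights[modality] / total  # normalize so the result sums to 1
        for k, value in enumerate(p):
            fused[k] += w * value
    return fused


def predict(probs: Dict[str, List[float]], weights: Dict[str, float]) -> int:
    """Return the index of the class with the highest fused probability."""
    fused = fuse_probabilities(probs, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical three-class example with two modalities.
example_probs = {
    "rgb": [0.6, 0.3, 0.1],
    "depth": [0.2, 0.7, 0.1],
}
# Hypothetical weights, e.g. each stream's validation accuracy.
example_weights = {"rgb": 0.55, "depth": 0.62}
```

Because the weights are normalized inside the function, the fused vector remains a probability distribution whenever each input vector is one, so the streams can be weighted on any positive scale.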
Acknowledgments
This work has been partially supported by the Chinese National Natural Science Foundation Projects #61876179 and #61872367, and by the Science and Technology Development Fund of Macau (Grant No. 0025/2018/A1). We acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Yuan, Q. et al. (2019). Global and Local Spatial-Attention Network for Isolated Gesture Recognition. In: Sun, Z., He, R., Feng, J., Shan, S., Guo, Z. (eds) Biometric Recognition. CCBR 2019. Lecture Notes in Computer Science, vol 11818. Springer, Cham. https://doi.org/10.1007/978-3-030-31456-9_10
DOI: https://doi.org/10.1007/978-3-030-31456-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31455-2
Online ISBN: 978-3-030-31456-9
eBook Packages: Computer Science (R0)