Abstract
With the rapid development of computer vision, sign language recognition (SLR) can help bridge the communication gap for deaf people. In this paper, we propose a novel deep reinforcement learning model for isolated SLR that imitates the dynamic attention of humans: it selectively attends to the keyframes of a video and excludes noise from the redundant frames. Because the resulting sequence of interactions is non-differentiable, we formulate the learning of dynamic attention as a Partially Observable Markov Decision Process (POMDP). The proposed model adopts Inflated 3D ConvNets (I3D) as the feature learner. Following the policy learned by deep reinforcement learning, the model "observes" a clip of the video to infer the positions of keyframes and moves its focus for the next observation. As a result, dynamic attention excludes interference from redundant frames and improves recognition performance. We validate the effectiveness of the proposed method and compare it with benchmark methods on the Chinese Sign Language dataset.
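The observe-then-move loop described in the abstract can be sketched as a recurrent glimpse policy. In this minimal sketch, a linear encoder stands in for the I3D feature learner, a GRU cell maintains the belief state of the POMDP, a stochastic policy head emits the position of the next clip to observe, and a classifier predicts the sign after a fixed number of glimpses. All layer sizes, the Gaussian policy parameterisation, and the glimpse count are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class DynamicAttentionSLR(nn.Module):
    """Sketch of dynamic-attention keyframe selection for isolated SLR."""

    def __init__(self, feat_dim=64, hidden=128, num_classes=100):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden)   # stand-in for the I3D feature learner
        self.rnn = nn.GRUCell(hidden, hidden)        # belief state over past observations
        self.policy = nn.Linear(hidden, 2)           # mean / log-std of the next focus position
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, clips, num_glimpses=4):
        # clips: (batch, num_clips, feat_dim) pre-extracted clip features
        b, n, _ = clips.shape
        h = clips.new_zeros(b, self.rnn.hidden_size)
        pos = torch.full((b,), 0.5)                  # start by observing the middle of the video
        log_probs = []
        for _ in range(num_glimpses):
            idx = (pos.clamp(0, 1) * (n - 1)).round().long()
            obs = clips[torch.arange(b), idx]        # "observe" one clip at the focused position
            h = self.rnn(torch.relu(self.encoder(obs)), h)
            mean, log_std = self.policy(h).unbind(-1)
            dist = torch.distributions.Normal(torch.sigmoid(mean), log_std.exp())
            pos = dist.sample()                      # move the focus via the stochastic policy
            log_probs.append(dist.log_prob(pos))
        # log-probs would feed a REINFORCE-style update with classification reward
        return self.classifier(h), torch.stack(log_probs, dim=1)


model = DynamicAttentionSLR()
clips = torch.randn(2, 16, 64)                       # 2 videos, 16 clips each
logits, log_probs = model(clips)
```

Because glimpse positions are sampled rather than computed differentiably, the policy head would be trained with a policy-gradient estimator such as REINFORCE, using the classification reward, while the classifier trains with an ordinary cross-entropy loss.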
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lin, S., Fang, Y., Wang, L. (2023). Dynamic Attention for Isolated Sign Language Recognition with Reinforcement Learning. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14356. Springer, Cham. https://doi.org/10.1007/978-3-031-46308-2_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46307-5
Online ISBN: 978-3-031-46308-2
eBook Packages: Computer Science, Computer Science (R0)