
A deep co-attentive hand-based video question answering framework using multi-view skeleton

Published in: Multimedia Tools and Applications

Abstract

In this paper, we present a novel hand-based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing a Single Shot Detector (SSD), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and a co-attention mechanism, with RGB videos as the inputs. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation to obtain efficient multi-view features from the hand object using a combination of five 3DCNNs and one LSTM network. To obtain the question embedding, we use the BERT model in the language block. Finally, we apply a co-attention mechanism to the vision and language features to predict the final answer. For the first time, we propose such a hand-based Video-QA framework, combining multi-view hand skeleton features with the question embedding and a co-attention mechanism. Our framework can process an arbitrary number of questions in the dataset annotations. The framework suits different application domains; here, we apply it to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we perform a step-by-step analysis of the impact of hand detection and the multi-view hand skeleton on model performance. Evaluation results on five datasets, two in Video-QA, two in dynamic hand gesture recognition, and one in hand action recognition, show that MV-VQA outperforms state-of-the-art alternatives.
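To make the three-block design concrete, below is a minimal PyTorch-style sketch of the pipeline described in the abstract. It is an illustrative assumption only: the module names (ViewEncoder3D, MVVQASketch), layer sizes, number of attention heads, and the exact fusion are not taken from the authors' implementation, and an SSD hand detector is assumed to have cropped the hand views upstream.

```python
# Illustrative sketch of the MV-VQA three-block design (vision, language, attention).
# All sizes and fusion details are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn


class ViewEncoder3D(nn.Module):
    """One small 3D-CNN per projected hand-skeleton view (the paper uses five views)."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 4, 4)),            # -> (B, 32, 4, 4, 4)
        )
        self.fc = nn.Linear(32 * 4 * 4 * 4, out_dim)

    def forward(self, x):                                # x: (B, 3, T, H, W)
        return self.fc(self.conv(x).flatten(1))          # -> (B, out_dim)


class MVVQASketch(nn.Module):
    """Vision (3D-CNNs + LSTM), language (BERT embeddings), and co-attention blocks."""
    def __init__(self, num_views=5, feat_dim=256, text_dim=768, num_answers=100):
        super().__init__()
        # Vision block: one 3D-CNN per view, then an LSTM over the view features.
        self.view_encoders = nn.ModuleList([ViewEncoder3D(feat_dim) for _ in range(num_views)])
        self.view_lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # Language block: BERT token embeddings (text_dim) projected to the joint space.
        self.text_proj = nn.Linear(text_dim, feat_dim)
        # Attention block: question-guided attention over the multi-view hand features.
        self.co_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * feat_dim, num_answers)

    def forward(self, view_clips, question_emb):
        # view_clips: list of num_views tensors, each (B, 3, T, H, W), already
        # cropped around the hand (an SSD hand detector is assumed upstream).
        # question_emb: (B, L, text_dim) token embeddings from BERT.
        views = torch.stack([enc(v) for enc, v in zip(self.view_encoders, view_clips)], dim=1)
        vision, _ = self.view_lstm(views)                # (B, num_views, feat_dim)
        text = self.text_proj(question_emb)              # (B, L, feat_dim)
        attended, _ = self.co_attn(query=text, key=vision, value=vision)
        fused = torch.cat([attended.mean(dim=1), vision.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # answer logits


if __name__ == "__main__":
    model = MVVQASketch()
    clips = [torch.randn(2, 3, 8, 32, 32) for _ in range(5)]   # five dummy hand views
    questions = torch.randn(2, 12, 768)                        # dummy BERT embeddings
    print(model(clips, questions).shape)                       # torch.Size([2, 100])
```

Under these assumptions, training would proceed with a standard cross-entropy loss over the answer vocabulary, with the vision block consuming the multi-view skeleton projections and BERT supplying the question embedding.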


Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.


Acknowledgments

This work has been partially supported by the Spanish project PID2019-105093GB-I00, ICREA under the ICREA Academia programme, and the High Intelligent Solution (HIS) company in Iran.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Corresponding author

Correspondence to Kourosh Kiani.

Ethics declarations

Competing interests

The authors certify that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

In this appendix, we present additional experimental results for our framework.

Table 13 Analysis of different views in the MV-VQA framework: single-view projection of skeleton (SVPS), two-view projection of skeleton (TVPS), and three-view projection of skeleton (THVPS)
Table 14 Details of the parameters used in the MV-VQA framework
Table 15 Hand detection analysis of the proposed models using different vision and language models. The results of our final model, Model 6, are shown in bold
Table 16 Confusion matrix of the proposed model on isoGD dataset
Table 17 Confusion matrix of the proposed model on RKS-PERSIANSIGN dataset
Table 18 Confusion matrix of the proposed model on First-Person dataset

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. A deep co-attentive hand-based video question answering framework using multi-view skeleton. Multimed Tools Appl 82, 1401–1429 (2023). https://doi.org/10.1007/s11042-022-13573-w

