
A deep co-attentive hand-based video question answering framework using multi-view skeleton

Published in: Multimedia Tools and Applications

Abstract

In this paper, we present a novel hand-based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing a Single Shot Detector (SSD), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and a co-attention mechanism, with RGB videos as the inputs. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation to obtain efficient multi-view features from the hand object using a combination of five 3DCNNs and one LSTM network. To obtain the question embedding, we use the BERT model in the language block. Finally, we apply a co-attention mechanism to the vision and language features to predict the final answer. For the first time, we propose such a hand-based Video-QA framework, combining multi-view hand skeleton features with the question embedding and a co-attention mechanism. Our framework can process an arbitrary number of questions in the dataset annotations. The framework suits different application domains; here, we apply it to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we perform a step-by-step analysis of the impact of hand detection and the multi-view hand skeleton on model performance. Evaluation results on five datasets, two in Video-QA, two in dynamic hand gesture recognition, and one in hand action recognition, show that MV-VQA outperforms state-of-the-art alternatives.
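To make the three-block design concrete, below is a minimal PyTorch-style sketch of the pipeline described in the abstract. It is an illustrative assumption only: the module names (ViewEncoder3D, MVVQASketch), layer sizes, number of attention heads, and the exact fusion are not taken from the authors' implementation, and an SSD hand detector is assumed to have cropped the hand views upstream.

```python
# Illustrative sketch of the MV-VQA three-block design (vision, language, attention).
# All sizes and fusion details are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn


class ViewEncoder3D(nn.Module):
    """One small 3D-CNN per projected hand-skeleton view (the paper uses five views)."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 4, 4)),            # -> (B, 32, 4, 4, 4)
        )
        self.fc = nn.Linear(32 * 4 * 4 * 4, out_dim)

    def forward(self, x):                                # x: (B, 3, T, H, W)
        return self.fc(self.conv(x).flatten(1))          # -> (B, out_dim)


class MVVQASketch(nn.Module):
    """Vision (3D-CNNs + LSTM), language (BERT embeddings), and co-attention blocks."""
    def __init__(self, num_views=5, feat_dim=256, text_dim=768, num_answers=100):
        super().__init__()
        # Vision block: one 3D-CNN per view, then an LSTM over the view features.
        self.view_encoders = nn.ModuleList([ViewEncoder3D(feat_dim) for _ in range(num_views)])
        self.view_lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # Language block: BERT token embeddings (text_dim) projected to the joint space.
        self.text_proj = nn.Linear(text_dim, feat_dim)
        # Attention block: question-guided attention over the multi-view hand features.
        self.co_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * feat_dim, num_answers)

    def forward(self, view_clips, question_emb):
        # view_clips: list of num_views tensors, each (B, 3, T, H, W), already
        # cropped around the hand (an SSD hand detector is assumed upstream).
        # question_emb: (B, L, text_dim) token embeddings from BERT.
        views = torch.stack([enc(v) for enc, v in zip(self.view_encoders, view_clips)], dim=1)
        vision, _ = self.view_lstm(views)                # (B, num_views, feat_dim)
        text = self.text_proj(question_emb)              # (B, L, feat_dim)
        attended, _ = self.co_attn(query=text, key=vision, value=vision)
        fused = torch.cat([attended.mean(dim=1), vision.mean(dim=1)], dim=-1)
        return self.classifier(fused)                    # answer logits


if __name__ == "__main__":
    model = MVVQASketch()
    clips = [torch.randn(2, 3, 8, 32, 32) for _ in range(5)]   # five dummy hand views
    questions = torch.randn(2, 12, 768)                        # dummy BERT embeddings
    print(model(clips, questions).shape)                       # torch.Size([2, 100])
```

Under these assumptions, training would proceed with a standard cross-entropy loss over the answer vocabulary, with the vision block consuming the multi-view skeleton projections and BERT supplying the question embedding.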


Data availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.


Acknowledgments

This work has been partially supported by the Spanish project PID2019-105093GB-I00, ICREA under the ICREA Academia programme, and the High Intelligent Solution (HIS) company in Iran.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Corresponding author

Correspondence to Kourosh Kiani.

Ethics declarations

Competing interests

The authors certify that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

In this appendix, we present additional experimental results for our framework.

Table 13 Analysis of different views in the MV-VQA framework: single-view projection of skeleton (SVPS), two-view projection of skeleton (TVPS), and three-view projection of skeleton (THVPS)
Table 14 Details of the parameters used in the MV-VQA framework
Table 15 Hand detection analysis of the proposed models using different vision and language models. The results of our final model, Model 6, are shown in bold
Table 16 Confusion matrix of the proposed model on isoGD dataset
Table 17 Confusion matrix of the proposed model on RKS-PERSIANSIGN dataset
Table 18 Confusion matrix of the proposed model on First-Person dataset

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. A deep co-attentive hand-based video question answering framework using multi-view skeleton. Multimed Tools Appl 82, 1401–1429 (2023). https://doi.org/10.1007/s11042-022-13573-w

