Abstract
In this paper, we present a novel hand-based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), that takes RGB videos as input and employs the Single Shot Detector (SSD), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and a co-attention mechanism. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation that extracts multi-view features from the hand using a combination of five 3DCNNs and one LSTM network. In the language block, we use the BERT model to obtain the question embedding. Finally, we apply a co-attention mechanism to the vision and language features to predict the final answer. We propose, for the first time, a hand-based Video-QA framework that combines multi-view hand skeleton features with the question embedding and a co-attention mechanism. Our framework can process an arbitrary number of questions in the dataset annotations. The framework has several possible application domains; here, we apply it to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we perform a step-by-step analysis of the impact of hand detection and the multi-view hand skeleton on model performance. Evaluation results on five datasets, including two VideoQA datasets, two dynamic hand gesture datasets, and one hand action recognition dataset, show that MV-VQA outperforms state-of-the-art alternatives.
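The co-attention step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified, parameter-free variant in which an affinity matrix of dot products links each vision time step to each question token, and max-pooling followed by a softmax gives the attention weights on each side. The function names (`co_attention`, `weighted_sum`) and the toy feature vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Weighted sum of vectors, one weight per vector."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def co_attention(vision, language):
    """Simplified co-attention (assumption: no learned projections).

    vision:   list of per-time-step feature vectors (e.g. from 3DCNN + LSTM)
    language: list of per-token question embeddings (e.g. from BERT)
    Both sides must share the same feature dimension.
    """
    # affinity[i][j]: similarity between vision step i and question token j
    affinity = [[dot(v, q) for q in language] for v in vision]
    # Each vision step is scored by its best-matching question token.
    vision_weights = softmax([max(row) for row in affinity])
    # Each question token is scored by its best-matching vision step.
    language_weights = softmax(
        [max(affinity[i][j] for i in range(len(vision)))
         for j in range(len(language))]
    )
    attended_vision = weighted_sum(vision_weights, vision)
    attended_language = weighted_sum(language_weights, language)
    # Fuse by concatenation; a classifier over answers would follow.
    fused = attended_vision + attended_language
    return fused, vision_weights, language_weights


# Toy inputs: three vision steps and two question tokens, dimension 2.
vision = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
language = [[2.0, 0.0], [0.0, 0.1]]
fused, vision_weights, language_weights = co_attention(vision, language)
```

In a full model the fused vector would feed a classifier over candidate answers, and the affinity matrix would normally use learned projections rather than raw dot products.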
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgments
This work has been partially supported by the Spanish project PID2019-105093GB-I00, ICREA under the ICREA Academia programme, and High Intelligent Solution (HIS) company in Iran.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Competing interests
The authors certify that they have no conflict of interest.
Appendix A
Here, we present some additional experimental results on our framework.
About this article
Cite this article
Rastgoo, R., Kiani, K. & Escalera, S. A deep co-attentive hand-based video question answering framework using multi-view skeleton. Multimed Tools Appl 82, 1401–1429 (2023). https://doi.org/10.1007/s11042-022-13573-w
DOI: https://doi.org/10.1007/s11042-022-13573-w