Abstract
Visual Question Answering (VQA) expands on the Turing Test by requiring the ability to answer questions about visual content. Current efforts in VQA, however, still do not fully consider whether a question about visual content is relevant and, if it is not, how best to edit it to make it answerable. Question relevance has so far been considered only at the level of a whole question, using binary classification and without the capability to edit a question to make it grounded and intelligible. The only exception is our prior research into question part relevance, which supports relevance determination and editing based on object nouns. This paper extends that work on object relevance to determine the relevance of a question's action and leverages this capability to edit an irrelevant question to make it relevant. Practical applications of such a capability include answering biometric-related queries across a set of images, covering both people and their actions (behavioral biometrics). The feasibility of our approach is demonstrated using Context-Collaborative VQA (C2VQA) Action/Relevance/Edit (ARE). Our results show that the proposed approach outperforms all other models by a significant margin on the novel tasks of question action relevance (QAR) and question action editing (QAE). The ultimate goal for future research is to address full-fledged W5+ inquiries (What, Where, When, Why, Who, and How) that are grounded to and reference video, using both nouns and verbs in a collaborative, context-aware fashion.
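The QAR/QAE pipeline the abstract describes can be illustrated with a minimal sketch: decide whether the action named in a question actually occurs in the image, and if not, rewrite the question around a detected action. All names here (the toy verb lexicon, the detected-action set, the replacement policy) are illustrative assumptions; the paper's actual models are learned, not rule-based.

```python
from typing import Optional

# Hypothetical toy lexicon of action verbs; a real system would use a
# learned representation rather than a hand-written list.
ACTION_LEXICON = {"riding", "eating", "holding", "throwing", "sitting"}

def extract_action(question: str) -> Optional[str]:
    """Return the first token of the question found in the action lexicon."""
    for token in question.lower().rstrip("?").split():
        if token in ACTION_LEXICON:
            return token
    return None

def is_relevant(question: str, detected_actions: set) -> bool:
    """QAR sketch: the question is action-relevant only if its action
    matches an action detected in the image."""
    action = extract_action(question)
    return action is not None and action in detected_actions

def edit_question(question: str, detected_actions: set) -> str:
    """QAE sketch: replace an irrelevant action with one that was
    actually detected, leaving relevant questions unchanged."""
    action = extract_action(question)
    if action is None or action in detected_actions or not detected_actions:
        return question
    replacement = sorted(detected_actions)[0]  # deterministic stand-in choice
    return question.replace(action, replacement)

detected = {"eating", "sitting"}          # assumed image-level action detections
question = "What is the man riding?"
print(is_relevant(question, detected))    # False: no riding in the image
print(edit_question(question, detected))  # "What is the man eating?"
```

The sketch makes the two-stage structure explicit: relevance is a binary decision on the question's action, and editing is a targeted substitution that preserves the rest of the question.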
Acknowledgements
We appreciate assistance from George Mason University, which provided access to GPU-based servers. These experiments were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA (http://orc.gmu.edu).
Cite this article
Toor, A.S., Wechsler, H. & Nappi, M. Question action relevance and editing for visual question answering. Multimed Tools Appl 78, 2921–2935 (2019). https://doi.org/10.1007/s11042-018-6097-z