Abstract
Visual Question Answering (VQA) is a challenging task: a system must perform core computer vision tasks, grasp the contextual relationship between a posed question and its accompanying image, and produce an accurate response. However, many VQA models tend to prioritize linguistic information over the rich contextual cues embedded in images, which are essential for answering questions comprehensively. To mitigate this limitation, this paper investigates image captioning, a technique that generates one or more descriptive sentences about the content of an image, as a means of improving answer quality in VQA through a language-centric approach. To this end, we propose two model variants, BLIP-C and BLIP-CL, which aggregate caption-grounded and vision-grounded representations to enrich the contextual question representation and thereby improve answer generation. Experimental results on a public dataset demonstrate that utilizing captions significantly improves the accuracy and detail of answers compared with the baseline model.
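To make the aggregation idea concrete, the sketch below shows one minimal way to enrich a question representation with both caption-grounded and vision-grounded views. This is our own illustrative PyTorch sketch, not the authors' BLIP-C/BLIP-CL implementation: the module name CaptionAugmentedFusion, the gated aggregation, and all dimensions are assumptions chosen for clarity.

import torch
import torch.nn as nn

class CaptionAugmentedFusion(nn.Module):
    """Illustrative sketch: enrich question tokens with caption and image context.

    The gating scheme and dimensions here are assumptions for illustration;
    the paper's BLIP-C/BLIP-CL internals may differ.
    """
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Cross-attention: question tokens attend to generated caption tokens.
        self.q_to_caption = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: question tokens attend to image patch features.
        self.q_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate that weighs the caption-grounded view against the vision-grounded one.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, q, cap, img):
        # q:   (B, Lq, D) question token features
        # cap: (B, Lc, D) caption token features
        # img: (B, Np, D) image patch features
        cap_grounded, _ = self.q_to_caption(q, cap, cap)
        vis_grounded, _ = self.q_to_image(q, img, img)
        g = self.gate(torch.cat([cap_grounded, vis_grounded], dim=-1))
        # Convex combination of the two grounded views, added residually
        # to the original question representation.
        return q + g * cap_grounded + (1 - g) * vis_grounded

if __name__ == "__main__":
    fusion = CaptionAugmentedFusion()
    q = torch.randn(2, 12, 768)     # question tokens
    cap = torch.randn(2, 20, 768)   # generated caption tokens
    img = torch.randn(2, 196, 768)  # ViT-style patch features
    print(fusion(q, cap, img).shape)  # torch.Size([2, 12, 768])

The per-token gate lets the model decide, for each question token, how much to rely on caption evidence versus raw visual evidence; the paper's two variants may aggregate these views differently.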
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Truong, K.A.T., Tran, T.T., Van Thi Nguyen, C., Le, D.T. (2024). Enhancing Visual Question Answering with Generated Image Caption. In: Hà, M.H., Zhu, X., Thai, M.T. (eds.) Computational Data and Social Networks. CSoNet 2023. Lecture Notes in Computer Science, vol. 14479. Springer, Singapore. https://doi.org/10.1007/978-981-97-0669-3_2
DOI: https://doi.org/10.1007/978-981-97-0669-3_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0668-6
Online ISBN: 978-981-97-0669-3
eBook Packages: Computer Science, Computer Science (R0)