Enhancing Visual Question Answering with Generated Image Caption

  • Conference paper

Computational Data and Social Networks (CSoNet 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14479)

Abstract

Visual Question Answering (VQA) is a formidable challenge that requires computer systems to proficiently perform essential computer vision tasks, grasp the intricate contextual relationship between a posed question and the accompanying image, and produce precise responses. However, a recurring issue in many VQA models is their tendency to prioritize language-based information over the rich contextual cues embedded in images, which are pivotal for answering questions comprehensively. To mitigate this limitation, this paper investigates the utility of image captioning, a technique that generates one or more descriptive sentences about the content of an image, as a means to improve answer quality within the VQA framework using a language-centric approach. Towards this goal, we propose two model variants, BLIP-C and BLIP-CL, which aggregate caption-grounded and vision-grounded representations to enrich the contextual question representation and thereby improve answer generation. Experimental results on a public dataset demonstrate that utilizing captions significantly improves the accuracy and detail of answers compared to the baseline model.
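
To make the aggregation idea concrete, the following is a minimal PyTorch sketch of one plausible way to enrich a question representation with both vision-grounded (image) and caption-grounded features via cross-attention and a learned gate. The module name CaptionAugmentedFusion, the chosen dimensions, and the gating design are illustrative assumptions for exposition; they are not the exact BLIP-C or BLIP-CL architecture described in the paper.

```python
# Illustrative sketch only: NOT the paper's BLIP-C/BLIP-CL implementation.
# It shows one generic way to enrich question token representations with
# image features and generated-caption features, as the abstract describes
# at a high level.
import torch
import torch.nn as nn


class CaptionAugmentedFusion(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Question tokens attend to image patch features (vision-grounded).
        self.q_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Question tokens attend to caption token features (caption-grounded).
        self.q_to_caption = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate decides how much caption vs. image context to mix in.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, image, caption):
        # question: (B, Lq, dim), image: (B, Li, dim), caption: (B, Lc, dim)
        img_ctx, _ = self.q_to_image(question, image, image)
        cap_ctx, _ = self.q_to_caption(question, caption, caption)
        g = self.gate(torch.cat([img_ctx, cap_ctx], dim=-1))
        fused = g * cap_ctx + (1.0 - g) * img_ctx
        # Residual connection keeps the original question semantics.
        return self.norm(question + fused)


if __name__ == "__main__":
    fusion = CaptionAugmentedFusion()
    q = torch.randn(2, 12, 768)   # question token embeddings
    v = torch.randn(2, 197, 768)  # image (e.g. ViT patch) features
    c = torch.randn(2, 20, 768)   # embeddings of the generated caption
    print(fusion(q, v, c).shape)  # torch.Size([2, 12, 768])
```

In the proposed models, such a fusion step would operate on representations produced by a pretrained BLIP backbone (question tokens, image patch features, and tokens of the generated caption); the sketch above isolates only the aggregation step.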

Notes

  1. https://vlsp.org.vn/vlsp2022/eval/evjvqa

Author information

Correspondence to Kieu-Anh Thi Truong.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Truong, K.A.T., Tran, T.T., Van Thi Nguyen, C., Le, D.T. (2024). Enhancing Visual Question Answering with Generated Image Caption. In: Hà, M.H., Zhu, X., Thai, M.T. (eds) Computational Data and Social Networks. CSoNet 2023. Lecture Notes in Computer Science, vol 14479. Springer, Singapore. https://doi.org/10.1007/978-981-97-0669-3_2

  • DOI: https://doi.org/10.1007/978-981-97-0669-3_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0668-6

  • Online ISBN: 978-981-97-0669-3

  • eBook Packages: Computer Science, Computer Science (R0)
