Enhancing Visual Question Answering with Generated Image Caption

  • Conference paper

Computational Data and Social Networks (CSoNet 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14479)

Abstract

Visual Question Answering (VQA) is a formidable challenge that requires computer systems to proficiently perform essential computer vision tasks, grasp the intricate contextual relationship between a posed question and the accompanying image, and produce precise responses. However, a recurring issue in many VQA models is their tendency to prioritize language-based information over the rich contextual cues embedded in images, which are pivotal for answering questions comprehensively. To mitigate this limitation, this paper investigates the utility of image captioning, a technique that generates one or more descriptive sentences about the content of an image, as a means to improve answer quality within the VQA framework using a language-centric approach. Towards this goal, we propose two model variants, BLIP-C and BLIP-CL, which aggregate caption-grounded and vision-grounded representations to enrich the contextual question representation and thereby improve answer generation. Experimental results on a public dataset demonstrate that utilizing captions significantly improves the accuracy and detail of answers compared to the baseline model.
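
To make the aggregation idea concrete, the following is a minimal PyTorch sketch of one plausible way to enrich a question representation with both vision-grounded (image) and caption-grounded features via cross-attention and a learned gate. The module name CaptionAugmentedFusion, the chosen dimensions, and the gating design are illustrative assumptions for exposition; they are not the exact BLIP-C or BLIP-CL architecture described in the paper.

```python
# Illustrative sketch only: NOT the paper's BLIP-C/BLIP-CL implementation.
# It shows one generic way to enrich question token representations with
# image features and generated-caption features, as the abstract describes
# at a high level.
import torch
import torch.nn as nn


class CaptionAugmentedFusion(nn.Module):  # hypothetical module name
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Question tokens attend to image patch features (vision-grounded).
        self.q_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Question tokens attend to caption token features (caption-grounded).
        self.q_to_caption = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned gate decides how much caption vs. image context to mix in.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, image, caption):
        # question: (B, Lq, dim), image: (B, Li, dim), caption: (B, Lc, dim)
        img_ctx, _ = self.q_to_image(question, image, image)
        cap_ctx, _ = self.q_to_caption(question, caption, caption)
        g = self.gate(torch.cat([img_ctx, cap_ctx], dim=-1))
        fused = g * cap_ctx + (1.0 - g) * img_ctx
        # Residual connection keeps the original question semantics.
        return self.norm(question + fused)


if __name__ == "__main__":
    fusion = CaptionAugmentedFusion()
    q = torch.randn(2, 12, 768)   # question token embeddings
    v = torch.randn(2, 197, 768)  # image (e.g. ViT patch) features
    c = torch.randn(2, 20, 768)   # embeddings of the generated caption
    print(fusion(q, v, c).shape)  # torch.Size([2, 12, 768])
```

In the proposed models, such a fusion step would operate on representations produced by a pretrained BLIP backbone (question tokens, image patch features, and tokens of the generated caption); the sketch above isolates only the aggregation step.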

Notes

  1. https://vlsp.org.vn/vlsp2022/eval/evjvqa

Author information

Correspondence to Kieu-Anh Thi Truong.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Truong, K.A.T., Tran, T.T., Van Thi Nguyen, C., Le, D.T. (2024). Enhancing Visual Question Answering with Generated Image Caption. In: Hà, M.H., Zhu, X., Thai, M.T. (eds) Computational Data and Social Networks. CSoNet 2023. Lecture Notes in Computer Science, vol 14479. Springer, Singapore. https://doi.org/10.1007/978-981-97-0669-3_2

  • DOI: https://doi.org/10.1007/978-981-97-0669-3_2

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0668-6

  • Online ISBN: 978-981-97-0669-3

  • eBook Packages: Computer Science, Computer Science (R0)
